Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: use the first page a results when query(api_method="QUERY") #1723

Merged
merged 8 commits into from
Nov 21, 2023

Conversation

tswast
Copy link
Contributor

@tswast tswast commented Nov 15, 2023

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

This is a restoration of part of #374

TODO:

  • Unit tests to verify that no extra jobs.getQueryResults or jobs.get calls happen when iterating over rows that are fully returned from jobs.query.
  • Ensure to_pandas and to_arrow work with this code path.

Closes #589 🦕

@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery API. labels Nov 15, 2023
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Nov 15, 2023
@tswast tswast marked this pull request as ready for review November 15, 2023 22:54
@tswast tswast requested review from a team as code owners November 15, 2023 22:54
@tswast tswast requested a review from farhan0102 November 15, 2023 22:54
@shollyman shollyman requested review from chalmerlowe and Linchin and removed request for farhan0102 November 15, 2023 22:56
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Nov 16, 2023
@Linchin
Copy link
Contributor

Linchin commented Nov 16, 2023

I'm not sure I understand. Could you explain why doing this will improve query performance, and only applies when query(api_method="QUERY")?

@tswast
Copy link
Contributor Author

tswast commented Nov 16, 2023

Could you explain why doing this will improve query performance, and only applies when query(api_method="QUERY")?

Great question @Linchin.

There are two APIs for issuing a query in BigQuery: jobs.insert and jobs.query. There are some key differences between these APIs.

  • jobs.insert: creates and returns a query job unless it's considered a "bad request" or insufficient permissions or similar. In general this API returns relatively quickly, but only includes job metadata, no query results.

    When running a query this way, we call jobs.getQueryResults with the page size set to 0 results. This is purely to find out when the job finishes. The reason for 0 results is that for large rows, getQueryResults can actually hang well past the expected 10 seconds, causing issues such as simple query hangs in 0.14.1, works in 0.13.3 python-bigquery-pandas#343.

    Once the query has finished, we call jobs.get to refresh the job metadata. This ensures we have a destination table, which can be used with tabledata.list or the BQ Storage Read API to fetch the results. Many years ago, it used to be much faster to use tabledata.list instead of jobs.getQueryResults to download data, but that changed in late 2019 when many optimizations were made to getQueryResults on the backend. As of perf: use jobs.getQueryResults to download result sets #363, we use either jobs.getQueryResults or the BQ Storage Read API to download results.

    In general, this path works best for large query results, which we expect from folks using pandas, for example.

    Edit: To clarify, after this PR, I think it'd be OK to use jobs.query even for pandas, but since the cached first page of results increases memory usage (increased memory usage in 2.4.0 #394), I don't think it makes sense to change the default. But definitely something to consider since users who want less memory usage could explicitly set api_method="INSERT".

  • jobs.query: starts the query and waits. If it finishes within about 10 seconds, it returns the results in that same API call. If the query has not finished, we use the job ID from the response to call jobs.getQueryResults. The subsequent steps are the same as with jobs.insert.

    The key difference here is that jobs.query can actually return results in the same request/response as the one to start the query, whereas jobs.insert always requires a subsequent API call to fetch the results.

This PR does a few things: (1) if the job has finished, don't make the unnecessary call to getQueryResults (2), actually use the rows returned from jobs.query in RowIterator, and (3) avoid using the BQ Storage Read API if we know only 1 or 2 more API requests are needed to download the rest of the data.

# This also requires updates to `to_dataframe` and the DB API connector
# so that they don't try to read from a destination table if all the
# results are present.
query_job._query_results = google.cloud.bigquery.query._QueryResults(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change mean we also load the query results when job is complete, when query(api_method="INSERT")?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I just double-checked that this method is only called from query_jobs_query, so it won't affect when api_method="INSERT"

Copy link
Contributor

@Linchin Linchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for helping me understand this PR :)

@tswast tswast merged commit 6290517 into main Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide support for synchronous queries through the v2/projects/{projectId}/queries endpoint
2 participants