perf: use the first page a results when `query(api_method="QUERY")` #1723

tswast · 2023-11-15T20:21:44Z

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

This is a restoration of part of #374

TODO:

Unit tests to verify that no extra jobs.getQueryResults or jobs.get calls happen when iterating over rows that are fully returned from jobs.query.
Ensure to_pandas and to_arrow work with this code path.

Closes #589 🦕

google/cloud/bigquery/table.py

…_is_completely_cached

Linchin · 2023-11-16T21:24:09Z

I'm not sure I understand. Could you explain why doing this will improve query performance, and only applies when query(api_method="QUERY")?

tswast · 2023-11-16T22:24:24Z

Could you explain why doing this will improve query performance, and only applies when query(api_method="QUERY")?

Great question @Linchin.

There are two APIs for issuing a query in BigQuery: jobs.insert and jobs.query. There are some key differences between these APIs.

jobs.insert: creates and returns a query job unless it's considered a "bad request" or insufficient permissions or similar. In general this API returns relatively quickly, but only includes job metadata, no query results.

When running a query this way, we call jobs.getQueryResults with the page size set to 0 results. This is purely to find out when the job finishes. The reason for 0 results is that for large rows, getQueryResults can actually hang well past the expected 10 seconds, causing issues such as simple query hangs in 0.14.1, works in 0.13.3 python-bigquery-pandas#343.

Once the query has finished, we call jobs.get to refresh the job metadata. This ensures we have a destination table, which can be used with tabledata.list or the BQ Storage Read API to fetch the results. Many years ago, it used to be much faster to use tabledata.list instead of jobs.getQueryResults to download data, but that changed in late 2019 when many optimizations were made to getQueryResults on the backend. As of perf: use jobs.getQueryResults to download result sets #363, we use either jobs.getQueryResults or the BQ Storage Read API to download results.

In general, this path works best for large query results, which we expect from folks using pandas, for example.

Edit: To clarify, after this PR, I think it'd be OK to use jobs.query even for pandas, but since the cached first page of results increases memory usage (increased memory usage in 2.4.0 #394), I don't think it makes sense to change the default. But definitely something to consider since users who want less memory usage could explicitly set api_method="INSERT".
jobs.query: starts the query and waits. If it finishes within about 10 seconds, it returns the results in that same API call. If the query has not finished, we use the job ID from the response to call jobs.getQueryResults. The subsequent steps are the same as with jobs.insert.

The key difference here is that jobs.query can actually return results in the same request/response as the one to start the query, whereas jobs.insert always requires a subsequent API call to fetch the results.

This PR does a few things: (1) if the job has finished, don't make the unnecessary call to getQueryResults (2), actually use the rows returned from jobs.query in RowIterator, and (3) avoid using the BQ Storage Read API if we know only 1 or 2 more API requests are needed to download the rest of the data.

…_is_completely_cached

Linchin · 2023-11-20T22:27:03Z

google/cloud/bigquery/_job_helpers.py

-        # This also requires updates to `to_dataframe` and the DB API connector
-        # so that they don't try to read from a destination table if all the
-        # results are present.
+        query_job._query_results = google.cloud.bigquery.query._QueryResults(


Does this change mean we also load the query results when job is complete, when query(api_method="INSERT")?

Good question. I just double-checked that this method is only called from query_jobs_query, so it won't affect when api_method="INSERT"

Linchin

Thank you for helping me understand this PR :)

perf: use the first page a results when query(api_method="QUERY")

b8c583a

product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery API. labels Nov 15, 2023

tswast mentioned this pull request Nov 15, 2023

feat: add Client.query_and_wait which directly returns a RowIterator of results #1722

Merged

4 tasks

add tests

6a8059d

product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Nov 15, 2023

tswast marked this pull request as ready for review November 15, 2023 22:54

tswast requested review from a team as code owners November 15, 2023 22:54

tswast requested a review from farhan0102 November 15, 2023 22:54

shollyman requested review from chalmerlowe and Linchin and removed request for farhan0102 November 15, 2023 22:56

tswast added 2 commits November 16, 2023 09:28

respect max_results with cached page

1f0e38e

respect page_size, also avoid bqstorage if almost fully downloaded

4401725

product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Nov 16, 2023

tswast mentioned this pull request Nov 16, 2023

Provide support for synchronous queries through the v2/projects/{projectId}/queries endpoint #589

Closed

skip true test if bqstorage not installed

d078941

tswast commented Nov 16, 2023

View reviewed changes

google/cloud/bigquery/table.py Show resolved Hide resolved

tswast commented Nov 16, 2023

View reviewed changes

google/cloud/bigquery/table.py Show resolved Hide resolved

tswast added 2 commits November 16, 2023 15:02

Merge remote-tracking branch 'origin/main' into issue589-RowIterator.…

660aa76

…_is_completely_cached

coverage

476bcd7

Merge remote-tracking branch 'origin/main' into issue589-RowIterator.…

05d6a3e

…_is_completely_cached

Linchin reviewed Nov 20, 2023

View reviewed changes

Linchin approved these changes Nov 20, 2023

View reviewed changes

tswast merged commit 6290517 into main Nov 21, 2023

tswast deleted the issue589-RowIterator._is_completely_cached branch November 21, 2023 15:13

release-please bot mentioned this pull request Nov 18, 2023

chore(main): release 3.14.0 #1709

Merged

This was referenced Dec 5, 2023

December 04, 2023 kitta65/bq-extension-vscode#257

Closed

December 04, 2023 kitta65/prettier-plugin-bq#264

Closed

December 04, 2023 kitta65/bq2cst#274

Closed

r1b mentioned this pull request Dec 6, 2023

dbapi: Skip storage client fetch when results cached #1745

Closed

This was referenced Dec 19, 2023

December 18, 2023 kitta65/bq-extension-vscode#262

Closed

December 18, 2023 kitta65/prettier-plugin-bq#269

Closed

December 18, 2023 kitta65/bq2cst#279

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: use the first page a results when `query(api_method="QUERY")` #1723

perf: use the first page a results when `query(api_method="QUERY")` #1723

tswast commented Nov 15, 2023 •

edited

Loading

Linchin commented Nov 16, 2023

tswast commented Nov 16, 2023 •

edited

Loading

Linchin Nov 20, 2023

tswast Nov 21, 2023

Linchin left a comment

perf: use the first page a results when query(api_method="QUERY") #1723

perf: use the first page a results when query(api_method="QUERY") #1723

Conversation

tswast commented Nov 15, 2023 • edited Loading

Linchin commented Nov 16, 2023

tswast commented Nov 16, 2023 • edited Loading

Linchin Nov 20, 2023

Choose a reason for hiding this comment

tswast Nov 21, 2023

Choose a reason for hiding this comment

Linchin left a comment

Choose a reason for hiding this comment

perf: use the first page a results when `query(api_method="QUERY")` #1723

perf: use the first page a results when `query(api_method="QUERY")` #1723

tswast commented Nov 15, 2023 •

edited

Loading

tswast commented Nov 16, 2023 •

edited

Loading