
feat: add allow_large_results to peek #1448


Merged

merged 13 commits into main from large_results_update_huanc on Mar 5, 2025

Conversation

Collaborator

@Genesis929 Genesis929 commented Mar 4, 2025

  1. Added allow_large_results to peek.
  2. Switched allow_large_results to use query_and_wait.
  3. Added warning messages to metrics and the allow_large_results option, to warn users that we won't have metrics and won't sample in this mode.
  4. Skip sampling when allow_large_results=False, since we can't get the table size. (A usage sketch follows this list.)
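A minimal usage sketch of the new parameter, assuming this PR's API shape; the project and table names are illustrative:

```python
import bigframes.pandas as bpd

# Hypothetical table; any readable BigQuery table works here.
df = bpd.read_gbq("my-project.my_dataset.my_table")

# With allow_large_results=False, the preview runs through query_and_wait
# (no destination table, no job-based metrics, no size-based sampling).
preview = df.peek(5, allow_large_results=False)
print(preview)
```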

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Mar 4, 2025
@Genesis929 Genesis929 added the owlbot:run Add this label to trigger the Owlbot post processor. label Mar 4, 2025
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Mar 4, 2025
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Mar 4, 2025
@Genesis929 Genesis929 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 4, 2025
@Genesis929 Genesis929 marked this pull request as ready for review March 4, 2025 08:23
@Genesis929 Genesis929 requested review from a team as code owners March 4, 2025 08:23
@Genesis929 Genesis929 requested a review from GarrettWu March 4, 2025 08:23
@bigframes-bot bigframes-bot removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 4, 2025
@Genesis929 Genesis929 requested a review from tswast March 4, 2025 08:23
@Genesis929 Genesis929 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 4, 2025
@bigframes-bot bigframes-bot removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 4, 2025
@Genesis929 Genesis929 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 4, 2025
@bigframes-bot bigframes-bot removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 4, 2025
Comment on lines 40 to 41
"To prevent downloading excessive data, it is recommended to limit the data "
"fetched with methods like .head() or .sample() before proceeding with downloads."
Collaborator

Might be better to point folks at peek().

Suggested change
"To prevent downloading excessive data, it is recommended to limit the data "
"fetched with methods like .head() or .sample() before proceeding with downloads."
"To prevent downloading excessive data, it is recommended to use the peek() method, "
"or limit the data with methods like .head() or .sample() before proceeding with downloads."

Collaborator Author

Updated

Comment on lines 264 to 265
if not value:
    warnings.warn(LARGE_RESULTS_WARNING_MESSAGE, UserWarning)
Collaborator

I don't think the setter makes the most sense for this warning. We want to encourage people to use this, not discourage them. I'd rather see it when they call a function that initiates query_and_wait and sampling has been requested.

Collaborator Author

Removed

@@ -35,6 +35,12 @@

UNKNOWN_LOCATION_MESSAGE = "The location '{location}' is set to an unknown value. Did you mean '{possibility}'?"

LARGE_RESULTS_WARNING_MESSAGE = (
"Sampling is disabled because 'allow_large_results' is set to False. "
Collaborator

Isn't the "head" style sampling still supported, just not sampling by number of bytes?

Collaborator Author

For now, the "head" sampling is size-based like the others; users cannot decide how many rows they want to download.

Comment on lines 630 to 632
# Since we cannot acquire the table size without a query_job,
# we skip the sampling.
fraction = 2
Collaborator

Here seems like a good place for the warning.

Collaborator Author

Updated. It just feels a little weird that the warning is shown when people have already started downloading without sampling.
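A hedged sketch of where the warning lands after this update, assuming the constants and the fraction-based sampling quoted above; the surrounding names are illustrative, not the exact final diff:

```python
import warnings

if not allow_large_results:
    # Without a query job we cannot read the destination table size,
    # so size-based sampling is skipped and the user is warned up front.
    warnings.warn(LARGE_RESULTS_WARNING_MESSAGE, UserWarning)
    # A fraction > 1 means "download everything, no sampling".
    fraction = 2
```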

Comment on lines 332 to 333
assert query_job is not None and query_job.destination is not None
table = self.bqclient.get_table(query_job.destination)
Collaborator

I'm curious why we don't use the destination argument for this instead of fetching from the job.

Collaborator Author

Yes, using destination seems more straightforward. Updated, thanks.
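A minimal sketch of the simplification, assuming `destination` is the `bigquery.TableReference` already passed into this method:

```python
# Read table metadata from the explicit destination rather than
# round-tripping through the finished query job.
table = self.bqclient.get_table(destination)
```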

job_config = bigquery.QueryJobConfig()
# Use explicit destination to avoid 10GB limit of temporary table
if use_explicit_destination:
    destination_table = self.storage_manager.create_temp_table(
Collaborator

I don't think it needs to be a full table, right? We should be able to avoid the tables.create call. _random_table seems more appropriate:

def _random_table(self, skip_cleanup: bool = False) -> bigquery.TableReference:

Collaborator Author

I assume the main reason is that we need to set an expiration time for the temp table.

Collaborator

Ah, good point.
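For context, a hedged sketch of the expiration point: with google-cloud-bigquery, a TTL is attached to a concrete Table at creation time, which is why the tables.create call (via create_temp_table) is hard to avoid. Table name and TTL below are illustrative:

```python
from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

client = bigquery.Client()

# A bare TableReference (what _random_table returns) cannot carry an
# expiration; a Table object created via tables.create can.
table = bigquery.Table("my-project.my_dataset.bqdf_tmp_example")
table.expires = datetime.now(timezone.utc) + timedelta(days=1)
client.create_table(table)
```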

Comment on lines 256 to 259
# A direct call to to_pandas uses the global default setting
# (allow_large_results=True); the table has a 'bqdf' prefix.
scalars_df_index.to_pandas()
assert scalars_df_index._query_job.destination.table_id.startswith("bqdf")
Collaborator

Seems like it would still be useful to have such a test. Why are we removing this part?

Collaborator Author

It was removed because it's not the main thing we test; added it back.

Comment on lines 682 to 683
# The metrics won't be updated when we call query_and_wait.
assert session._metrics.execution_count == execution_count
Collaborator

Seems like we could still track the number of queries we execute, right? Shouldn't need a job object for that.

Collaborator Author

@Genesis929 Genesis929 Mar 4, 2025

Updated the metric to count the execution when query_job is None.

Tests updated.
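A hedged sketch of the metrics change, assuming an ExecutionMetrics-style counter; the real bigframes class and fields may differ:

```python
class ExecutionMetrics:
    """Illustrative stand-in for the session's metrics object."""

    def __init__(self) -> None:
        self.execution_count = 0

    def count_job_stats(self, query_job=None) -> None:
        # query_and_wait surfaces no job object, but a query still ran,
        # so the execution is counted either way.
        self.execution_count += 1
        if query_job is None:
            return  # no job stats (bytes processed, slot time) to record
        # ... record stats from query_job here when a job exists
```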

@Genesis929 Genesis929 requested a review from tswast March 5, 2025 01:04
    api_timeout=timeout,
)
if metrics is not None:
    metrics.count_job_stats()
Collaborator

https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query response includes totalBytesProcessed. Let's create a follow-up issue to include those metrics, too.

Collaborator

Filed internal issue 400961399
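For reference, a hedged sketch of where those metrics could come from: the jobs.query response behind query_and_wait is surfaced on the returned RowIterator. The query text here is illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()

rows = client.query_and_wait(
    "SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 10"
)
# totalBytesProcessed from the jobs.query response; may be None,
# e.g. for cached results.
print(rows.total_bytes_processed)
```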

@tswast tswast enabled auto-merge (squash) March 5, 2025 17:19
@tswast tswast merged commit 67487b9 into main Mar 5, 2025
23 checks passed
@tswast tswast deleted the large_results_update_huanc branch March 5, 2025 18:35