feat: add allow_large_results to peek #1448
Conversation
"To prevent downloading excessive data, it is recommended to limit the data " | ||
"fetched with methods like .head() or .sample() before proceeding with downloads." |
Might be better to point folks at peek().
"To prevent downloading excessive data, it is recommended to limit the data " | |
"fetched with methods like .head() or .sample() before proceeding with downloads." | |
"To prevent downloading excessive data, it is recommended to use the peek() method, " | |
"or limit the data with methods like .head() or .sample() before proceeding with downloads." |
Updated
if not value:
    warnings.warn(LARGE_RESULTS_WARNING_MESSAGE, UserWarning)
I don't think the setter makes the most sense for this warning. We want to encourage people to use this, not discourage them. I'd rather see it when they call a function that initiates query_and_wait and sampling has been requested.
Removed
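The suggestion above can be sketched in isolation. Everything except `LARGE_RESULTS_WARNING_MESSAGE` (which appears in the PR diff below) is hypothetical: the function name and its parameters are placeholders for whatever entry point actually initiates query_and_wait, not the repo's real API.

```python
import warnings

# Message text taken from the PR diff.
LARGE_RESULTS_WARNING_MESSAGE = (
    "Sampling is disabled because 'allow_large_results' is set to False. "
)


def execute(allow_large_results: bool, sampling_requested: bool) -> None:
    """Hypothetical execution entry point (the code path that initiates
    query_and_wait). Warning here, rather than in the option's setter,
    means users see it only when it actually affects their call."""
    if not allow_large_results and sampling_requested:
        warnings.warn(LARGE_RESULTS_WARNING_MESSAGE, UserWarning)
    # ... run the query via query_and_wait ...
```

This way, merely flipping the option never warns; only an execution that would have sampled does.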
@@ -35,6 +35,12 @@

UNKNOWN_LOCATION_MESSAGE = "The location '{location}' is set to an unknown value. Did you mean '{possibility}'?"

LARGE_RESULTS_WARNING_MESSAGE = (
    "Sampling is disabled because 'allow_large_results' is set to False. "
Isn't the "head" style sampling still supported, just not sampling by number of bytes?
For now, the "head" sampling is size-based like the others; the user cannot decide how many rows they want to download.
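The reply above can be made concrete with a small sketch of what "size-based" means: the row count is derived from a byte budget rather than chosen by the user. The helper name and the uniform-row-size assumption are both hypothetical, not the repo's actual implementation.

```python
def head_rows_for_budget(
    table_num_rows: int, table_num_bytes: int, max_download_bytes: int
) -> int:
    """Estimate how many head rows fit within the byte budget, assuming
    roughly uniform row size. Illustrates why the user cannot pick an
    exact row count: it falls out of the size limit."""
    if table_num_rows == 0 or table_num_bytes <= max_download_bytes:
        return table_num_rows  # the whole table fits
    bytes_per_row = table_num_bytes / table_num_rows
    return int(max_download_bytes / bytes_per_row)
```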
bigframes/core/blocks.py
Outdated
# Since we cannot acquire the table size without a query_job,
# we skip the sampling.
fraction = 2
Here seems like a good place for the warning.
Updated, though it feels a little odd that the warning is shown when people have already started downloading without sampling.
bigframes/session/executor.py
Outdated
assert query_job is not None and query_job.destination is not None
table = self.bqclient.get_table(query_job.destination)
I'm curious why we don't use the destination argument for this instead of fetching from the job.
Yes, using destination is more straightforward. Updated, thanks.
job_config = bigquery.QueryJobConfig()
# Use explicit destination to avoid 10GB limit of temporary table
if use_explicit_destination:
    destination_table = self.storage_manager.create_temp_table(
I don't think it needs to be a full table, right? We should be able to avoid the tables.create call. _random_table seems more appropriate (def _random_table(self, skip_cleanup: bool = False) -> bigquery.TableReference:).
I assume the main reason is that we need to set an expiration time for the temp table.
Ah, good point.
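The trade-off above can be sketched minimally: a tables.create call (unlike the bare TableReference that _random_table returns) lets an expiration be attached, so orphaned temp tables are cleaned up server-side. The TTL value and helper below are assumptions for illustration, not the repo's actual defaults.

```python
import datetime

# Assumed TTL; the real default in the repo may differ.
DEFAULT_EXPIRATION = datetime.timedelta(days=1)


def temp_table_expiration(now: datetime.datetime) -> datetime.datetime:
    """Compute the expiration timestamp a created temp table would get.
    A plain TableReference has nowhere to record this, which is why a
    real tables.create call is needed."""
    return now + DEFAULT_EXPIRATION
```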
# A direct to_pandas call uses the global default setting
# (allow_large_results=True), so the table has the 'bqdf' prefix.
scalars_df_index.to_pandas()
assert scalars_df_index._query_job.destination.table_id.startswith("bqdf")
Seems like it would still be useful to have such a test. Why are we removing this part?
It was removed because it's not the main thing we test here; added it back.
tests/system/small/test_dataframe.py
Outdated
# The metrics won't be updated when we call query_and_wait.
assert session._metrics.execution_count == execution_count
Seems like we could still track the number of queries we execute, right? Shouldn't need a job object for that.
Updated the metrics to count the execution even when query_job is None.
Tests updated.
api_timeout=timeout,
)
if metrics is not None:
    metrics.count_job_stats()
The https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query response includes totalBytesProcessed. Let's create a follow-up issue to include those metrics, too.
Filed internal issue 400961399
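For that follow-up, the jobs.query REST response linked above encodes totalBytesProcessed as a string (the REST API serializes 64-bit integers that way). A hedged sketch of extracting it; the function name is a placeholder and the dicts in the usage test are fabricated example shapes, not real responses.

```python
from typing import Optional


def total_bytes_processed(response: dict) -> Optional[int]:
    """Extract totalBytesProcessed from a jobs.query REST response.
    Returns None when the field is absent (e.g. dry runs or cached
    responses that omit it)."""
    value = response.get("totalBytesProcessed")
    return int(value) if value is not None else None
```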