feat: Support dry_run in to_pandas() #1436
Conversation
Heads up: I suspect we might encounter some conflicts after #1448 merges.
bigframes/core/blocks.py
Outdated
df = pd.DataFrame(
    data={
        "dry_run_stats": [
            *self.dtypes,
            tuple(index_types) if len(index_types) > 1 else index_types[0],
            query_job.total_bytes_processed,
        ]
    },
    index=[*self.column_labels, "[index]", "total_bytes_processed"],
)
I find this pretty confusing. I think it would be more typical to have the job properties as separate columns. Alternatively, we could return a pd.Series if we do want a one-dimensional object.
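For illustration, here is a minimal sketch of the pd.Series alternative, reusing the same locals (self.dtypes, index_types, query_job) that appear in the diff above; this is only what the suggestion might look like, not the code the PR adopted:

import pandas as pd

# Sketch: return the dry-run stats as a one-dimensional Series instead of a
# single-column DataFrame. The data and labels mirror the diff above.
dry_run_stats = pd.Series(
    data=[
        *self.dtypes,
        tuple(index_types) if len(index_types) > 1 else index_types[0],
        query_job.total_bytes_processed,
    ],
    index=[*self.column_labels, "[index]", "total_bytes_processed"],
)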
bigframes/core/blocks.py
Outdated
            query_job.total_bytes_processed,
        ]
    },
    index=[*self.column_labels, "[index]", "total_bytes_processed"],
Why always [index]? Sometimes the index has a name. Also, sometimes the index is a multi-index.
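A hedged sketch of one way to address that, assuming index_names holds the 0+ level names of the block's index (a hypothetical local; the real accessor in blocks.py may be named differently):

# Sketch: derive one row label per index level instead of always "[index]",
# so named indexes and multi-indexes are reported faithfully.
index_row_labels = [
    name if name is not None else f"[index level {i}]"
    for i, name in enumerate(index_names)
]
# A named single index yields e.g. ["date"]; an unnamed two-level
# multi-index yields ["[index level 0]", "[index level 1]"].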
bigframes/core/blocks.py
Outdated
df = pd.DataFrame(
    data={
        "dry_run_stats": [
            *self.dtypes,
This feels a bit risky to me. What if one of the columns has the name total_bytes_processed? I would rather see dtypes as its own object column/row that contains the predicted dtypes as a single object.
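A sketch of that shape, with the stat names as assumptions on my part; the point is that the per-column dtypes live inside a single cell and therefore cannot collide with a user column that happens to be named total_bytes_processed:

import pandas as pd

# Sketch: keep all column dtypes in one entry of the stats object so the stat
# labels and the user's column labels never share a namespace.
dry_run_stats = pd.Series(
    {
        "dtypes": dict(zip(self.column_labels, self.dtypes)),
        "total_bytes_processed": query_job.total_bytes_processed,
    }
)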
bigframes/core/blocks.py
Outdated
    data={
        "dry_run_stats": [
            *self.dtypes,
            tuple(index_types) if len(index_types) > 1 else index_types[0],
Same here, maybe introduce index_dtypes and have that be the whole object of 0+ dtypes for a potential multi-index.
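And the index-side counterpart of the same idea, again only a sketch with an assumed key name:

# Sketch: report the index dtypes as one tuple, whatever the number of levels,
# instead of special-casing single-level indexes.
dry_run_stats["index_dtypes"] = tuple(index_types)  # () / (dtype,) / (dtype_a, dtype_b, ...)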
I used a DataFrame for dry run stats. It looks like this: https://screenshot.googleplex.com/gEEBU7aMqUy93SB Plus, I applied some "@overload" magic to help mypy deduce the return type correctly. Let me know your thoughts @tswast
Here's what I had in mind:

import copy
import pandas as pd
from google.colab import auth
from google.cloud import bigquery
# Authenticate the user
auth.authenticate_user()
# Initialize a BigQuery client
client = bigquery.Client(project='bigframes-dev') # Replace with your project ID
job_config = bigquery.QueryJobConfig()
job_config.dry_run = True
query = """
SELECT
name,
SUM(number) AS total
FROM
`bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY
name
ORDER BY
total DESC
LIMIT
10;
"""
job = client.query(query, job_config=job_config)
job_api_repr = copy.deepcopy(job._properties)
print(job_api_repr)
index = []
values = []
query_configuration = job_api_repr['configuration'].pop("query")
index.extend(query_configuration.keys())
values.extend(query_configuration.values())
query_statistics = job_api_repr['statistics'].pop("query")
index.extend(query_statistics.keys())
values.extend(query_statistics.values())
remaining_statistics = job_api_repr['statistics']
index.extend(remaining_statistics.keys())
values.extend(remaining_statistics.values())
series = pd.Series(values, index=index)
print(series)
Re: #1436 (comment) It would need
Now the stats look like this: https://screenshot.googleplex.com/BhaUb8uTYrdz7iM I hard-coded all the keys for value lookup in the dry run job.
Please find another way to make the type checker happy. I dislike the repeated code.
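One common pattern that keeps the precise return types without duplicating the call body is a pair of typing.overload stubs over a single implementation. A sketch of what that could look like here, not necessarily what the PR ended up doing:

from typing import Literal, Tuple, Union, overload

import pandas as pd
from google.cloud import bigquery


class Series:
    @overload
    def to_pandas(self, *, dry_run: Literal[False] = ...) -> pd.Series:
        ...

    @overload
    def to_pandas(
        self, *, dry_run: Literal[True]
    ) -> Tuple[pd.Series, bigquery.QueryJob]:
        ...

    def to_pandas(
        self, *, dry_run: bool = False
    ) -> Union[pd.Series, Tuple[pd.Series, bigquery.QueryJob]]:
        # Single implementation; the stubs above let mypy narrow the return
        # type when callers pass a literal True or False.
        raise NotImplementedError  # body elided in this sketch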
bigframes/core/blocks.py
Outdated
@@ -549,6 +582,11 @@ def to_pandas(
        else:
            sampling = sampling.with_disabled()

        if dry_run:
            if sampling.enable_downsampling:
                raise NotImplementedError("Dry run with sampling is not supproted")
Suggested change:
-                raise NotImplementedError("Dry run with sampling is not supproted")
+                raise NotImplementedError("Dry run with sampling is not supported")
bigframes/core/blocks.py
Outdated
    series, query_job = self._block.select_columns([]).to_pandas(
        ordered=ordered,
        allow_large_results=allow_large_results,
        dry_run=dry_run,
    )
    return series, query_job

df, query_job = self._block.select_columns([]).to_pandas(
The select_columns([]) confuses me, but I see that was here before. Please refactor these a bit so that self._block.select_columns([]) is saved to a variable since it is in common with both.
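A sketch of that refactor (the variable name is mine):

# Sketch: build the empty-column block once and reuse it on both paths.
empty_block = self._block.select_columns([])
if dry_run:
    series, query_job = empty_block.to_pandas(
        ordered=ordered,
        allow_large_results=allow_large_results,
        dry_run=dry_run,
    )
    return series, query_job
df, query_job = empty_block.to_pandas(
    ordered=ordered, allow_large_results=allow_large_results
)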
Alternatively, we can get rid of this if statement and rename the variable to df_or_series.
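And a sketch of that second suggestion, collapsing the branch into one call; df_or_series would be a stats object for dry runs and a real result otherwise:

# Sketch: no if statement; the single call's result depends on dry_run.
df_or_series, query_job = self._block.select_columns([]).to_pandas(
    ordered=ordered,
    allow_large_results=allow_large_results,
    dry_run=dry_run,
)
return df_or_series, query_job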
bigframes/core/blocks.py
Outdated
if dry_run:
    series, query_job = self._block.select_columns([]).to_pandas(
        ordered=ordered,
        allow_large_results=allow_large_results,
I think allow_large_results shouldn't have an effect on dry run queries, as that controls the destination table property.
bigframes/core/blocks.py
Outdated
    df, query_job = self._block.select_columns([]).to_pandas(
-       ordered=ordered, allow_large_results=allow_large_results
+       ordered=ordered, allow_large_results=allow_large_results, dry_run=dry_run
Why include the dry_run argument here if we know it's false?
bigframes/core/blocks.py
Outdated
    df, query_job = self._block.select_columns([]).to_pandas(
-       ordered=ordered, allow_large_results=allow_large_results
+       ordered=ordered, allow_large_results=allow_large_results, dry_run=dry_run
Why include the dry_run argument here if we know it's false?
bigframes/core/indexes/base.py
Outdated
        self._query_job = query_job
        return series

    # Repeat the to_pandas() call to make mypy deduce type correctly, because mypy cannot resolve
Why don't you just use bool consistently, then?
Yeah that's true... I moved the
Presubmits failed:
Hard to say if it's related to this PR.
* feat: Support dry_run in to_pandas()
* centralize dry_run logics at block level
* fix lint errors
* remove unnecessary code
* use dataframe for dry_run stats
* flatten the job stats to a series
* fix lint
* 🦉 Updates from OwlBot post-processor. See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md
* fix query job issue
* Make pandas surface directly call block._compute_dry_run
* type hint update

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>