
feat: Support dry_run in to_pandas() #1436


Merged

sycai merged 32 commits into main from sycai_to_pandas_dry_run on Mar 19, 2025

Conversation


@sycai sycai commented Feb 27, 2025

No description provided.

product-auto-label bot added labels size: m (Pull request size is medium) and api: bigquery (Issues related to the googleapis/python-bigquery-dataframes API) on Feb 27, 2025
@sycai sycai force-pushed the sycai_to_pandas_dry_run branch from 00eed4b to fe82c6d Compare February 28, 2025 22:05
@sycai sycai marked this pull request as ready for review February 28, 2025 23:19
@sycai sycai requested review from a team as code owners February 28, 2025 23:19
@sycai sycai requested a review from TrevorBergeron February 28, 2025 23:19
@sycai sycai requested a review from tswast February 28, 2025 23:20

tswast commented Mar 5, 2025

Heads up: I suspect we might encounter some conflicts after #1448 merges. I think that allow_large_results change is more urgent, so let's refrain from merging this until that's in.

Comment on lines 815 to 824
df = pd.DataFrame(
    data={
        "dry_run_stats": [
            *self.dtypes,
            tuple(index_types) if len(index_types) > 1 else index_types[0],
            query_job.total_bytes_processed,
        ]
    },
    index=[*self.column_labels, "[index]", "total_bytes_processed"],
)
Collaborator:

I find this pretty confusing. I think it would be more typical to have the job properties as separate columns.

Alternatively, we could return a pd.Series if we do want a one-dimensional object.
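
For illustration, a minimal sketch of the separate-columns layout (a hypothetical helper; the two QueryJob properties shown do exist on google.cloud.bigquery jobs):

import pandas as pd

def dry_run_stats_frame(query_job) -> pd.DataFrame:
    # One row, one column per job property, so each stat keeps its
    # own column and dtype.
    return pd.DataFrame(
        {
            "total_bytes_processed": [query_job.total_bytes_processed],
            "cache_hit": [query_job.cache_hit],
        }
    )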

            query_job.total_bytes_processed,
        ]
    },
    index=[*self.column_labels, "[index]", "total_bytes_processed"],
Collaborator:

Why always [index]? Sometimes the index has a name. Also, sometimes the index is a multi-index.
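
A sketch of index labeling that respects names and multi-indexes (a hypothetical helper, not code from this PR):

import pandas as pd

def index_labels(index: pd.Index) -> list:
    # index.names has one entry per level, so this handles named
    # indexes and multi-indexes; "[index]" is only a fallback.
    return [name if name is not None else "[index]" for name in index.names]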

df = pd.DataFrame(
    data={
        "dry_run_stats": [
            *self.dtypes,
Collaborator:

This feels a bit risky to me. What if a column has the name total_bytes_processed? I would rather see dtypes as its own object column/row that contains the predicted dtypes as a single object.

    data={
        "dry_run_stats": [
            *self.dtypes,
            tuple(index_types) if len(index_types) > 1 else index_types[0],
Collaborator:

Same here: maybe introduce index_dtypes and have it be a single object holding the 0+ dtypes of a potential multi-index.
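
A sketch of that layout (hypothetical helper; the dtypes and index_dtypes inputs are assumed); keeping each collection as one object value also avoids any collision between column labels and stat names:

import pandas as pd

def dry_run_stats(query_job, dtypes, index_dtypes) -> pd.Series:
    return pd.Series(
        {
            # Whole collections stored as single object entries.
            "dtypes": dict(dtypes),
            "index_dtypes": list(index_dtypes),  # 0+ levels for a multi-index
            "total_bytes_processed": query_job.total_bytes_processed,
        }
    )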


sycai commented Mar 6, 2025

I used a DataFrame for the dry run stats, just like describe() does.

It looks like this: https://screenshot.googleplex.com/gEEBU7aMqUy93SB

Plus, I applied some "@overload" magic to make mypy happy.

Let me know your thoughts @tswast
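
For reference, the general shape of the @overload pattern mentioned above (signatures heavily simplified and hypothetical; the real method also returns the query job):

from typing import Literal, Union, overload

import pandas as pd

class Block:
    @overload
    def to_pandas(self, dry_run: Literal[False] = ...) -> pd.DataFrame: ...

    @overload
    def to_pandas(self, dry_run: Literal[True]) -> pd.Series: ...

    def to_pandas(self, dry_run: bool = False) -> Union[pd.DataFrame, pd.Series]:
        # mypy selects an overload from the literal value of dry_run, so
        # callers get a precise return type without any cast.
        if dry_run:
            return pd.Series({"total_bytes_processed": 0})  # placeholder stats
        return pd.DataFrame()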

@sycai sycai requested a review from tswast March 6, 2025 01:44

tswast commented Mar 6, 2025

Here's what I had in mind:

import copy

import pandas as pd
from google.colab import auth
from google.cloud import bigquery

# Authenticate the user
auth.authenticate_user()

# Initialize a BigQuery client
client = bigquery.Client(project='bigframes-dev') # Replace with your project ID

job_config = bigquery.QueryJobConfig()
job_config.dry_run = True

query = """
SELECT
  name,
  SUM(number) AS total
FROM
  `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY
  name
ORDER BY
  total DESC
LIMIT
  10;
"""

job = client.query(query, job_config=job_config)

job_api_repr = copy.deepcopy(job._properties)
print(job_api_repr)
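# Flatten the nested job properties: lift the query-specific
# configuration and statistics into flat (index, value) pairs.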
index = []
values = []
query_configuration = job_api_repr['configuration'].pop("query")
index.extend(query_configuration.keys())
values.extend(query_configuration.values())

query_statistics = job_api_repr['statistics'].pop("query")
index.extend(query_statistics.keys())
values.extend(query_statistics.values())

remaining_statistics = job_api_repr['statistics']
index.extend(remaining_statistics.keys())
values.extend(remaining_statistics.values())

series = pd.Series(values, index=index)
print(series)


tswast commented Mar 6, 2025

Re: #1436 (comment)

It would need:

  • some careful checks in case some of those keys don't exist,
  • an allowlist of keys, so we only expose the job stats we've validated as useful to a bigframes user (see the sketch after this list),
  • dtypes and index_dtypes added for the expected output types (possibly removing schema).
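
A sketch of what that could look like (the key paths are assumed from the BigQuery jobs API representation; the helper itself is hypothetical):

# Allowlist of job properties we have validated as useful to a
# bigframes user, expressed as paths into the job's API representation.
_DRY_RUN_STATS_ALLOWLIST = (
    ("configuration", "query", "useLegacySql"),
    ("statistics", "query", "statementType"),
    ("statistics", "query", "totalBytesProcessed"),
    ("statistics", "creationTime"),
)

def extract_dry_run_stats(job_api_repr: dict) -> dict:
    stats = {}
    for path in _DRY_RUN_STATS_ALLOWLIST:
        value = job_api_repr
        for key in path:
            # Careful lookup: some keys are absent for some job types.
            value = value.get(key) if isinstance(value, dict) else None
        stats[path[-1]] = value
    return stats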

product-auto-label bot added the size: l (Pull request size is large) label and removed the size: m (Pull request size is medium) label on Mar 6, 2025

sycai commented Mar 6, 2025

Now the stats look like this: https://screenshot.googleplex.com/BhaUb8uTYrdz7iM

I hard-coded all the keys for value lookup in the dry run job.

@tswast tswast (Collaborator) left a comment

Please find another way to make the type checker happy. I dislike the repeated code.

@@ -549,6 +582,11 @@ def to_pandas(
    else:
        sampling = sampling.with_disabled()

    if dry_run:
        if sampling.enable_downsampling:
            raise NotImplementedError("Dry run with sampling is not supproted")
Collaborator:

Suggested change
raise NotImplementedError("Dry run with sampling is not supproted")
raise NotImplementedError("Dry run with sampling is not supported")

Comment on lines 2816 to 2823
series, query_job = self._block.select_columns([]).to_pandas(
    ordered=ordered,
    allow_large_results=allow_large_results,
    dry_run=dry_run,
)
return series, query_job

df, query_job = self._block.select_columns([]).to_pandas(
Collaborator:

The select_columns([]) confuses me, but I see that it was here before. Please refactor these a bit so that self._block.select_columns([]) is saved to a variable, since it is common to both calls.

Collaborator:

Alternatively, we can get rid of this if statement and rename the variable df_or_series.
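
Combining both suggestions, the body might look roughly like this (a fragment sketched against the snippet above; self and the surrounding method come from the original code, so it is not runnable on its own):

# Save the common sub-expression once; both the dry run and the
# real execution paths use the same empty-column block.
empty_block = self._block.select_columns([])

df_or_series, query_job = empty_block.to_pandas(
    ordered=ordered,
    allow_large_results=allow_large_results,
    dry_run=dry_run,
)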

if dry_run:
    series, query_job = self._block.select_columns([]).to_pandas(
        ordered=ordered,
        allow_large_results=allow_large_results,
Collaborator:

I think allow_large_results shouldn't have an effect on dry run queries, as that controls the destination table property.

df, query_job = self._block.select_columns([]).to_pandas(
ordered=ordered, allow_large_results=allow_large_results
ordered=ordered, allow_large_results=allow_large_results, dry_run=dry_run
Collaborator:

Why include the dry_run argument here if we know it's false?


self._query_job = query_job
return series

# Repeat the to_pandas() call to make mypy deduce type correctly, because mypy cannot resolve
Collaborator:

Why don't you just use bool consistently, then?


sycai commented Mar 13, 2025

Please find another way to make the type checker happy. I dislike the repeated code.

Yeah, that's true... I moved the dry_run handling (block._compute_dry_run) from block.to_pandas() into DataFrame/Series/Index.to_pandas(), which should clean up the code a bit. Plus, there are no more signature overloads on block.to_pandas().
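
Roughly the shape described (method and attribute names as discussed in the thread; a simplified fragment of a surface-level method, not the PR's exact code):

def to_pandas(self, *, dry_run: bool = False, **kwargs):
    # DataFrame/Series/Index branch before touching the block, so
    # Block.to_pandas() keeps a single signature and return type.
    if dry_run:
        return self._block._compute_dry_run()
    return self._block.to_pandas(**kwargs)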

@sycai sycai requested a review from tswast March 13, 2025 22:49

tswast commented Mar 19, 2025

Presubmits failed:

FAILED tests/system/small/test_dataframe.py::test_df_peek[partial] - assert (...
FAILED tests/system/small/test_dataframe.py::test_df_peek[strict] - assert (6...

Hard to say if it's related to this PR.

@sycai sycai merged commit 75fc7e0 into main Mar 19, 2025
24 checks passed
@sycai sycai deleted the sycai_to_pandas_dry_run branch March 19, 2025 22:11
shobsi pushed a commit that referenced this pull request Mar 28, 2025
* feat: Support dry_run in to_pandas()

* centralize dry_run logic at block level

* fix lint errors

* remove unnecessary code

* use dataframe for dry_run stats

* flatten the job stats to a series

* fix lint

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* fix query job issue

* Make pandas surface directly call block._compute_dry_run

* type hint update

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>