
feat: Support dry_run in to_pandas() #1436


Merged

sycai merged 32 commits into main from sycai_to_pandas_dry_run on Mar 19, 2025

Conversation


@sycai sycai commented Feb 27, 2025

No description provided.

product-auto-label bot added labels size: m (Pull request size is medium) and api: bigquery (Issues related to the googleapis/python-bigquery-dataframes API) on Feb 27, 2025
@sycai sycai force-pushed the sycai_to_pandas_dry_run branch from 00eed4b to fe82c6d Compare February 28, 2025 22:05
@sycai sycai marked this pull request as ready for review February 28, 2025 23:19
@sycai sycai requested review from a team as code owners February 28, 2025 23:19
@sycai sycai requested a review from TrevorBergeron February 28, 2025 23:19
@sycai sycai requested a review from tswast February 28, 2025 23:20

tswast commented Mar 5, 2025

Heads up: I suspect we might encounter some conflicts after #1448 merges. I think that allow_large_results change is more urgent, so let's refrain from merging this until that's in.

Comment on lines 815 to 824
df = pd.DataFrame(
    data={
        "dry_run_stats": [
            *self.dtypes,
            tuple(index_types) if len(index_types) > 1 else index_types[0],
            query_job.total_bytes_processed,
        ]
    },
    index=[*self.column_labels, "[index]", "total_bytes_processed"],
)
Collaborator:

I find this pretty confusing. I think it would be more typical to have the job properties as separate columns.

Alternatively, we could return a pd.Series if we do want a one-dimensional object.
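
For illustration, a minimal sketch of the separate-columns layout (a hypothetical helper; the two QueryJob properties shown do exist on google.cloud.bigquery jobs):

import pandas as pd

def dry_run_stats_frame(query_job) -> pd.DataFrame:
    # One row, one column per job property, so each stat keeps its
    # own column and dtype.
    return pd.DataFrame(
        {
            "total_bytes_processed": [query_job.total_bytes_processed],
            "cache_hit": [query_job.cache_hit],
        }
    )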

            query_job.total_bytes_processed,
        ]
    },
    index=[*self.column_labels, "[index]", "total_bytes_processed"],
Collaborator:

Why always [index]? Sometimes the index has a name. Also, sometimes the index is a multi-index.
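
A sketch of index labeling that respects names and multi-indexes (a hypothetical helper, not code from this PR):

import pandas as pd

def index_labels(index: pd.Index) -> list:
    # index.names has one entry per level, so this handles named
    # indexes and multi-indexes; "[index]" is only a fallback.
    return [name if name is not None else "[index]" for name in index.names]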

df = pd.DataFrame(
    data={
        "dry_run_stats": [
            *self.dtypes,
Collaborator:

This feels a bit risky to me. What if a column has the name total_bytes_processed? I would rather see dtypes as its own object column/row that contains the predicted dtypes as a single object.

    data={
        "dry_run_stats": [
            *self.dtypes,
            tuple(index_types) if len(index_types) > 1 else index_types[0],
Collaborator:

Same here: maybe introduce index_dtypes and have it be a single object holding the 0+ dtypes of a potential multi-index.
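
A sketch of that layout (hypothetical helper; the dtypes and index_dtypes inputs are assumed); keeping each collection as one object value also avoids any collision between column labels and stat names:

import pandas as pd

def dry_run_stats(query_job, dtypes, index_dtypes) -> pd.Series:
    return pd.Series(
        {
            # Whole collections stored as single object entries.
            "dtypes": dict(dtypes),
            "index_dtypes": list(index_dtypes),  # 0+ levels for a multi-index
            "total_bytes_processed": query_job.total_bytes_processed,
        }
    )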


sycai commented Mar 6, 2025

I used a DataFrame for the dry run stats, just like describe() does.

It looks like this: https://screenshot.googleplex.com/gEEBU7aMqUy93SB

Plus, I applied some "@overload" magic to make mypy happy.

Let me know your thoughts @tswast
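
For reference, the general shape of the @overload pattern mentioned above (signatures heavily simplified and hypothetical; the real method also returns the query job):

from typing import Literal, Union, overload

import pandas as pd

class Block:
    @overload
    def to_pandas(self, dry_run: Literal[False] = ...) -> pd.DataFrame: ...

    @overload
    def to_pandas(self, dry_run: Literal[True]) -> pd.Series: ...

    def to_pandas(self, dry_run: bool = False) -> Union[pd.DataFrame, pd.Series]:
        # mypy selects an overload from the literal value of dry_run, so
        # callers get a precise return type without any cast.
        if dry_run:
            return pd.Series({"total_bytes_processed": 0})  # placeholder stats
        return pd.DataFrame()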

@sycai sycai requested a review from tswast March 6, 2025 01:44

tswast commented Mar 6, 2025

Here's what I had in mind:

import copy

import pandas as pd
from google.colab import auth
from google.cloud import bigquery

# Authenticate the user
auth.authenticate_user()

# Initialize a BigQuery client
client = bigquery.Client(project='bigframes-dev') # Replace with your project ID

job_config = bigquery.QueryJobConfig()
job_config.dry_run = True

query = """
SELECT
  name,
  SUM(number) AS total
FROM
  `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY
  name
ORDER BY
  total DESC
LIMIT
  10;
"""

job = client.query(query, job_config=job_config)

job_api_repr = copy.deepcopy(job._properties)
print(job_api_repr)
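# Flatten the nested job properties: lift the query-specific
# configuration and statistics into flat (index, value) pairs.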
index = []
values = []
query_configuration = job_api_repr['configuration'].pop("query")
index.extend(query_configuration.keys())
values.extend(query_configuration.values())

query_statistics = job_api_repr['statistics'].pop("query")
index.extend(query_statistics.keys())
values.extend(query_statistics.values())

remaining_statistics = job_api_repr['statistics']
index.extend(remaining_statistics.keys())
values.extend(remaining_statistics.values())

series = pd.Series(values, index=index)
print(series)


tswast commented Mar 6, 2025

Re: #1436 (comment)

It would need:

  • some careful checks in case some of those keys don't exist,
  • an allowlist of keys, so we only expose the job stats we've validated as useful to a bigframes user (see the sketch after this list),
  • dtypes and index_dtypes added for the expected output types (possibly removing schema).
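
A sketch of what that could look like (the key paths are assumed from the BigQuery jobs API representation; the helper itself is hypothetical):

# Allowlist of job properties we have validated as useful to a
# bigframes user, expressed as paths into the job's API representation.
_DRY_RUN_STATS_ALLOWLIST = (
    ("configuration", "query", "useLegacySql"),
    ("statistics", "query", "statementType"),
    ("statistics", "query", "totalBytesProcessed"),
    ("statistics", "creationTime"),
)

def extract_dry_run_stats(job_api_repr: dict) -> dict:
    stats = {}
    for path in _DRY_RUN_STATS_ALLOWLIST:
        value = job_api_repr
        for key in path:
            # Careful lookup: some keys are absent for some job types.
            value = value.get(key) if isinstance(value, dict) else None
        stats[path[-1]] = value
    return stats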

product-auto-label bot added the size: l (Pull request size is large) label and removed the size: m (Pull request size is medium) label on Mar 6, 2025

sycai commented Mar 6, 2025

Now the stats look like this: https://screenshot.googleplex.com/BhaUb8uTYrdz7iM

I hard-coded all the keys for value lookup in the dry run job.

@tswast tswast (Collaborator) left a comment

Please find another way to make the type checker happy. I dislike the repeated code.

@@ -549,6 +582,11 @@ def to_pandas(
    else:
        sampling = sampling.with_disabled()

    if dry_run:
        if sampling.enable_downsampling:
            raise NotImplementedError("Dry run with sampling is not supproted")
Collaborator:

Suggested change
raise NotImplementedError("Dry run with sampling is not supproted")
raise NotImplementedError("Dry run with sampling is not supported")

Comment on lines 2816 to 2823
series, query_job = self._block.select_columns([]).to_pandas(
    ordered=ordered,
    allow_large_results=allow_large_results,
    dry_run=dry_run,
)
return series, query_job

df, query_job = self._block.select_columns([]).to_pandas(
Collaborator:

The select_columns([]) confuses me, but I see that it was here before. Please refactor these a bit so that self._block.select_columns([]) is saved to a variable, since it is common to both calls.

Collaborator:

Alternatively, we can get rid of this if statement and rename the variable df_or_series.
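
Combining both suggestions, the body might look roughly like this (a fragment sketched against the snippet above; self and the surrounding method come from the original code, so it is not runnable on its own):

# Save the common sub-expression once; both the dry run and the
# real execution paths use the same empty-column block.
empty_block = self._block.select_columns([])

df_or_series, query_job = empty_block.to_pandas(
    ordered=ordered,
    allow_large_results=allow_large_results,
    dry_run=dry_run,
)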

if dry_run:
    series, query_job = self._block.select_columns([]).to_pandas(
        ordered=ordered,
        allow_large_results=allow_large_results,
Collaborator:

I think allow_large_results shouldn't have an effect on dry run queries, as that controls the destination table property.

df, query_job = self._block.select_columns([]).to_pandas(
ordered=ordered, allow_large_results=allow_large_results
ordered=ordered, allow_large_results=allow_large_results, dry_run=dry_run
Collaborator:

Why include the dry_run argument here if we know it's false?


self._query_job = query_job
return series

# Repeat the to_pandas() call to make mypy deduce type correctly, because mypy cannot resolve
Collaborator:

Why don't you just use bool consistently, then?


sycai commented Mar 13, 2025

Please find another way to make the type checker happy. I dislike the repeated code.

Yeah, that's true... I moved the dry_run handling (block._compute_dry_run) from block.to_pandas() into DataFrame/Series/Index.to_pandas(), which should clean up the code a bit. Plus, there are no more signature overloads on block.to_pandas().
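
Roughly the shape described (method and attribute names as discussed in the thread; a simplified fragment of a surface-level method, not the PR's exact code):

def to_pandas(self, *, dry_run: bool = False, **kwargs):
    # DataFrame/Series/Index branch before touching the block, so
    # Block.to_pandas() keeps a single signature and return type.
    if dry_run:
        return self._block._compute_dry_run()
    return self._block.to_pandas(**kwargs)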

@sycai sycai requested a review from tswast March 13, 2025 22:49

tswast commented Mar 19, 2025

Presubmits failed:

FAILED tests/system/small/test_dataframe.py::test_df_peek[partial] - assert (...
FAILED tests/system/small/test_dataframe.py::test_df_peek[strict] - assert (6...

Hard to say if it's related to this PR.

@sycai sycai merged commit 75fc7e0 into main Mar 19, 2025
24 checks passed
@sycai sycai deleted the sycai_to_pandas_dry_run branch March 19, 2025 22:11
shobsi pushed a commit that referenced this pull request Mar 28, 2025
* feat: Support dry_run in to_pandas()

* centralize dry_run logic at block level

* fix lint errors

* remove unnecessary code

* use dataframe for dry_run stats

* flatten the job stats to a series

* fix lint

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* fix query job issue

* Make pandas surface directly call block._compute_dry_run

* type hint update

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>