refactor: refactor read_csv and tests based on bigquery vs. pandas behavior comparison #1595

chelsea-lin · 2025-04-04T22:07:59Z

No description provided.

sycai

Thanks for the cleanup!

sycai · 2025-04-04T23:14:40Z

bigframes/session/__init__.py

            )
-            return self._read_pandas(pandas_df, api_name="read_csv", write_engine=write_engine)  # type: ignore
+
+    def _read_csv_w_pandas_engines(


naming nit: "_read_csv_w_pandas_engine" ?

pandas.read_csv supports multiple engines, such as "c", "python", "pyarrow", and "python-fwf". I added a docstring to the method to make it easier to read.

sycai · 2025-04-04T23:19:12Z

bigframes/session/__init__.py

+        if header is None:
+            job_config.skip_leading_rows = 0
+        elif header > 0:
+            job_config.skip_leading_rows = header + 1


Non-rhetorical question: why "header + 1" here when the original version is just "header"?

The refactored tests compare BigQuery and pandas behavior, and they caught the bug addressed here. Even with the fix, column naming mismatches still exist, as reported in internal issue 409070192.
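
One way to read the "header + 1" change, as a minimal sketch rather than the actual bigframes code: it assumes pandas semantics, where `header=N` is the zero-based index of the header line (so data starts at line N + 1), and it assumes `skip_leading_rows` counts every line consumed before the data begins. The helper name `skip_leading_rows_for_header` is hypothetical, and the `header == 0` case is not shown in the diff, so it is left unset here.

```python
from typing import Optional


def skip_leading_rows_for_header(header: Optional[int]) -> Optional[int]:
    """Hypothetical helper mirroring the two branches shown in the diff."""
    if header is None:
        # No header line at all: every line is data, so skip nothing.
        return 0
    if header > 0:
        # Lines 0..header-1 are discarded and line `header` holds the
        # column names, so header + 1 lines precede the first data row.
        return header + 1
    # header == 0 is not covered by the diff hunk; left unset in this sketch.
    return None


# Line 0 is junk, line 1 is the header, lines 2 and 3 are data.
lines = ["# exported by some tool", "a,b", "1,2", "3,4"]
skip = skip_leading_rows_for_header(1)
data_rows = lines[skip:]
print(data_rows)  # ['1,2', '3,4']
```

Passing plain `header` (1 in this example) would leave the header line "a,b" in the data, which is the kind of mismatch a side-by-side comparison with pandas would surface.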

sycai · 2025-04-04T23:22:37Z

tests/system/small/test_session.py

@@ -39,6 +39,33 @@
 from tests.system import utils


+@pytest.fixture(scope="module")
+def write_df_to_local_csv_file(scalars_df_index):


naming nit: maybe "df_and_local_csv" is better? We use noun phrases for most fixtures, after all.

sycai · 2025-04-04T23:23:20Z

tests/system/small/test_session.py

+
+
+@pytest.fixture(scope="module")
+def write_df_to_gcs_csv_file(scalars_df_index, gcs_folder):


naming nit: "df_and_gcs_csv"?

sycai · 2025-04-04T23:28:03Z

tests/system/small/test_session.py

-        index_col=False,
-    )
-    assert df.shape[0] == scalars_df_index.shape[0]
+    if index_col is False:


Perhaps we should separate this test method into two. "if-else" branches based on test input are usually a code smell.

go/unit-testing-practices?polyglot=python#logic:

"Tests written without operators or control structures are clearer since the reader doesn't have to do any mental computations to understand them, and are more likely to be correct since it's harder to have bugs in code without these constructs."

Good points. Done.
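
The split suggested above can be sketched as follows. This is an illustration only: `read_rows` is a hypothetical stand-in built on the standard library, not the bigframes `read_csv` API, and the test names are made up. The point is that each test fixes its own input and carries no branching in the test body.

```python
import csv
import io

CSV_TEXT = "id,name\n1,a\n2,b\n"


def read_rows(text, index_col=False):
    """Hypothetical CSV reader used only to give the tests something to call."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if index_col is False:
        # No index requested: return the rows in file order.
        return rows
    # Index requested: key each row by the value in that column.
    return {row[index_col]: row for row in rows}


def test_read_rows_without_index_col():
    rows = read_rows(CSV_TEXT, index_col=False)
    assert len(rows) == 2


def test_read_rows_with_index_col():
    rows = read_rows(CSV_TEXT, index_col="id")
    assert sorted(rows) == ["1", "2"]
```

Compared with a single parametrized test that branches on `index_col`, a failure here points directly at the case that broke.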

chelsea-lin requested a review from sycai April 4, 2025 22:07

chelsea-lin requested review from a team as code owners April 4, 2025 22:08

blunderbuss-gcf bot assigned jialuoo Apr 4, 2025

product-auto-label bot added the "size: l" and "api: bigquery" labels Apr 4, 2025

chelsea-lin force-pushed the main_chelsealin_refactorcsv branch from 623b794 to e3201f8 Compare April 4, 2025 22:09

sycai reviewed Apr 4, 2025

chelsea-lin added 2 commits April 7, 2025 18:35

refactor: refactor read_csv and tests based on bigquery vs. pandas be…

4f9b007

…havior comparison

address comments

b222b22

chelsea-lin force-pushed the main_chelsealin_refactorcsv branch from e3201f8 to b222b22 Compare April 7, 2025 18:35

chelsea-lin requested a review from sycai April 7, 2025 21:35

sycai approved these changes Apr 7, 2025

chelsea-lin merged commit 11c0e33 into main Apr 7, 2025
24 checks passed

chelsea-lin deleted the main_chelsealin_refactorcsv branch April 7, 2025 21:38
