Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: refactor read_csv and tests based on bigquery vs. pandas behavior comparison #1595

Merged
merged 2 commits into from
Apr 7, 2025

Conversation

chelsea-lin
Copy link
Contributor

No description provided.

@chelsea-lin chelsea-lin requested a review from sycai April 4, 2025 22:07
@chelsea-lin chelsea-lin requested review from a team as code owners April 4, 2025 22:08
@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Apr 4, 2025
@chelsea-lin chelsea-lin force-pushed the main_chelsealin_refactorcsv branch from 623b794 to e3201f8 Compare April 4, 2025 22:09
Copy link
Contributor

@sycai sycai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clean up!

)
return self._read_pandas(pandas_df, api_name="read_csv", write_engine=write_engine) # type: ignore

def _read_csv_w_pandas_engines(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming nit: "_read_csv_w_pandas_engine" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas.read_csv support multiple engines, such as "c", "python", "pyarrow", "python-fwf". Added docstring in the method for easier read.

if header is None:
job_config.skip_leading_rows = 0
elif header > 0:
job_config.skip_leading_rows = header + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-rhetorical question: why "header + 1" here when the original version is just "header"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New refactoring tests compare BigQuery and Pandas behavior and catch the bug mentioned here. However, despite the fix, column naming mismatches still exist, as reported in internal issue 409070192.

@@ -39,6 +39,33 @@
from tests.system import utils


@pytest.fixture(scope="module")
def write_df_to_local_csv_file(scalars_df_index):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming nit: maybe "df_and_local_csv" is better? We are using a lot of noun phrases for fixtures after all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.



@pytest.fixture(scope="module")
def write_df_to_gcs_csv_file(scalars_df_index, gcs_folder):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming nit "df_and_gcs_csv"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index_col=False,
)
assert df.shape[0] == scalars_df_index.shape[0]
if index_col is False:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should separate this test method into two. "if-else" branches based on test input are usually a code smell.

go/unit-testing-practices?polyglot=python#logic:

"Tests written without operators or control structures are clearer since the reader doesn't have to do any mental computations to understand them, and are more likely to be correct since it's harder to have bugs in code without these constructs."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. Done.

@chelsea-lin chelsea-lin force-pushed the main_chelsealin_refactorcsv branch from e3201f8 to b222b22 Compare April 7, 2025 18:35
@chelsea-lin chelsea-lin requested a review from sycai April 7, 2025 21:35
@chelsea-lin chelsea-lin merged commit 11c0e33 into main Apr 7, 2025
24 checks passed
@chelsea-lin chelsea-lin deleted the main_chelsealin_refactorcsv branch April 7, 2025 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants