Add ability for Vectorized Scanner in write_pandas #2164

Open
wants to merge 4 commits into
base: main
from

Conversation

culpgrant

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1903333: Add ability for USE_VECTORIZED_SCANNER in write_pandas #2157

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am modifying authorization mechanisms
    • I am adding new credentials
    • I am modifying OCSP code
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

  • Allow the user to specify the USE_VECTORIZED_SCANNER parameter in the function write_pandas when running the SQL command COPY INTO
  4. (Optional) PR for stored-proc connector:
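
As a rough illustration of the change described above, here is a minimal sketch of how a FILE_FORMAT clause with the USE_VECTORIZED_SCANNER option could be rendered. The helper name `build_file_format_clause` and the choice to omit the option when it is `None` are assumptions made for this example; they are not taken from the PR's diff. Only the shape of the resulting string follows the test parameters quoted later in this conversation.

```python
from typing import Optional


def build_file_format_clause(
    compression: str = "auto",
    use_vectorized_scanner: Optional[bool] = None,
) -> str:
    """Hypothetical helper: render the FILE_FORMAT clause of a COPY INTO statement."""
    options = ["TYPE=PARQUET", f"COMPRESSION={compression}"]
    # Assumption: leave the option out entirely when the caller did not set it.
    if use_vectorized_scanner is not None:
        options.append(f"USE_VECTORIZED_SCANNER={str(use_vectorized_scanner).upper()}")
    return f"FILE_FORMAT=({' '.join(options)})"


if __name__ == "__main__":
    print(build_file_format_clause(use_vectorized_scanner=False))
```

Whether the real implementation always emits the option or only when it is set explicitly is not stated in this thread.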


github-actions bot commented Feb 3, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@culpgrant
Author

I have read the CLA Document and I hereby sign the CLA

@culpgrant culpgrant force-pushed the feature/pandas_tools_vectorized_scanner branch from dfaada9 to 24ff852 Compare February 3, 2025 04:12
@sfc-gh-dszmolka sfc-gh-dszmolka requested review from a team February 3, 2025 09:54
@culpgrant
Author

Hey @sfc-gh-dszmolka, I was wondering how long an initial review typically takes?

@sfc-gh-dszmolka
Contributor

I really cannot comment on it, as I do not own the resources responsible for reviewing the PRs. I'm very sorry to hear it's not fast enough, but I also don't have any advice at this point besides hoping the team eventually gets there.

Collaborator

@sfc-gh-mkeller sfc-gh-mkeller left a comment

Code/feature looks good to me, but I no longer own the Python connector. I'll add some reviewers to move the review forward though

Collaborator

@sfc-gh-mmishchenko sfc-gh-mmishchenko left a comment

Not sure about the added test. It's on one hand an integration test using a real database connection, and on the other hand it mocks all its subsequent queries. Maybe there's a chance it can be converted into a pure unit test?

(False, "FILE_FORMAT=(TYPE=PARQUET COMPRESSION=auto USE_VECTORIZED_SCANNER=FALSE)"),
],
)
def test_write_pandas_use_vectorized_scanner(
Collaborator

Doesn't this test make some assumptions about the internals of the write_pandas implementation?

Author

Yeah, I have updated the test to a pure unit test for this vectorized scanner functionality.

Comment on lines 507 to 509
cur = SnowflakeCursor(cnx)
cur._result = iter([])
return cur
Collaborator

How sure are we that write_pandas will always tolerate this result of execute, and that some future unrelated change won't break this test as a side effect?

Comment on lines 505 to 506
if len(args) >= 1 and args[0].startswith("COPY INTO"):
    assert expected_file_format in args[0]
Collaborator

Will write_pandas always make just one COPY INTO query?

@culpgrant culpgrant force-pushed the feature/pandas_tools_vectorized_scanner branch from 469aa95 to 0d2df84 Compare February 20, 2025 17:27
@culpgrant
Author

> Not sure about the added test. It's on one hand an integration test using a real database connection, and on the other hand it mocks all its subsequent queries. Maybe there's a chance it can be converted into a pure unit test?

Yeah, I was mostly following the existing integration tests, using test_table_location_building as a guideline. I am about to push a new change with a fully mocked, pure unit test, because that is probably the better approach here.
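
For readers following the review comments, here is a minimal sketch of what a fully mocked check could look like, assuming only the standard library's `unittest.mock`. The names `copy_statements`, `fake_write`, and `check_vectorized_scanner`, as well as the table and stage names, are hypothetical. `fake_write` is a stand-in for the code under test and does not reproduce write_pandas or the test that was actually pushed. The sketch inspects every COPY INTO statement rather than assuming there is exactly one, and it does not depend on what `execute` returns.

```python
from unittest import mock


def copy_statements(cursor: mock.MagicMock) -> list:
    """Collect every SQL string passed to cursor.execute that starts with COPY INTO."""
    statements = []
    for call in cursor.execute.call_args_list:
        sql = call.args[0] if call.args else ""
        if isinstance(sql, str) and sql.lstrip().startswith("COPY INTO"):
            statements.append(sql)
    return statements


def fake_write(cursor, use_vectorized_scanner: bool) -> None:
    """Hypothetical stand-in for the code under test (not the connector's code)."""
    flag = str(use_vectorized_scanner).upper()
    cursor.execute("CREATE TEMPORARY STAGE tmp_stage")
    cursor.execute(
        "COPY INTO my_table FROM @tmp_stage "
        f"FILE_FORMAT=(TYPE=PARQUET COMPRESSION=auto USE_VECTORIZED_SCANNER={flag})"
    )


def check_vectorized_scanner(use_vectorized_scanner: bool) -> list:
    """Run the stand-in against a mocked cursor and verify all COPY INTO statements."""
    cursor = mock.MagicMock()
    fake_write(cursor, use_vectorized_scanner)
    expected = f"USE_VECTORIZED_SCANNER={str(use_vectorized_scanner).upper()}"
    statements = copy_statements(cursor)
    assert statements, "expected at least one COPY INTO statement"
    assert all(expected in sql for sql in statements)
    return statements
```

Asserting after the call, on the recorded `call_args_list`, keeps the check independent of how many other queries run before or after COPY INTO.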

@culpgrant culpgrant force-pushed the feature/pandas_tools_vectorized_scanner branch from e17c05e to ccaf75f Compare February 25, 2025 04:05
@culpgrant
Author

@sfc-gh-mmishchenko Would you be able to take a look? I updated the test to a pure unit test.

Development

Successfully merging this pull request may close these issues.

SNOW-1903333: Add ability for USE_VECTORIZED_SCANNER in write_pandas
4 participants