feat(pyspark): support partitioning in PySpark backend file writes #10850

Merged: 1 commit merged into ibis-project:main on Feb 19, 2025

Conversation

jakepenzak
Contributor

@jakepenzak jakepenzak commented Feb 15, 2025

Description of changes

  • Enabled partitioning in the create_table and to_parquet methods of the PySpark backend (partitioning was already supported by to_delta)
    • Added a partition_by argument to the PySpark backend's create_table method
    • Overrode the PySpark backend's to_parquet method to use pyspark.sql.DataFrameWriter, following the same pattern as the existing to_delta override, so that the writer's partitioning kwargs can be passed through
    • Added tests to verify that partitioned writes behave as expected
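
The writer pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the helper name `write_partitioned_parquet` and its signature are hypothetical, and `df` is assumed to be a `pyspark.sql.DataFrame`. Only the `DataFrameWriter` calls themselves (`format`, `mode`, `partitionBy`, `options`, `save`) are PySpark API.

```python
from typing import Any, Iterable, Optional, Union


def write_partitioned_parquet(
    df: Any,
    path: str,
    *,
    mode: str = "overwrite",
    partition_by: Optional[Union[str, Iterable[str]]] = None,
    **options: Any,
) -> None:
    """Write a PySpark DataFrame to Parquet, optionally partitioned.

    Hypothetical helper for illustration only; `df` is assumed to be a
    pyspark.sql.DataFrame, whose `.write` is a DataFrameWriter.
    """
    # Start a DataFrameWriter configured for Parquet output.
    writer = df.write.format("parquet").mode(mode)

    if partition_by is not None:
        # Accept either a single column name or an iterable of names.
        if isinstance(partition_by, str):
            cols = [partition_by]
        else:
            cols = list(partition_by)
        writer = writer.partitionBy(*cols)

    # Forward any remaining keyword arguments as writer options, then write.
    writer.options(**options).save(path)
```

With a real Spark session, a call such as `write_partitioned_parquet(spark_df, "/tmp/out", partition_by="year")` would write one subdirectory per distinct value of the partition column (for example `year=2024/`) under the target path.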

Issues closed

github-actions bot added the tests and pyspark labels Feb 15, 2025
jakepenzak changed the title from "feat(pyspark): support partition_by in create_table method" to "feat(pyspark): support partitioning in PySpark backend file writes" Feb 15, 2025
Member

@cpcloud cpcloud left a comment

LGTM, thanks!

🚢 it!

@cpcloud
Member

cpcloud commented Feb 19, 2025

I'll fix up the remote test failures, which should probably be skipped given that the writing location for a remote spark instance isn't well-defined (unless it's a bucket, but the tests deal only in local file paths).

Adds the partitionBy argument to the create_table method in the PySpark backend to enable partitioned table creation

fixes ibis-project#8900
@cpcloud cpcloud enabled auto-merge (rebase) February 19, 2025 11:20
@cpcloud cpcloud merged commit c99cc23 into ibis-project:main Feb 19, 2025
88 of 89 checks passed
Development

Successfully merging this pull request may close these issues.

feat(pyspark): support partition_by key for PySpark file writes