
feat(pyspark): support partition_by key for PySpark file writes #8900


Closed
1 task done
deepyaman opened this issue Apr 5, 2024 · 3 comments · Fixed by #10850
Labels
feature Features or general enhancements io Issues related to input and/or output pyspark The Apache PySpark backend

Comments

@deepyaman
Contributor

Is your feature request related to a problem?

I can't specify partition key.

Describe the solution you'd like

I'd like to be able to specify the partition key. Basically, PySpark's writer takes a partitionBy argument. We need to figure out whether to simply alias the partition_by argument passed to the write method as partitionBy.

We also need to verify that the read path works; the current test relies on a `/*/*` wildcard pattern, and PySpark may require resolving that to an explicit set of paths to read.
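The aliasing described above can be sketched as a small kwarg-translation step. This is a hypothetical illustration (the function name `build_writer_kwargs` is not part of the ibis API); the real implementation would forward the result to `pyspark.sql.DataFrameWriter`:

```python
# Hypothetical sketch: forwarding a snake_case `partition_by` option to
# PySpark's camelCase `partitionBy` writer argument. `build_writer_kwargs`
# is an illustrative name, not an actual ibis function.

def build_writer_kwargs(partition_by=None, **kwargs):
    """Translate an ibis-style partition_by option into PySpark writer kwargs."""
    if partition_by is not None:
        # Accept either a single column name or a list of column names.
        if isinstance(partition_by, str):
            partition_by = [partition_by]
        kwargs["partitionBy"] = list(partition_by)
    return kwargs
```

The resulting dict could then be splatted into a PySpark write call such as `df.write.parquet(path, **writer_kwargs)`.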

What version of ibis are you running?

8

What backend(s) are you using, if any?

PySpark

Code of Conduct

  • I agree to follow this project's Code of Conduct
@deepyaman deepyaman added the feature Features or general enhancements label Apr 5, 2024
@gforsyth gforsyth added pyspark The Apache PySpark backend io Issues related to input and/or output labels Apr 5, 2024
@gforsyth gforsyth changed the title feat(pyspark): support partition_by key for PySpark feat(pyspark): support partition_by key for PySpark file writes Apr 5, 2024
@mark-druffel

mark-druffel commented Feb 14, 2025

My team came across this and needs to use partition_by. Is this already being worked on? If not, we'd love to open a PR and give it a shot.

@jakepenzak

@cpcloud
Member

cpcloud commented Feb 14, 2025

It is not being worked on, please give it a go!

jakepenzak added a commit to jakepenzak/ibis that referenced this issue Feb 15, 2025
…artitionBy argument to create_table method in pyspark backend to enable partitioned table creation

fixes ibis-project#8900
jakepenzak added a commit to jakepenzak/ibis that referenced this issue Feb 15, 2025
Adds the partitionBy argument to create_table method in pyspark backend to enable partitioned table creation

fixes ibis-project#8900
@jakepenzak
Contributor

jakepenzak commented Feb 15, 2025

Threw up this PR (#10850) that adds a partition_by argument to the create_table method to enable partitioning in writes to a Hive database, but it missed that the to_parquet method doesn't work for partitioning.

It looks like, for the PySpark backend, partitioning is available for to_delta natively via pyspark.sql.DataFrameWriter kwargs. However, it isn't supported for to_parquet due to what I believe to be a lack of native support for partitioning in pyarrow.parquet.ParquetWriter. We can potentially override the to_parquet method in the PySpark backend to leverage pyspark.sql.DataFrameWriter, following the same pattern as the overridden to_delta method. I'll go ahead and follow this pattern, but I'm open to other thoughts or recommendations.
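The DataFrameWriter call pattern proposed above can be illustrated with a minimal stand-in for the real writer. This is a hedged sketch, not the actual ibis override: `FakeWriter` and `write_parquet` are hypothetical names used only to show the chaining through `partitionBy` before `parquet`:

```python
class FakeWriter:
    """Minimal stand-in for pyspark.sql.DataFrameWriter, used only to
    illustrate the call pattern (the real writer is returned by df.write)."""

    def __init__(self):
        self.partition_cols = None
        self.saved_path = None

    def partitionBy(self, *cols):
        # The real DataFrameWriter.partitionBy also returns the writer,
        # which is what makes the fluent chaining work.
        self.partition_cols = cols
        return self

    def parquet(self, path, **kwargs):
        self.saved_path = path


def write_parquet(writer, path, partition_by=None, **kwargs):
    """Mirrors the proposed to_parquet override: route through the
    DataFrameWriter rather than pyarrow.parquet.ParquetWriter so that
    partitionBy (and other writer kwargs) are honored."""
    if partition_by is not None:
        if isinstance(partition_by, str):
            partition_by = [partition_by]
        writer = writer.partitionBy(*partition_by)
    writer.parquet(path, **kwargs)
```

With a real SparkSession, the same shape would be `df.write.partitionBy("year").parquet(path)`.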

jakepenzak added a commit to jakepenzak/ibis that referenced this issue Feb 15, 2025
Override to_parquet method in pyspark backend to leverage
pyspark.sql.DataFrameWriter to enable partitioning and other kwargs

fixes ibis-project#8900
cpcloud pushed a commit to jakepenzak/ibis that referenced this issue Feb 19, 2025
Adds the partitionBy argument to create_table method in pyspark backend to enable partitioned table creation

fixes ibis-project#8900
cpcloud pushed a commit to jakepenzak/ibis that referenced this issue Feb 19, 2025
Override to_parquet method in pyspark backend to leverage
pyspark.sql.DataFrameWriter to enable partitioning and other kwargs

fixes ibis-project#8900
cpcloud pushed a commit to jakepenzak/ibis that referenced this issue Feb 19, 2025
Adds the partitionBy argument to create_table method in pyspark backend to enable partitioned table creation

fixes ibis-project#8900
cpcloud pushed a commit that referenced this issue Feb 19, 2025
Adds the partitionBy argument to create_table method in pyspark backend to enable partitioned table creation

fixes #8900
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Feb 19, 2025