
feat(pyspark): support partition_by key for PySpark file writes #8900


Closed
1 task done
deepyaman opened this issue Apr 5, 2024 · 3 comments · Fixed by #10850
Labels
feature Features or general enhancements io Issues related to input and/or output pyspark The Apache PySpark backend

Comments

@deepyaman
Contributor

Is your feature request related to a problem?

I can't specify partition key.

Describe the solution you'd like

I'd like to be able to specify the partition key. Basically, PySpark's writer takes a partitionBy argument. We need to figure out whether to simply alias the partition_by argument passed to the write method as partitionBy.

We also need to verify that the read path works; the current test relies on a `/*/*` wildcard pattern, and PySpark may require resolving that to an explicit set of paths to read.
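The aliasing described above can be sketched as a small kwarg-translation step. This is a hypothetical illustration (the function name `build_writer_kwargs` is not part of the ibis API); the real implementation would forward the result to `pyspark.sql.DataFrameWriter`:

```python
# Hypothetical sketch: forwarding a snake_case `partition_by` option to
# PySpark's camelCase `partitionBy` writer argument. `build_writer_kwargs`
# is an illustrative name, not an actual ibis function.

def build_writer_kwargs(partition_by=None, **kwargs):
    """Translate an ibis-style partition_by option into PySpark writer kwargs."""
    if partition_by is not None:
        # Accept either a single column name or a list of column names.
        if isinstance(partition_by, str):
            partition_by = [partition_by]
        kwargs["partitionBy"] = list(partition_by)
    return kwargs
```

The resulting dict could then be splatted into a PySpark write call such as `df.write.parquet(path, **writer_kwargs)`.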

What version of ibis are you running?

8

What backend(s) are you using, if any?

PySpark

Code of Conduct

  • I agree to follow this project's Code of Conduct
@deepyaman deepyaman added the feature Features or general enhancements label Apr 5, 2024
@gforsyth gforsyth added pyspark The Apache PySpark backend io Issues related to input and/or output labels Apr 5, 2024
@gforsyth gforsyth changed the title feat(pyspark): support partition_by key for PySpark feat(pyspark): support partition_by key for PySpark file writes Apr 5, 2024
@mark-druffel

mark-druffel commented Feb 14, 2025

My team came across this and needs to use partition_by. Is this already being worked on? If not, we'd love to open a PR and give it a shot.

@jakepenzak

@cpcloud
Member

cpcloud commented Feb 14, 2025

It is not being worked on, please give it a go!

jakepenzak added a commit to jakepenzak/ibis that referenced this issue Feb 15, 2025
…artitionBy argument to create_table method in pyspark backend to enable partitioned table creation

fixes ibis-project#8900
jakepenzak added a commit to jakepenzak/ibis that referenced this issue Feb 15, 2025
Adds the partitionBy argument to create_table method in pyspark backend to enable partitioned table creation

fixes ibis-project#8900
@jakepenzak
Contributor

jakepenzak commented Feb 15, 2025

Threw up this PR (#10850) that adds a partition_by argument to the create_table method to enable partitioning in writes to a Hive database, but it missed that the to_parquet method doesn't work for partitioning.

It looks like, for the PySpark backend, partitioning is available for to_delta natively via pyspark.sql.DataFrameWriter kwargs. However, it isn't supported for to_parquet due to what I believe to be a lack of native support for partitioning in pyarrow.parquet.ParquetWriter. We can potentially override the to_parquet method in the PySpark backend to leverage pyspark.sql.DataFrameWriter, following the same pattern as the overridden to_delta method. I'll go ahead and follow this pattern, but I'm open to other thoughts or recommendations.
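The DataFrameWriter call pattern proposed above can be illustrated with a minimal stand-in for the real writer. This is a hedged sketch, not the actual ibis override: `FakeWriter` and `write_parquet` are hypothetical names used only to show the chaining through `partitionBy` before `parquet`:

```python
class FakeWriter:
    """Minimal stand-in for pyspark.sql.DataFrameWriter, used only to
    illustrate the call pattern (the real writer is returned by df.write)."""

    def __init__(self):
        self.partition_cols = None
        self.saved_path = None

    def partitionBy(self, *cols):
        # The real DataFrameWriter.partitionBy also returns the writer,
        # which is what makes the fluent chaining work.
        self.partition_cols = cols
        return self

    def parquet(self, path, **kwargs):
        self.saved_path = path


def write_parquet(writer, path, partition_by=None, **kwargs):
    """Mirrors the proposed to_parquet override: route through the
    DataFrameWriter rather than pyarrow.parquet.ParquetWriter so that
    partitionBy (and other writer kwargs) are honored."""
    if partition_by is not None:
        if isinstance(partition_by, str):
            partition_by = [partition_by]
        writer = writer.partitionBy(*partition_by)
    writer.parquet(path, **kwargs)
```

With a real SparkSession, the same shape would be `df.write.partitionBy("year").parquet(path)`.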

jakepenzak added a commit to jakepenzak/ibis that referenced this issue Feb 15, 2025
Override to_parquet method in pyspark backend to leverage
pyspark.sql.DataFrameWriter to enable partitioning and other kwargs

fixes ibis-project#8900
cpcloud pushed a commit to jakepenzak/ibis that referenced this issue Feb 19, 2025
Adds the partitionBy argument to create_table method in pyspark backend to enable partitioned table creation

fixes ibis-project#8900
cpcloud pushed a commit to jakepenzak/ibis that referenced this issue Feb 19, 2025
Override to_parquet method in pyspark backend to leverage
pyspark.sql.DataFrameWriter to enable partitioning and other kwargs

fixes ibis-project#8900
cpcloud pushed a commit to jakepenzak/ibis that referenced this issue Feb 19, 2025
Adds the partitionBy argument to create_table method in pyspark backend to enable partitioned table creation

fixes ibis-project#8900
cpcloud pushed a commit that referenced this issue Feb 19, 2025
Adds the partitionBy argument to create_table method in pyspark backend to enable partitioned table creation

fixes #8900
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Feb 19, 2025