feat(python, rust): Add GPU support to sink_* APIs #20940
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main   #20940      +/-   ##
==========================================
- Coverage   80.79%   80.78%    -0.02%
==========================================
  Files        1639     1639
  Lines      235551   235557        +6
  Branches     2714     2714
==========================================
- Hits       190315   190283       -32
- Misses      44595    44633       +38
  Partials      641      641
```
@wence- any suggestions on how we should expose the sink options here? (See polars/crates/polars-python/src/lazyframe/visitor/nodes.rs, lines 562 to 565 at 15fa50a.)
I think I need some more context here to understand what is going on. Where does that pipeline come from when you do `sink_csv`?
So with this PR at 6679b4b and rapidsai/cudf#17938 at 73bf00f, this is the error I'm getting so far:

```python
In [1]: import polars as pl

In [2]: pl.LazyFrame({"1": 2}).sink_csv("foo.csv", engine="gpu")
<ipython-input-2-efb5e662c021>:1: DeprecationWarning: The old streaming engine is being deprecated and will soon be replaced by the new streaming engine. Starting Polars version 1.23.0 and until the new streaming engine is released, the old streaming engine may become less usable. For people who rely on the old streaming engine, it is suggested to pin your version to before 1.23.0.
More information on the new streaming engine: https://github.com/pola-rs/polars/issues/20947
  pl.LazyFrame({"1": 2}).sink_csv("foo.csv", engine="gpu")
/polars/py-polars/polars/lazyframe/frame.py:2806: PerformanceWarning: Query execution with GPU not possible: unsupported operations.
The errors were:
- NotImplementedError: pipeline mapfunction
  return lf.sink_csv(
run UdfExec
RUN STREAMING PIPELINE
[df -> ordered_sink]
```

Where the …

So I'm assuming we'll need to translate the …
Ah OK, I see. If you …
```rust
// Unpack the arenas.
// At this point the `nt` is useless.

std::mem::swap(lp_arena, &mut *arenas.0.lock().unwrap());
```
Thanks for the tips @wence- at #20940 (comment).
Now I'm able to get cudf to write the csv successfully, but the polars call raises this error. I think the failure might be somewhere here?
```python
In [1]: import polars as pl

In [2]: pl.LazyFrame({"1": 2}).sink_csv("foo.csv", engine="gpu")
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[2], line 1
----> 1 pl.LazyFrame({"1": 2}).sink_csv("foo.csv", engine="gpu")

File ~/polars/py-polars/polars/_utils/unstable.py:58, in unstable.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     55 @wraps(function)
     56 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     57     issue_unstable_warning(f"`{function.__name__}` is considered unstable.")
---> 58     return function(*args, **kwargs)

File ~/polars/py-polars/polars/lazyframe/frame.py:2806, in LazyFrame.sink_csv(self, path, include_bom, include_header, separator, line_terminator, quote_char, batch_size, datetime_format, date_format, time_format, float_scientific, float_precision, null_value, quote_style, maintain_order, type_coercion, _type_check, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, collapse_joins, no_optimization, storage_options, credential_provider, retries, engine)
   2803 else:
   2804     callback = None
-> 2806 return lf.sink_csv(
   2807     path=normalize_filepath(path),
   2808     include_bom=include_bom,
   2809     include_header=include_header,
   2810     separator=ord(separator),
   2811     line_terminator=line_terminator,
   2812     quote_char=ord(quote_char),
   2813     batch_size=batch_size,
   2814     datetime_format=datetime_format,
   2815     date_format=date_format,
   2816     time_format=time_format,
   2817     float_scientific=float_scientific,
   2818     float_precision=float_precision,
   2819     null_value=null_value,
   2820     quote_style=quote_style,
   2821     maintain_order=maintain_order,
   2822     cloud_options=storage_options,
   2823     credential_provider=credential_provider,
   2824     retries=retries,
   2825     lambda_post_opt=callback,
   2826 )

ComputeError: expected tuple got None

In [3]: cat foo.csv
0
2
```
The problem is that the callback inserts a `PythonScan` node into the IR. The execution of this node expects the callback to return either a `DataFrame` or a non-empty iterable of dataframe chunks. Of course, because we're sinking to a file, we can return no such thing. I suspect we will want an equivalent to the `PythonScan` callback, which might be a `PythonSink`. Any thoughts @ritchie46?
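To illustrate the mismatch, a hypothetical sketch (the callback names and signatures here are illustrative, not the actual polars internals):

```python
import polars as pl

def scan_callback(ir: object) -> pl.DataFrame:
    # A PythonScan-style callback must produce data: the engine consumes
    # the returned DataFrame (or an iterable of chunks).
    return pl.DataFrame({"1": [2]})

def sink_callback(ir: object) -> None:
    # A sink callback writes the result to a file as a side effect and has
    # nothing to hand back, which is what trips the
    # "ComputeError: expected tuple got None" above.
    pl.DataFrame({"1": [2]}).write_csv("foo.csv")
```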
This needs some architectural changes. The `Sink` architecture is unstable at the moment, and we will connect the new streaming engine in a few weeks. Once we have that, I think we can prepare for a "special" kind of sink that offloads to the GPU engine and doesn't expect a `DataFrame` to be returned. Currently that is an assumption the engine makes.
I think this is blocked until we have some architectural changes on the sink side. I expect to get to this in a few weeks.
py-polars/polars/lazyframe/frame.py (outdated)

```diff
@@ -2615,6 +2615,7 @@ def sink_csv(
         | Literal["auto"]
         | None = "auto",
         retries: int = 2,
+        engine: EngineType = "cpu",
```
Suggested change:

```diff
-        engine: EngineType = "cpu",
+        engine: EngineType = "streaming",
```
py-polars/polars/lazyframe/frame.py (outdated)

```diff
@@ -2721,6 +2722,26 @@ def sink_csv(
             at any point without it being considered a breaking change.
         retries
             Number of retries if accessing a cloud instance fails.
+        engine
+            Select the engine used to write the query result, optional.
+            If set to `"cpu"` (default), the query result is written using the
```
Our "cpu" engine typically means our in-memory engine, this doesn't support sinking without collecting into memory first. I think this should only have "streaming" and "gpu" options.
```diff
@@ -181,7 +181,7 @@ where
             Box::new(OrderedSink::new(input_schema.into_owned())) as Box<dyn SinkTrait>
         },
         #[allow(unused_variables)]
-        SinkTypeIR::File(FileSinkType {
+        SinkTypeIRf::File(FileSinkType {
```
Suggested change:

```diff
-        SinkTypeIRf::File(FileSinkType {
+        SinkTypeIR::File(FileSinkType {
```
```rust
// Hand-written conversion of the CSV writer options into a Python dict,
// built field by field.
impl<'py> IntoPyObject<'py> for Wrap<CsvWriterOptions> {
    type Target = PyDict;
    type Output = Bound<'py, Self::Target>;
    type Error = PyErr;

    fn into_pyobject(self, py: Python<'py>) -> Result<Self::Output, Self::Error> {
        let dict = PyDict::new(py);
        let _ = dict.set_item("include_bom", self.0.include_bom);
        let _ = dict.set_item("include_header", self.0.include_header);
        let _ = dict.set_item("batch_size", self.0.batch_size);
        let _ = dict.set_item("serialize_options", Wrap(self.0.serialize_options));
        Ok(dict)
    }
}

impl<'py> IntoPyObject<'py> for Wrap<SerializeOptions> {
    type Target = PyDict;
    type Output = Bound<'py, Self::Target>;
    type Error = PyErr;

    fn into_pyobject(self, py: Python<'py>) -> Result<Self::Output, Self::Error> {
        let dict = PyDict::new(py);
        let _ = dict.set_item("date_format", self.0.date_format);
        let _ = dict.set_item("time_format", self.0.time_format);
        let _ = dict.set_item("datetime_format", self.0.datetime_format);
        let _ = dict.set_item("float_scientific", self.0.float_scientific);
        let _ = dict.set_item("float_precision", self.0.float_precision);
        let _ = dict.set_item("separator", self.0.separator);
        let _ = dict.set_item("quote_char", self.0.quote_char);
        let _ = dict.set_item("null", self.0.null);
        let _ = dict.set_item("line_terminator", self.0.line_terminator);
        let _ = dict.set_item("quote_style", Wrap(self.0.quote_style));
        Ok(dict)
    }
}

impl<'py> IntoPyObject<'py> for Wrap<QuoteStyle> {
    type Target = PyString;
    type Output = Bound<'py, Self::Target>;
    type Error = Infallible;
    // … (remainder of the snippet truncated in the review)
```
suggestion: These are "small" dicts, but to avoid them getting out of date, I've been using the automatic serde-based serialisation for these options, which also isolates things to just the translation layer.
```rust
FileType::Parquet(options) => Box::new(ParquetSink::new(
    path,
    *options,
    input_schema.as_ref(),
    cloud_options.as_ref(),
)?) as Box<dyn SinkTrait>,
#[cfg(feature = "ipc")]
FileType::Ipc(options) => Box::new(IpcSink::new(
    path,
    *options,
    input_schema.as_ref(),
    cloud_options.as_ref(),
)?) as Box<dyn SinkTrait>,
#[cfg(feature = "csv")]
FileType::Csv(options) => Box::new(CsvSink::new(
    path,
    options.clone(),
    input_schema.as_ref(),
    cloud_options.as_ref(),
)?) as Box<dyn SinkTrait>,
#[cfg(feature = "json")]
FileType::Json(options) => Box::new(JsonSink::new(
    path,
    *options,
    input_schema.as_ref(),
    cloud_options.as_ref(),
)?) as Box<dyn SinkTrait>,
#[allow(unreachable_patterns)]
_ => unreachable!(),
```
Perhaps we can just do the following to get all the relevant info, serialised as JSON (which means that if bits change we don't have to make too many matching updates here):

```rust
IR::Sink { input, payload } => Sink {
    input: input.0,
    payload: serde_json::to_string(payload)
        .map_err(|err| PyValueError::new_err(format!("{err:?}")))?,
}
.into_py_any(py),
```

WDYT?
Ah thanks! This works well. I see that is done for the IO readers as well.
@ritchie46 could you re-review this PR? (It's gotten a lot simpler with the dedicated `Sink` IR node.)
Yes, a lot simpler. :)
Depends on pola-rs/polars#20940. Closes #16738.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)
- Matthew Murray (https://github.com/Matt711)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)
- Lawrence Mitchell (https://github.com/wence-)

URL: #18468
xref #20259

Makes the `IR::Sink` node serializable to Python so the sink options can be exposed in cuDF.
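As a rough illustration of what this enables on the cuDF side (a hypothetical sketch; the payload field names here are made up, not the exact schema polars serialises):

```python
import json

# Hypothetical JSON payload handed over by the visitor's Sink node.
payload = '{"include_bom": false, "include_header": true, "batch_size": 1024}'

# The GPU engine can deserialise the sink options and apply them when
# writing the file itself.
options = json.loads(payload)
assert options["include_header"] is True
```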