feat: Add lazy sinks #21733

Merged
merged 5 commits into from
Mar 14, 2025

Conversation

coastalwhite
Collaborator

@coastalwhite coastalwhite commented Mar 13, 2025

This PR adds a boolean lazy flag to all sinks. When it is set to true, the sink returns a LazyFrame, and nothing is written until .collect() is called on it. That collect returns an empty DataFrame. This also makes it possible to combine sink_* with collect_all.

Example

import polars as pl
from pathlib import Path

p = Path("./")

lf = pl.LazyFrame({"a": [1, 2, 3]})
lf1 = lf.sink_parquet(p / "a.parquet", lazy=True)
lf2 = lf.sink_csv(p / "a.csv", lazy=True)

assert not Path(p / "a.parquet").exists()
assert not Path(p / "a.csv").exists()

pl.collect_all([lf1, lf2])

Fixes #6506.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Mar 13, 2025
@coastalwhite coastalwhite added highlight Highlight this PR in the changelog and removed python Related to Python Polars rust Related to Rust Polars enhancement New feature or an improvement of an existing feature labels Mar 13, 2025

codecov bot commented Mar 13, 2025

Codecov Report

Attention: Patch coverage is 93.60465% with 11 lines in your changes missing coverage. Please review.

Project coverage is 81.02%. Comparing base (ea82623) to head (555af63).
Report is 1 commits behind head on main.

Files with missing lines              Patch %   Lines
crates/polars-lazy/src/frame/mod.rs   92.37%    9 Missing ⚠️
py-polars/polars/functions/lazy.py    66.66%    0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #21733      +/-   ##
==========================================
- Coverage   81.03%   81.02%   -0.01%     
==========================================
  Files        1610     1610              
  Lines      233031   233003      -28     
  Branches     2685     2689       +4     
==========================================
- Hits       188837   188802      -35     
- Misses      43563    43570       +7     
  Partials      631      631              

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Mar 13, 2025
@thomasfrederikhoeck
Contributor

thomasfrederikhoeck commented Mar 13, 2025

@coastalwhite why return the empty DataFrame? Could that lead to misunderstandings with a query like:

pl.scan_csv("somefile").sink_csv("somefile2", lazy=True).collect().with_columns(pl.all() * 2).write_csv("somefile3")

Would it make more sense for .sink_* to return not a LazyFrame but its own type, such as Sink, which could return the file paths (or similar) when .collect() is called?
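
The alternative suggested here can be sketched without Polars at all. The `Sink` class and `collect_sinks` helper below are hypothetical and exist in no Polars release; they only illustrate a deferred write whose collect() returns the target path rather than a frame, so that chaining frame methods onto the result fails immediately instead of silently operating on an empty DataFrame.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Sink:
    """Hypothetical deferred write; NOT a Polars type."""

    path: Path
    write: Callable[[Path], None]  # performs the actual write when invoked

    def collect(self) -> Path:
        # Run the deferred write and return where the data went.
        self.write(self.path)
        return self.path


def collect_sinks(sinks: list[Sink]) -> list[Path]:
    """Hypothetical analogue of collect_all: run every sink, return the paths."""
    return [sink.collect() for sink in sinks]


with tempfile.TemporaryDirectory() as tmp:
    sink = Sink(Path(tmp) / "out.csv", lambda p: p.write_text("a\n1\n2\n3\n"))
    existed_before = sink.path.exists()  # nothing is written until collect()
    (returned,) = collect_sinks([sink])
    existed_after = returned.exists()
    # A path has no frame methods, so a follow-up .with_columns(...) would
    # raise AttributeError rather than run against an empty frame.
    has_frame_methods = hasattr(returned, "with_columns")
```

This sketch runs the sinks one after another; it says nothing about how a real implementation would schedule them.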

@coastalwhite
Collaborator Author

coastalwhite commented Mar 13, 2025

> @coastalwhite why return the empty DataFrame? Could that lead to misunderstandings with a query like:
>
> pl.scan_csv("somefile").sink_csv("somefile2", lazy=True).collect().with_columns(pl.all() * 2).write_csv("somefile3")
>
> Would it make more sense for .sink_* to return not a LazyFrame but its own type, such as Sink, which could return the file paths instead (or similar)?

We could maybe do that. It would require some magic on the Python side.

@thomasfrederikhoeck
Contributor

@coastalwhite the file path was just a suggestion of something that could be meaningful to return. Mostly I am worried about the collect returning the empty DataFrame.

@coastalwhite
Collaborator Author

After thinking about it a bit more, I don't think it is a good idea for now. If you explicitly pass lazy=True, I am assuming you are going to do something with collect_all, so it shouldn't matter that much.

@gab23r
Contributor

gab23r commented Mar 13, 2025

Fixed #6506

@ritchie46
Member

> Fixed #6506

Almost. The CSE part is incoming.

@coastalwhite coastalwhite merged commit 783a3d7 into pola-rs:main Mar 14, 2025
27 checks passed
@coastalwhite coastalwhite deleted the feat/lazy-sinks branch March 14, 2025 09:12
jsjasonseba pushed a commit to jsjasonseba/polars that referenced this pull request Mar 23, 2025
Labels
enhancement New feature or an improvement of an existing feature highlight Highlight this PR in the changelog python Related to Python Polars rust Related to Rust Polars
Development

Successfully merging this pull request may close these issues.

Write multiple parquet files in parallel
4 participants