Refactor polars configuration #18516
base: branch-25.06
Conversation
This updates our internal handling of the user-provided configuration for our polars GPUEngine. We use a set of dataclasses to manage the configuration options.
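The dataclass-based approach might look roughly like the following sketch. All names and options here are hypothetical illustrations, not the PR's actual classes: the idea is that nested dataclasses validate the user-provided `GPUEngine` options once, up front.

```python
import dataclasses
from typing import Any


@dataclasses.dataclass(frozen=True)
class ParquetOptions:
    # Hypothetical option; stands in for any validated leaf setting.
    chunked: bool = True


@dataclasses.dataclass(frozen=True)
class ConfigOptions:
    raise_on_fail: bool = False
    parquet_options: ParquetOptions = dataclasses.field(default_factory=ParquetOptions)

    @classmethod
    def from_dict(cls, config: dict[str, Any]) -> "ConfigOptions":
        # Build the nested dataclasses from the raw user dict; unknown keys
        # fail loudly via the dataclass constructors.
        config = dict(config)  # avoid mutating the caller's dict
        parquet = ParquetOptions(**config.pop("parquet_options", {}))
        return cls(parquet_options=parquet, **config)
```

With this shape, downstream code reads typed attributes (`options.parquet_options.chunked`) instead of chained dictionary lookups.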
We need to figure out exactly how much validation to do, and which errors we raise when we encounter an issue.
Looking into the CI failures
I wasn't able to immediately reproduce the failure. Edit: I wasn't passing the
One other thing to figure out: we have a couple other configuration options floating around as environment variables:
Perhaps not. But that can probably wait for a separate PR.
ConfigOptions are serialized when using the distributed scheduler. It's not immediately clear what serializing / deserializing a concrete MemoryResource object would mean.
Thanks @TomAugspurger, this is a nice change to make! I left some suggestions.
less than or equal to 1. Each factor estimates the fractional number of
unique values in the column. By default, a value of ``1.0`` is used
for any column not found in ``cardinality_factor``.
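A small illustration of the semantics documented above (the function name here is hypothetical, just to show the arithmetic): each factor estimates the fraction of unique values in a column, and unlisted columns default to ``1.0``.

```python
# Hypothetical usage sketch: cardinality_factor maps column names to the
# estimated fraction of unique values; unlisted columns default to 1.0.
cardinality_factor = {"user_id": 0.2}


def estimated_unique_count(column: str, num_rows: int) -> float:
    return cardinality_factor.get(column, 1.0) * num_rows
```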
parquet_blocksize
`parquet_blocksize` is related to partitioning, so it makes sense that it is in the `StreamingExecutor`. Can we name it something like `partition_blocksize` to avoid confusion, since there's a `ParquetOptions` dataclass too?
Good call. It's a bit verbose, but I think `parquet_partition_blocksize` might be best since this is specific to parquet. @rjzamora how does that sound? I can put in a bit of compat code to accept either the new or old name.
Took a first pass at this. I really like the overall design!
I pointed out a few details that need to be adjusted/resolved somehow.
```python
assert ir.config_options.executor.name == "streaming", (
    "'in-memory' executor not supported in 'generate_ir_tasks'"
)
```
I don't think these kinds of assertions are necessary. Is there a specific reason you think the "in-memory" executor might end up here?
The type signature of `ir.config_options.executor` is `StreamingExecutor | InMemoryExecutor`, so without this we get:
```
$ mypy python/cudf_polars/cudf_polars
python/cudf_polars/cudf_polars/experimental/groupby.py:222: error: Item "InMemoryExecutor" of "StreamingExecutor | InMemoryExecutor" has no attribute "cardinality_factor" [union-attr]
python/cudf_polars/cudf_polars/experimental/groupby.py:286: error: Item "InMemoryExecutor" of "StreamingExecutor | InMemoryExecutor" has no attribute "groupby_n_ary" [union-attr]
Found 2 errors in 1 file (checked 50 source files)
```
So as far as mypy knows, it is possible to get here with an `InMemoryExecutor`.
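For context, the reason the `assert` works is mypy's tagged-union narrowing: comparing a `Literal`-typed discriminant field narrows the union. A minimal sketch with simplified, hypothetical versions of the two executor classes:

```python
from __future__ import annotations

import dataclasses
import typing


@dataclasses.dataclass
class StreamingExecutor:
    name: typing.Literal["streaming"] = dataclasses.field(default="streaming", init=False)
    cardinality_factor: dict[str, float] = dataclasses.field(default_factory=dict)


@dataclasses.dataclass
class InMemoryExecutor:
    name: typing.Literal["in-memory"] = dataclasses.field(default="in-memory", init=False)


def cardinality(executor: StreamingExecutor | InMemoryExecutor) -> dict[str, float]:
    # Comparing the Literal "name" tag narrows the union for mypy, so the
    # attribute access below typechecks without a cast.
    assert executor.name == "streaming", "'in-memory' executor not supported here"
    return executor.cardinality_factor
```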
I think that if we wanted to avoid the `assert`s, we would need to make `IR` generic over `config_options` and `config_options` generic over `executor`. I can attempt that (though `IR` is subclassing the generic `Node`, so it might get a little tricky. We'll see).
Ah, okay - no need to avoid the assertions if it's a pain.
Someone more familiar with mypy and generics could probably make this work, but I got stuck on the first part: making `ConfigOptions` generic over the executor. Here's my test program that failed:
```python
from __future__ import annotations

import dataclasses
import typing


@dataclasses.dataclass
class A:
    name: typing.Literal["a"] = dataclasses.field(default="a", init=False)


@dataclasses.dataclass
class B:
    name: typing.Literal["b"] = dataclasses.field(default="b", init=False)


E = typing.TypeVar("E", A, B)


@dataclasses.dataclass
class C(typing.Generic[E]):
    value: E = dataclasses.field(default_factory=A)

    @classmethod
    def from_value(cls, value: str) -> C[E]:
        match value:
            case "a":
                return cls(value=A())
            case "b":
                return cls(value=B())
            case _:
                raise ValueError(f"Invalid value: {value}")


reveal_type(C.from_value("a"))
reveal_type(C.from_value("b"))
```
which gives
```
$ mypy t.py
t.py:22: error: Incompatible types in assignment (expression has type "A", variable has type "E")  [assignment]
t.py:28: error: Argument "value" to "C" has incompatible type "A"; expected "B"  [arg-type]
t.py:30: error: Argument "value" to "C" has incompatible type "B"; expected "A"  [arg-type]
t.py:35: note: Revealed type is "t.C[t.A]"
t.py:36: note: Revealed type is "t.C[t.A]"
Found 3 errors in 1 file (checked 1 source file)
```
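For what it's worth, one pattern that does typecheck in similar situations (sketched here with the same toy classes; this is not a claim about what would work for `ConfigOptions` itself) is to key `@typing.overload` signatures on `Literal` argument types, rather than asking TypeVar inference to pick the executor type:

```python
from __future__ import annotations

import dataclasses
import typing


@dataclasses.dataclass
class A:
    name: typing.Literal["a"] = dataclasses.field(default="a", init=False)


@dataclasses.dataclass
class B:
    name: typing.Literal["b"] = dataclasses.field(default="b", init=False)


E = typing.TypeVar("E", A, B)


@dataclasses.dataclass
class C(typing.Generic[E]):
    value: E

    # Each overload pins the Literal argument to a concrete C[...] result,
    # so reveal_type(C.from_value("b")) is C[B] rather than C[A].
    @typing.overload
    @classmethod
    def from_value(cls, value: typing.Literal["a"]) -> C[A]: ...

    @typing.overload
    @classmethod
    def from_value(cls, value: typing.Literal["b"]) -> C[B]: ...

    @classmethod
    def from_value(cls, value: str) -> C[A] | C[B]:
        if value == "a":
            return C(value=A())
        if value == "b":
            return C(value=B())
        raise ValueError(f"Invalid value: {value}")
```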
```python
>>> chunked = config_options.get("parquet_options.chunked")
```

```python
TASKS = "tasks"
RAPIDSMPF = "rapidsmpf"
```
We probably need an `"auto"` option to emulate the `None` default we had previously (i.e. we should try using "rapidsmpf" when we have a distributed scheduler and rapidsmpf is available, otherwise use "tasks").
I'll document this, but on `ConfigOptions` this is `shuffle_method: ShuffleMethod | None` and the default is to try rapidsmpf and fall back.

We could add `"auto"` as an allowed value here, but that would maybe require importing rapidsmpf when we initialize the config.
> I'll document this, but on ConfigOptions this is shuffle_method: ShuffleMethod | None and the default is to try rapidsmpf and fall back.

Ah, okay.

> We could add "auto" as an allowed value here, but that would maybe require importing rapidsmpf when we initialize the config.
I was imagining the case that `"auto"` is the shuffle method as far as the `ShuffleMethod` is concerned. We wouldn't check if rapidsmpf was available until we needed to build the graph.
Cool, that's what we should have now, it's just via `executor.shuffle_method is None`.
Is there an easy way to see which lines aren't covered in the CI failure here https://github.com/rapidsai/cudf/actions/runs/14580875852/job/40898672037?pr=18516#step:10:703? I have 100% coverage locally when I run

Edit: nevermind, I can reproduce it now. At least locally, it's just from the

Edit again: I think my failing one was getting overwritten by a subsequent run in that script. Anyway, the actual failure is
That line isn't changed, but clearly I've changed the semantics of some configuration value. I'll investigate.
This is a convenience class to help manage the nested
dictionary of user-accessible `GPUEngine` options.

Upon encountering an unsupported operation, the streaming executor will fall
back to using a single-partition, which might use a large amount of memory.
```diff
- back to using a single-partition, which might use a large amount of memory.
+ back to using a single partition, which might use a large amount of memory.
```
Nit.
* ``FallbackMode.WARN`` : Emit a warning and fall back to the CPU engine.
* ``FallbackMode.SILENT``: Silently fall back to the CPU engine.
```diff
- * ``FallbackMode.WARN`` : Emit a warning and fall back to the CPU engine.
- * ``FallbackMode.SILENT``: Silently fall back to the CPU engine.
+ * ``FallbackMode.WARN`` : Emit a warning and fall back to a single partition.
+ * ``FallbackMode.SILENT``: Silently fall back to a single partition.
```
```python
raise TypeError("broadcast_join_limit must be an int")
```
```python
def __hash__(self) -> int:
    # cardinatlity factory, a dict, isn't natively hashable. We'll dump it
```
```diff
- # cardinatlity factory, a dict, isn't natively hashable. We'll dump it
+ # cardinality factory, a dict, isn't natively hashable. We'll dump it
```
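The technique the comment describes can be sketched like this (hypothetical function and fields, not the PR's actual `__hash__`): a dict field isn't hashable, so hash a stable JSON dump of it alongside the already-hashable fields.

```python
import json


def hash_config(cardinality_factor: dict[str, float], blocksize: int) -> int:
    # sort_keys=True makes the dump order-independent, so two dicts with
    # the same items hash identically regardless of insertion order.
    return hash((json.dumps(cardinality_factor, sort_keys=True), blocksize))
```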
Description
There aren't any user-facing changes.