-
-
Notifications
You must be signed in to change notification settings - Fork 336
Add Polars Builtin Check Tests #1518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Polars Builtin Check Tests #1518
Conversation
is_in_min = ( | ||
n_chars.ge(min_value) if min_value is not None else pl.lit(True) | ||
) | ||
is_in_max = ( | ||
n_chars.le(max_value) if max_value is not None else pl.lit(True) | ||
) | ||
|
||
return data.dataframe.select(is_in_min.and_(is_in_max)) | ||
return data.dataframe.select(is_in_min.and_(is_in_max).alias(data.key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to explicitly alias here. If min_value
is None, then resulting column name will be literal
, which will break Polars lazy computation. Seems to be related to pola-rs/polars#14382
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this make sense to do instead:
if min_value is None and max_value is None:
raise ValueError("Must provide at least on of 'min_value' and 'max_value'")
n_chars = pl.col(data.key).str.n_chars()
if min_value is None:
expr = n_chars.le(max_value)
elif max_value is None:
expr = n_chars.ge(min_value)
else:
expr = n_chars.between(min_value, max_value)
return expr
@register_builtin_check( | ||
error="str_matches('{pattern}')", | ||
) | ||
def str_matches( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to remove the str_matches
and keep only str_contains
check with .str.contains(..., literal=False)
since it works for both strings and regex.
Also I see that the str_matches
check for Pandas searches for matches from the beginning of a string using the Pandas str.match()
method and I don't think there is an equivalent method in Polars. So sticking just with str.contains
seemed appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think where we can, check backends should implement the built-in check functions, even if there isn't a direct method supported in the underlying framework.
So the implementation would be something like
if isinstance(pattern, str):
pattern = f"^{pattern}"
else:
pattern = re.compile(f"^{pattern.pattern}")
return data.dataframe.select(
pl.col(data.key).str.contains(pattern=pattern)
)
@register_builtin_check( | ||
error="str_contains('{pattern}')", | ||
) | ||
def str_contains( | ||
data: PolarsData, | ||
pattern: str, | ||
pattern: re.Pattern, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be consistent with the public Check api: https://github.com/unionai-oss/pandera/blob/main/pandera/api/checks.py#L430
pattern: re.Pattern, | |
pattern: Union[str, re.Pattern], |
@register_builtin_check( | ||
error="str_matches('{pattern}')", | ||
) | ||
def str_matches( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think where we can, check backends should implement the built-in check functions, even if there isn't a direct method supported in the underlying framework.
So the implementation would be something like
if isinstance(pattern, str):
pattern = f"^{pattern}"
else:
pattern = re.compile(f"^{pattern.pattern}")
return data.dataframe.select(
pl.col(data.key).str.contains(pattern=pattern)
)
is_in_min = ( | ||
n_chars.ge(min_value) if min_value is not None else pl.lit(True) | ||
) | ||
is_in_max = ( | ||
n_chars.le(max_value) if max_value is not None else pl.lit(True) | ||
) | ||
|
||
return data.dataframe.select(is_in_min.and_(is_in_max)) | ||
return data.dataframe.select(is_in_min.and_(is_in_max).alias(data.key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this make sense to do instead:
if min_value is None and max_value is None:
raise ValueError("Must provide at least on of 'min_value' and 'max_value'")
n_chars = pl.col(data.key).str.n_chars()
if min_value is None:
expr = n_chars.le(max_value)
elif max_value is None:
expr = n_chars.ge(min_value)
else:
expr = n_chars.between(min_value, max_value)
return expr
Signed-off-by: Andrii G <[email protected]>
8c817ec
to
dee3ab7
Compare
Signed-off-by: cosmicBboy <[email protected]>
8e9a651
to
6d77c6f
Compare
Signed-off-by: cosmicBboy <[email protected]>
af4e31e
to
90963b5
Compare
Thanks for the contribution @AndriiG13! I updated some of the checks and tests, will merge this into |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## polars-dev #1518 +/- ##
=============================================
Coverage ? 83.06%
=============================================
Files ? 110
Lines ? 8018
Branches ? 0
=============================================
Hits ? 6660
Misses ? 1358
Partials ? 0 ☔ View full report in Codecov by Sentry. |
* add polars builtin check tests Signed-off-by: Andrii G <[email protected]> * update str_length, update tests Signed-off-by: cosmicBboy <[email protected]> * add test, update docstring Signed-off-by: cosmicBboy <[email protected]> --------- Signed-off-by: Andrii G <[email protected]> Signed-off-by: cosmicBboy <[email protected]> Co-authored-by: cosmicBboy <[email protected]>
* add polars builtin check tests Signed-off-by: Andrii G <[email protected]> * update str_length, update tests Signed-off-by: cosmicBboy <[email protected]> * add test, update docstring Signed-off-by: cosmicBboy <[email protected]> --------- Signed-off-by: Andrii G <[email protected]> Signed-off-by: cosmicBboy <[email protected]> Co-authored-by: cosmicBboy <[email protected]>
* add polars builtin check tests Signed-off-by: Andrii G <[email protected]> * update str_length, update tests Signed-off-by: cosmicBboy <[email protected]> * add test, update docstring Signed-off-by: cosmicBboy <[email protected]> --------- Signed-off-by: Andrii G <[email protected]> Signed-off-by: cosmicBboy <[email protected]> Co-authored-by: cosmicBboy <[email protected]>
* add polars builtin check tests Signed-off-by: Andrii G <[email protected]> * update str_length, update tests Signed-off-by: cosmicBboy <[email protected]> * add test, update docstring Signed-off-by: cosmicBboy <[email protected]> --------- Signed-off-by: Andrii G <[email protected]> Signed-off-by: cosmicBboy <[email protected]> Co-authored-by: cosmicBboy <[email protected]>
Fixes #1421
The structure for most of the tests was taken from the builtin check tests for PySpark