Skip to content

Add Polars Builtin Check Tests #1518

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

AndriiG13
Copy link
Contributor

@AndriiG13 AndriiG13 commented Mar 5, 2024

Fixes #1421

The structure for most of the tests was taken from the builtin check tests for PySpark

is_in_min = (
n_chars.ge(min_value) if min_value is not None else pl.lit(True)
)
is_in_max = (
n_chars.le(max_value) if max_value is not None else pl.lit(True)
)

return data.dataframe.select(is_in_min.and_(is_in_max))
return data.dataframe.select(is_in_min.and_(is_in_max).alias(data.key))
Copy link
Contributor Author

@AndriiG13 AndriiG13 Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to explicitly alias here. If min_value is None, then resulting column name will be literal, which will break Polars lazy computation. Seems to be related to pola-rs/polars#14382

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this make sense to do instead:

if min_value is None and max_value is None:
    raise ValueError("Must provide at least on of 'min_value' and 'max_value'")

n_chars = pl.col(data.key).str.n_chars()
if min_value is None:
    expr = n_chars.le(max_value)
elif max_value is None:
    expr = n_chars.ge(min_value)
else:
    expr = n_chars.between(min_value, max_value)

return expr

@register_builtin_check(
error="str_matches('{pattern}')",
)
def str_matches(
Copy link
Contributor Author

@AndriiG13 AndriiG13 Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to remove the str_matches and keep only str_contains check with .str.contains(..., literal=False) since it works for both strings and regex.

Also I see that the str_matches check for Pandas searches for matches from the beginning of a string using the Pandas str.match() method and I don't think there is an equivalent method in Polars. So sticking just with str.contains seemed appropriate

Copy link
Collaborator

@cosmicBboy cosmicBboy Mar 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think where we can, check backends should implement the built-in check functions, even if there isn't a direct method supported in the underlying framework.

So the implementation would be something like

    if isinstance(pattern, str):
        pattern = f"^{pattern}"
    else:
        pattern = re.compile(f"^{pattern.pattern}")
    return data.dataframe.select(
        pl.col(data.key).str.contains(pattern=pattern)
    )

@register_builtin_check(
error="str_contains('{pattern}')",
)
def str_contains(
data: PolarsData,
pattern: str,
pattern: re.Pattern,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be consistent with the public Check api: https://github.com/unionai-oss/pandera/blob/main/pandera/api/checks.py#L430

Suggested change
pattern: re.Pattern,
pattern: Union[str, re.Pattern],

@register_builtin_check(
error="str_matches('{pattern}')",
)
def str_matches(
Copy link
Collaborator

@cosmicBboy cosmicBboy Mar 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think where we can, check backends should implement the built-in check functions, even if there isn't a direct method supported in the underlying framework.

So the implementation would be something like

    if isinstance(pattern, str):
        pattern = f"^{pattern}"
    else:
        pattern = re.compile(f"^{pattern.pattern}")
    return data.dataframe.select(
        pl.col(data.key).str.contains(pattern=pattern)
    )

is_in_min = (
n_chars.ge(min_value) if min_value is not None else pl.lit(True)
)
is_in_max = (
n_chars.le(max_value) if max_value is not None else pl.lit(True)
)

return data.dataframe.select(is_in_min.and_(is_in_max))
return data.dataframe.select(is_in_min.and_(is_in_max).alias(data.key))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this make sense to do instead:

if min_value is None and max_value is None:
    raise ValueError("Must provide at least on of 'min_value' and 'max_value'")

n_chars = pl.col(data.key).str.n_chars()
if min_value is None:
    expr = n_chars.le(max_value)
elif max_value is None:
    expr = n_chars.ge(min_value)
else:
    expr = n_chars.between(min_value, max_value)

return expr

@cosmicBboy cosmicBboy changed the base branch from polars-backend to polars-dev March 8, 2024 04:37
@cosmicBboy cosmicBboy changed the base branch from polars-dev to polars-backend March 8, 2024 04:39
@cosmicBboy cosmicBboy force-pushed the add-polars-builtin-check-tests branch from 8c817ec to dee3ab7 Compare March 8, 2024 04:41
@cosmicBboy cosmicBboy force-pushed the add-polars-builtin-check-tests branch from 8e9a651 to 6d77c6f Compare March 8, 2024 05:28
@cosmicBboy cosmicBboy changed the base branch from polars-backend to polars-dev March 8, 2024 05:29
Signed-off-by: cosmicBboy <[email protected]>
@cosmicBboy cosmicBboy force-pushed the add-polars-builtin-check-tests branch from af4e31e to 90963b5 Compare March 8, 2024 05:36
@cosmicBboy
Copy link
Collaborator

Thanks for the contribution @AndriiG13! I updated some of the checks and tests, will merge this into polars-dev now

Copy link

codecov bot commented Mar 8, 2024

Codecov Report

Attention: Patch coverage is 0% with 14 lines in your changes are missing coverage. Please review.

❗ No coverage uploaded for pull request base (polars-dev@0a24f3e). Click here to learn what that means.

Files Patch % Lines
pandera/backends/polars/builtin_checks.py 0.00% 14 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             polars-dev    #1518   +/-   ##
=============================================
  Coverage              ?   83.06%           
=============================================
  Files                 ?      110           
  Lines                 ?     8018           
  Branches              ?        0           
=============================================
  Hits                  ?     6660           
  Misses                ?     1358           
  Partials              ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cosmicBboy cosmicBboy merged commit 6b6a8bf into unionai-oss:polars-dev Mar 8, 2024
cosmicBboy added a commit that referenced this pull request Mar 11, 2024
* add polars builtin check tests

Signed-off-by: Andrii G <[email protected]>

* update str_length, update tests

Signed-off-by: cosmicBboy <[email protected]>

* add test, update docstring

Signed-off-by: cosmicBboy <[email protected]>

---------

Signed-off-by: Andrii G <[email protected]>
Signed-off-by: cosmicBboy <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>
cosmicBboy added a commit that referenced this pull request Mar 11, 2024
* add polars builtin check tests

Signed-off-by: Andrii G <[email protected]>

* update str_length, update tests

Signed-off-by: cosmicBboy <[email protected]>

* add test, update docstring

Signed-off-by: cosmicBboy <[email protected]>

---------

Signed-off-by: Andrii G <[email protected]>
Signed-off-by: cosmicBboy <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>
cosmicBboy added a commit that referenced this pull request Mar 15, 2024
* add polars builtin check tests

Signed-off-by: Andrii G <[email protected]>

* update str_length, update tests

Signed-off-by: cosmicBboy <[email protected]>

* add test, update docstring

Signed-off-by: cosmicBboy <[email protected]>

---------

Signed-off-by: Andrii G <[email protected]>
Signed-off-by: cosmicBboy <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>
max-raphael pushed a commit to max-raphael/pandera that referenced this pull request Jan 24, 2025
* add polars builtin check tests

Signed-off-by: Andrii G <[email protected]>

* update str_length, update tests

Signed-off-by: cosmicBboy <[email protected]>

* add test, update docstring

Signed-off-by: cosmicBboy <[email protected]>

---------

Signed-off-by: Andrii G <[email protected]>
Signed-off-by: cosmicBboy <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants