Skip to content

Implement more Ibis data types and built-in checks #1906

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 15 commits into
base: ibis-dev
Choose a base branch
from

Conversation

deepyaman
Copy link
Collaborator

@deepyaman deepyaman commented Feb 8, 2025

Moved to draft, as I can just work off of this to add types and checks. Update: I'm still working on more built-in checks (on between now), but those can just as easily be additional PRs!

Will probably rebase-merge instead of squash-merging, since I'll keep the commit history as clean as possible.

Copy link

codecov bot commented Feb 8, 2025

Codecov Report

Attention: Patch coverage is 96.63866% with 4 lines in your changes missing coverage. Please review.

Project coverage is 93.15%. Comparing base (e1afe02) to head (bec7731).

Files with missing lines Patch % Lines
pandera/backends/ibis/checks.py 62.50% 3 Missing ⚠️
pandera/engines/ibis_engine.py 98.33% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           ibis-dev    #1906      +/-   ##
============================================
+ Coverage     93.13%   93.15%   +0.02%     
============================================
  Files           134      134              
  Lines          9831     9939     +108     
============================================
+ Hits           9156     9259     +103     
- Misses          675      680       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@deepyaman
Copy link
Collaborator Author

deepyaman commented Feb 8, 2025

Requires #1907 to pass Merged and rebased

@deepyaman deepyaman force-pushed the feat/ibis/more-builtin-checks branch 2 times, most recently from 94739cd to 5352be0 Compare February 9, 2025 21:11
@deepyaman deepyaman marked this pull request as draft February 9, 2025 21:11
@deepyaman deepyaman changed the title Implement "ne" built-in check for the Ibis backend Implement more Ibis data types and built-in checks Feb 9, 2025
@deepyaman deepyaman changed the title Implement more Ibis data types and built-in checks [WIP] Implement more Ibis data types and built-in checks Feb 9, 2025
@deepyaman
Copy link
Collaborator Author

Update: Currently struggling with getting timestamp tests to work, but I just discovered from_parametrized_dtype(), so hopefully I can unblock myself!

@deepyaman deepyaman force-pushed the feat/ibis/more-builtin-checks branch 2 times, most recently from d6e340c to 7050f47 Compare March 10, 2025 17:51
@deepyaman deepyaman force-pushed the feat/ibis/more-builtin-checks branch from fe326db to 8b67b07 Compare April 6, 2025 19:59
@deepyaman deepyaman force-pushed the feat/ibis/more-builtin-checks branch from 5bf40bc to 43b97ff Compare April 7, 2025 03:48
@deepyaman deepyaman marked this pull request as ready for review April 24, 2025 17:31
@deepyaman deepyaman changed the title [WIP] Implement more Ibis data types and built-in checks Implement more Ibis data types and built-in checks Apr 24, 2025
@deepyaman deepyaman requested review from cosmicBboy and Copilot April 24, 2025 17:32
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces additional Ibis data types and updates built‐in check implementations and their documentation across multiple backends. Key changes include:

  • Corrections and clarifications in docstrings for test functions and backend check functions.
  • Addition of new Ibis engine data types (e.g. UInt8, UInt16, Date, DateTime, Time, Timedelta) and corresponding built‐in checks.
  • Minor refactoring in error message text and section headers for improved clarity.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/polars/test_polars_builtin_checks.py Updated docstrings in test functions for consistency and clarity.
pandera/engines/polars_engine.py Revised docstring to correctly describe conversion from polars.dtype.
pandera/engines/pandas_engine.py Changed section header from "time" to "temporal".
pandera/engines/ibis_engine.py Added several new Ibis data type classes and updated formatting.
pandera/backends/pyspark/builtin_checks.py Improved documentation of parameters in built‐in check functions.
pandera/backends/polars/builtin_checks.py Refinements in docstrings for parameter descriptions.
pandera/backends/pandas/builtin_checks.py Docstring adjustments for consistency in check functions.
pandera/backends/ibis/builtin_checks.py Added helper function for mixed-unit intervals and updated docs.
Comments suppressed due to low confidence (2)

tests/polars/test_polars_builtin_checks.py:785

  • The docstring for test_not_equal_to_check (and similarly for test_greater_than_check, test_less_than_check, etc.) incorrectly states 'equal to the defined value'. It should describe the actual condition (e.g. 'not equal to', 'greater than', 'less than') to accurately reflect the test intent.
def test_not_equal_to_check(self, check_fn, datatype, data) -> None:

pandera/backends/ibis/builtin_checks.py:15

  • [nitpick] A new helper '_infer_interval_with_mixed_units' has been introduced; consider adding unit tests to verify its correct handling of timedelta objects with mixed units.
def _infer_interval_with_mixed_units(value: Any) -> Any:

:param value: This value must not occur in the checked data structure.
"""
value = _infer_interval_with_mixed_units(value)
return data.table[data.key] != value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally, built-in checks support both validating a single column (via the data.key attribute) and the entire table.

@cosmicBboy
Copy link
Collaborator

unrelated to this PR, but as I was running tests, I needed to install pyarrow_hotfix in my virtual environment. Is this something we need to add to the ibis extras dependency?

@cosmicBboy
Copy link
Collaborator

hey @deepyaman just added a commit to this PR: 8bc1970

Feel free to revert it / improve upon it. It basically makes sure the output of built in ibis checks support both checking single columns in a table or all of the columns in a table (in the case of table-level checks)

@deepyaman
Copy link
Collaborator Author

unrelated to this PR, but as I was running tests, I needed to install pyarrow_hotfix in my virtual environment. Is this something we need to add to the ibis extras dependency?

It's a dependency of every backend: https://github.com/ibis-project/ibis/blob/main/pyproject.toml

I think the problem is that the dev dependencies don't include ibis-framework[duckdb] (which is installed via Nox in CI), so that was probably an oversight.

@deepyaman
Copy link
Collaborator Author

unrelated to this PR, but as I was running tests, I needed to install pyarrow_hotfix in my virtual environment. Is this something we need to add to the ibis extras dependency?

It's a dependency of every backend: https://github.com/ibis-project/ibis/blob/main/pyproject.toml

I think the problem is that the dev dependencies don't include ibis-framework[duckdb] (which is installed via Nox in CI), so that was probably an oversight.

@cosmicBboy Actually, it's done here: e1afe02

I recall struggling quite a bit getting the generated requirements to line up, and it seems like this is how ended up on it. Maybe can make a separate PR to ibis-dev to try and add in the ibis-framework[duckdb] requirement in dev?

@deepyaman deepyaman force-pushed the feat/ibis/more-builtin-checks branch 4 times, most recently from 06b0325 to a7fddea Compare April 27, 2025 16:27
@deepyaman deepyaman force-pushed the feat/ibis/more-builtin-checks branch from a7fddea to e905a7b Compare April 27, 2025 16:31
@deepyaman
Copy link
Collaborator Author

hey @deepyaman just added a commit to this PR: 8bc1970

Feel free to revert it / improve upon it. It basically makes sure the output of built in ibis checks support both checking single columns in a table or all of the columns in a table (in the case of table-level checks)

Makes sense! I've updated it in 0ff0fd5 to fix the failing test. I also think it's better in that it doesn't require the user to be aware of check_col_name/renaming.

For the checks that return tables, I've tried to be consistent with other backends in their definition, in that they return the subset of columns with boolean check results—not the full table with those columns appended.

The one thing which was a bit challenging was being able to add these columns, but I've done it using a positional join, falling back to row_number() if that doesn't exist. @cpcloud @NickCrews in case you see this, would love to sanity check—should using row_number() like this to add additional computed columns (in these cases, boolean checks) be safe in terms of giving a consistent row order? It looks like that's the recommendation in the linked issue, but I wasn't sure if this should always work logically (doing a simple test using sqlite was fine locally). Part in question: https://github.com/deepyaman/pandera/blob/e905a7b815f6ba08bf3516a4d7ce5cfa4506c308/pandera/backends/ibis/checks.py#L70-L87

@NickCrews
Copy link
Contributor

That makes me nervous because the check function could do all sorts of reordering. In general that won't be safe. If you want to put restrictions on the the check functions users provide then it could be fine, but that might be too restrictive for some users eg who want to do a window function. It also is going to be backend specific I think, eg I think duckdb tries to guarantee row order is maintained in a lot more places than a lot of other backends do. To be safe, I would add the __index__ column before the check function, then join on that afterwards. That's a little weird because then users check functions will have to expect this extra metadata column. No perfect solution.

@NickCrews
Copy link
Contributor

Try writing a really nasty check function eg with joins, window functions, etc, and I bet you will find a case that breaks on some backends.

@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Apr 28, 2025

@deepyaman the positional join makes sense. For backends that don't have it, what if we do the following:

Add an __idx__ column before invoking the check_fn here so that the join index is already defined. This way, the __idx__ column can remain consistent even of the check function does reordering.

The expectation should be that the check needs to return a table of boolean values for each applicable column and preserve the shape (number of rows) so that the check output is aligned 1-1 with the validated table. Just to be clear, the valid outputs of a check functions should be:

  • a scalar boolean: True if passed, False if failed
  • an index-aligned boolean column: this can produce failure cases since we know which values violated the check
  • an index-unaligned boolean column: this shouldn't be common, but this should be reduced to a scalar boolean with an all (at least one False value) operation since we don't know what the user has done to aggregate/reduce the number of elements of the check output. Therefore, we can't report any failure cases here.
  • an index-aligned boolean table: this can produce failure cases for rows that contains a violated check
  • an index-unaligned boolean table: this shouldn't be common, but this should also be reduced a scalar boolean with an all operation since we don't know what the user has dont to aggregate/reduce the dimensionality of the table. Therefore, we can't report any failure cases here.

We can add some assert statements to make sure the original table and the check function output are joinable via the __idx__ column

@deepyaman
Copy link
Collaborator Author

Add an __idx__ column before invoking the check_fn here so that the join index is already defined. This way, the __idx__ column can remain consistent even of the check function does reordering.

I agree the issue with this is what @NickCrews pointed out:

To be safe, I would add the __index__ column before the check function, then join on that afterwards. That's a little weird because then users check functions will have to expect this extra metadata column. No perfect solution.

In a sense, I'd rather not support returning a table from a check—or, perhaps, raise a warning and recommend returning a dict of columns instead—if this is risky, rather than expose the hack of needing to select the __idx__ column to the user. You can't even easily decorate this, because if the user doesn't select the column, you can't go inject it back at the end.

I think it's quite fair to say we don't support/recommend returning a table, because the resulting order could get messed up, given that we're supporting a class of backends where ordering isn't guaranteed. This is quite similar to why a lot of pandas-on-X interfaces don't necessarily support methods that don't make sense. Is there any real functionality loss here, anything where the user couldn't just return a dict of columns?

@deepyaman deepyaman force-pushed the feat/ibis/more-builtin-checks branch from 08ae457 to fd7f5e8 Compare April 28, 2025 05:11
@deepyaman deepyaman force-pushed the feat/ibis/more-builtin-checks branch from fd7f5e8 to bec7731 Compare May 18, 2025 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants