Implement more Ibis data types and built-in checks #1906

deepyaman · 2025-02-08T17:56:11Z

~~Moved to draft, as I can just work off of this to add types and checks.~~ Update: I'm still working on more built-in checks (I'm on between right now), but those can just as easily go in additional PRs!

Will probably rebase-merge instead of squash-merging, since I'll keep the commit history as clean as possible.

codecov · 2025-02-08T17:57:27Z

Codecov Report

Attention: Patch coverage is 96.63866% with 4 lines in your changes missing coverage. Please review.

Project coverage is 93.15%. Comparing base (e1afe02) to head (bec7731).

Files with missing lines	Patch %	Lines
pandera/backends/ibis/checks.py	62.50%	3 Missing ⚠️
pandera/engines/ibis_engine.py	98.33%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           ibis-dev    #1906      +/-   ##
============================================
+ Coverage     93.13%   93.15%   +0.02%     
============================================
  Files           134      134              
  Lines          9831     9939     +108     
============================================
+ Hits           9156     9259     +103     
- Misses          675      680       +5

deepyaman · 2025-02-08T18:55:55Z

~~Requires #1907 to pass~~ Merged and rebased

deepyaman · 2025-02-14T17:49:21Z

Update: Currently struggling with getting timestamp tests to work, but I just discovered from_parametrized_dtype(), so hopefully I can unblock myself!

Copilot

Pull Request Overview

This PR introduces additional Ibis data types and updates built‐in check implementations and their documentation across multiple backends. Key changes include:

Corrections and clarifications in docstrings for test functions and backend check functions.
Addition of new Ibis engine data types (e.g. UInt8, UInt16, Date, DateTime, Time, Timedelta) and corresponding built‐in checks.
Minor refactoring in error message text and section headers for improved clarity.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

File	Description
tests/polars/test_polars_builtin_checks.py	Updated docstrings in test functions for consistency and clarity.
pandera/engines/polars_engine.py	Revised docstring to correctly describe conversion from polars.dtype.
pandera/engines/pandas_engine.py	Changed section header from "time" to "temporal".
pandera/engines/ibis_engine.py	Added several new Ibis data type classes and updated formatting.
pandera/backends/pyspark/builtin_checks.py	Improved documentation of parameters in built‐in check functions.
pandera/backends/polars/builtin_checks.py	Refinements in docstrings for parameter descriptions.
pandera/backends/pandas/builtin_checks.py	Docstring adjustments for consistency in check functions.
pandera/backends/ibis/builtin_checks.py	Added helper function for mixed-unit intervals and updated docs.

Comments suppressed due to low confidence (2)

tests/polars/test_polars_builtin_checks.py:785

The docstring for test_not_equal_to_check (and similarly for test_greater_than_check, test_less_than_check, etc.) incorrectly states 'equal to the defined value'. It should describe the actual condition (e.g. 'not equal to', 'greater than', 'less than') to accurately reflect the test intent.

def test_not_equal_to_check(self, check_fn, datatype, data) -> None:

pandera/backends/ibis/builtin_checks.py:15

[nitpick] A new helper '_infer_interval_with_mixed_units' has been introduced; consider adding unit tests to verify its correct handling of timedelta objects with mixed units.

def _infer_interval_with_mixed_units(value: Any) -> Any:

cosmicBboy · 2025-04-25T18:39:23Z

pandera/backends/ibis/builtin_checks.py

+    :param value: This value must not occur in the checked data structure.
+    """
+    value = _infer_interval_with_mixed_units(value)
+    return data.table[data.key] != value


Ideally, built-in checks should support validating both a single column (via the data.key attribute) and the entire table.
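
A minimal sketch of what supporting both cases could look like for the not-equal-to check, using plain Python lists as a stand-in for Ibis column expressions. `CheckData` and this `not_equal_to` are hypothetical illustrations, not pandera's actual implementation: when `key` is set only that column is checked, and when it is `None` every column is.

```python
from typing import Any, NamedTuple, Optional


class CheckData(NamedTuple):
    """Hypothetical stand-in for the object passed to a built-in check."""

    table: dict[str, list[Any]]  # column name -> values
    key: Optional[str] = None  # None means "check the whole table"


def not_equal_to(data: CheckData, value: Any) -> dict[str, list[bool]]:
    """Return one boolean result column per checked column."""
    columns = [data.key] if data.key is not None else list(data.table)
    return {name: [v != value for v in data.table[name]] for name in columns}


table = {"a": [1, 2, 3], "b": [2, 2, 4]}
column_result = not_equal_to(CheckData(table, key="a"), 2)  # column-level check
table_result = not_equal_to(CheckData(table), 2)  # table-level check
```

In both cases the output contains only boolean result columns, so the caller can treat column-level and table-level checks uniformly.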

cosmicBboy · 2025-04-25T18:42:21Z

unrelated to this PR, but as I was running tests, I needed to install pyarrow_hotfix in my virtual environment. Is this something we need to add to the ibis extras dependency?

cosmicBboy · 2025-04-26T02:27:25Z

hey @deepyaman just added a commit to this PR: 8bc1970

Feel free to revert it / improve upon it. It basically makes sure the output of the built-in Ibis checks supports checking either a single column in a table or all of the columns in a table (in the case of table-level checks).

deepyaman · 2025-04-26T04:29:54Z

> unrelated to this PR, but as I was running tests, I needed to install pyarrow_hotfix in my virtual environment. Is this something we need to add to the ibis extras dependency?

It's a dependency of every backend: https://github.com/ibis-project/ibis/blob/main/pyproject.toml

I think the problem is that the dev dependencies don't include ibis-framework[duckdb] (which is installed via Nox in CI), so that was probably an oversight.

deepyaman · 2025-04-26T15:39:18Z

> > unrelated to this PR, but as I was running tests, I needed to install pyarrow_hotfix in my virtual environment. Is this something we need to add to the ibis extras dependency?
>
> It's a dependency of every backend: https://github.com/ibis-project/ibis/blob/main/pyproject.toml
>
> I think the problem is that the dev dependencies don't include ibis-framework[duckdb] (which is installed via Nox in CI), so that was probably an oversight.

@cosmicBboy Actually, it's done here: e1afe02

I recall struggling quite a bit to get the generated requirements to line up, and it seems like this is where I ended up. Maybe a separate PR to ibis-dev could try to add the ibis-framework[duckdb] requirement in dev?

deepyaman · 2025-04-27T16:46:13Z

> hey @deepyaman just added a commit to this PR: 8bc1970
>
> Feel free to revert it / improve upon it. It basically makes sure the output of the built-in Ibis checks supports checking either a single column in a table or all of the columns in a table (in the case of table-level checks).

Makes sense! I've updated it in 0ff0fd5 to fix the failing test. I also think it's better in that it doesn't require the user to be aware of check_col_name/renaming.

For the checks that return tables, I've tried to be consistent with other backends in their definition, in that they return the subset of columns with boolean check results—not the full table with those columns appended.

The one thing which was a bit challenging was being able to add these columns, but I've done it using a positional join, falling back to row_number() if that doesn't exist. @cpcloud @NickCrews in case you see this, would love to sanity check—should using row_number() like this to add additional computed columns (in these cases, boolean checks) be safe in terms of giving a consistent row order? It looks like that's the recommendation in the linked issue, but I wasn't sure if this should always work logically (doing a simple test using sqlite was fine locally). Part in question: https://github.com/deepyaman/pandera/blob/e905a7b815f6ba08bf3516a4d7ce5cfa4506c308/pandera/backends/ibis/checks.py#L70-L87

NickCrews · 2025-04-28T00:16:42Z

That makes me nervous because the check function could do all sorts of reordering. In general that won't be safe. If you want to put restrictions on the check functions users provide then it could be fine, but that might be too restrictive for some users, e.g. those who want to use a window function. It is also going to be backend-specific, I think; e.g. I think duckdb tries to guarantee that row order is maintained in a lot more places than many other backends do. To be safe, I would add the __index__ column before the check function, then join on that afterwards. That's a little weird because then users' check functions will have to expect this extra metadata column. No perfect solution.

NickCrews · 2025-04-28T00:19:05Z

Try writing a really nasty check function, e.g. with joins, window functions, etc., and I bet you will find a case that breaks on some backends.
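
As a toy illustration of the concern above, here is a pure-Python sketch (lists of dicts standing in for a table, not real Ibis code) of a check that reorders its input. The `__idx__` column name and the `reordering_check` function are assumptions made for the example. Pairing input and output rows by position reports the wrong failure case, while joining on the index column recovers the right one.

```python
rows = [
    {"__idx__": 0, "x": 5},
    {"__idx__": 1, "x": -1},
    {"__idx__": 2, "x": 3},
]


def reordering_check(table):
    """A check that sorts its input before computing the boolean result."""
    ordered = sorted(table, key=lambda row: row["x"])
    return [{"__idx__": r["__idx__"], "ok": r["x"] > 0} for r in ordered]


result = reordering_check(rows)

# Positional alignment pairs input row i with output row i, which is wrong here.
positional_failures = [row["x"] for row, out in zip(rows, result) if not out["ok"]]

# Joining on the index column recovers the right rows regardless of output order.
ok_by_idx = {out["__idx__"]: out["ok"] for out in result}
joined_failures = [row["x"] for row in rows if not ok_by_idx[row["__idx__"]]]
```

Note that the join only works because the check carries `__idx__` through to its output, which is exactly the extra burden on user-written checks mentioned above.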

cosmicBboy · 2025-04-28T01:58:33Z

@deepyaman the positional join makes sense. For backends that don't have it, what if we do the following:

Add an __idx__ column before invoking the check_fn here so that the join index is already defined. This way, the __idx__ column can remain consistent even if the check function does reordering.

The expectation should be that the check needs to return a table of boolean values for each applicable column and preserve the shape (number of rows) so that the check output is aligned 1-1 with the validated table. Just to be clear, the valid outputs of a check function should be:

a scalar boolean: True if passed, False if failed
an index-aligned boolean column: this can produce failure cases since we know which values violated the check
an index-unaligned boolean column: this shouldn't be common, but this should be reduced to a scalar boolean with an all operation (which fails if there is at least one False value), since we don't know what the user has done to aggregate/reduce the number of elements of the check output. Therefore, we can't report any failure cases here.
an index-aligned boolean table: this can produce failure cases for rows that contain a violated check
an index-unaligned boolean table: this shouldn't be common, but this should also be reduced to a scalar boolean with an all operation, since we don't know what the user has done to aggregate/reduce the dimensionality of the table. Therefore, we can't report any failure cases here.

We can add some assert statements to make sure the original table and the check function output are joinable via the __idx__ column
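
The five output kinds listed above could be handled along these lines. This is a rough pure-Python sketch, not pandera code: `summarize` is a hypothetical helper, boolean lists stand in for columns, a dict of lists stands in for a table, and "aligned" is approximated by comparing row counts, whereas the proposal here would verify joinability on the __idx__ column.

```python
from typing import Optional, Union

Column = list[bool]
CheckOutput = Union[bool, Column, dict[str, Column]]


def summarize(output: CheckOutput, n_rows: int) -> tuple[bool, Optional[list[int]]]:
    """Return (passed, failing row positions), or None when not reportable."""
    if isinstance(output, bool):
        # Scalar boolean: nothing to report beyond pass/fail.
        return output, None
    columns = list(output.values()) if isinstance(output, dict) else [output]
    passed = all(all(col) for col in columns)
    if any(len(col) != n_rows for col in columns):
        # Unaligned: reduce with all(); failure cases cannot be reported.
        return passed, None
    # Aligned: a row fails if any of its checked columns is False.
    failing = [i for i in range(n_rows) if not all(col[i] for col in columns)]
    return passed, failing
```

For example, `summarize([True, False, True], 3)` reports row 1 as a failure case, while `summarize([True, False], 3)` can only say that the check failed.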

deepyaman · 2025-04-28T02:47:14Z

> Add an __idx__ column before invoking the check_fn here so that the join index is already defined. This way, the __idx__ column can remain consistent even if the check function does reordering.

I agree the issue with this is what @NickCrews pointed out:

> To be safe, I would add the __index__ column before the check function, then join on that afterwards. That's a little weird because then users' check functions will have to expect this extra metadata column. No perfect solution.

In a sense, I'd rather not support returning a table from a check—or, perhaps, raise a warning and recommend returning a dict of columns instead—if this is risky, rather than expose the hack of needing to select the __idx__ column to the user. You can't even easily decorate this, because if the user doesn't select the column, you can't go inject it back at the end.

I think it's quite fair to say we don't support/recommend returning a table, because the resulting order could get messed up, given that we're supporting a class of backends where ordering isn't guaranteed. This is quite similar to why a lot of pandas-on-X interfaces don't necessarily support methods that don't make sense. Is there any real functionality loss here, anything where the user couldn't just return a dict of columns?
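
To make the dict-of-columns idea concrete, here is a small pure-Python sketch. All names are hypothetical, and a per-row callable stands in for a deferred column expression bound to the original table. Because each expression is evaluated against the original rows, no join or row-order assumption is involved.

```python
from typing import Any, Callable

Row = dict[str, Any]
ColumnExpr = Callable[[Row], bool]  # hypothetical stand-in for a deferred column expression


def in_range_check(columns: list[str], lo: float, hi: float) -> dict[str, ColumnExpr]:
    """Return one deferred boolean expression per column instead of a new table."""
    return {name: (lambda row, name=name: lo <= row[name] <= hi) for name in columns}


def apply_check(rows: list[Row], exprs: dict[str, ColumnExpr]) -> list[Row]:
    """Evaluate each expression against each original row, in the original order."""
    return [{name: expr(row) for name, expr in exprs.items()} for row in rows]


rows = [{"a": 1, "b": 10}, {"a": 7, "b": 3}]
results = apply_check(rows, in_range_check(["a", "b"], 0, 5))
```

The trade-off is that a check expressed this way cannot change the number of rows, which is the restriction being weighed in this thread.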

deepyaman force-pushed the feat/ibis/more-builtin-checks branch 2 times, most recently from 94739cd to 5352be0 Compare February 9, 2025 21:11

deepyaman marked this pull request as draft February 9, 2025 21:11

deepyaman changed the title ~~Implement "ne" built-in check for the Ibis backend~~ Implement more Ibis data types and built-in checks Feb 9, 2025

deepyaman changed the title ~~Implement more Ibis data types and built-in checks~~ [WIP] Implement more Ibis data types and built-in checks Feb 9, 2025

deepyaman force-pushed the feat/ibis/more-builtin-checks branch 2 times, most recently from d6e340c to 7050f47 Compare March 10, 2025 17:51

deepyaman force-pushed the ibis-dev branch from b2fb624 to e1afe02 Compare April 6, 2025 19:54

deepyaman added 7 commits April 6, 2025 13:58

Implement "ne" built-in check for the Ibis backend

85c06ba

Signed-off-by: Deepyaman Datta <[email protected]>

Implement int/uint/float types, except for float16

13485d0

Signed-off-by: Deepyaman Datta <[email protected]>

Implement timestamp type, and test built-in checks

f2f61d1

Signed-off-by: Deepyaman Datta <[email protected]>

Support built-in checks for interval-typed columns

951c6f1

Signed-off-by: Deepyaman Datta <[email protected]>

Blacken pandera/engines/ibis_engine.py module code

4ef8eb6

Signed-off-by: Deepyaman Datta <[email protected]>

Implement dt.Date type, and test built-in checks

b915c5e

Signed-off-by: Deepyaman Datta <[email protected]>

Implement dt.Time type, and test built-in checks

8b67b07

Signed-off-by: Deepyaman Datta <[email protected]>

deepyaman force-pushed the feat/ibis/more-builtin-checks branch from fe326db to 8b67b07 Compare April 6, 2025 19:59

Implement gt and ge check for the Ibis backend

43b97ff

Signed-off-by: Deepyaman Datta <[email protected]>

deepyaman force-pushed the feat/ibis/more-builtin-checks branch from 5bf40bc to 43b97ff Compare April 7, 2025 03:48

deepyaman added 3 commits April 20, 2025 07:54

Standardize docstrings, don't say "data container"

86fe561

Signed-off-by: Deepyaman Datta <[email protected]>

Implement lt and le check for the Ibis backend

84fc587

Signed-off-by: Deepyaman Datta <[email protected]>

Fix "form" to "from", and align docstring for test

eca9d00

Signed-off-by: Deepyaman Datta <[email protected]>

deepyaman marked this pull request as ready for review April 24, 2025 17:31

deepyaman changed the title ~~[WIP] Implement more Ibis data types and built-in checks~~ Implement more Ibis data types and built-in checks Apr 24, 2025

deepyaman requested review from cosmicBboy and Copilot April 24, 2025 17:32

Copilot AI reviewed Apr 24, 2025

cosmicBboy reviewed Apr 25, 2025

deepyaman and others added 2 commits April 27, 2025 10:07

Apply suggestion from Copilot to fix error message

916283c

Co-authored-by: Copilot <[email protected]> Signed-off-by: Deepyaman Datta <[email protected]>

built-in checks support table-level checks

b735ccb

Signed-off-by: cosmicBboy <[email protected]>

deepyaman force-pushed the feat/ibis/more-builtin-checks branch 4 times, most recently from 06b0325 to a7fddea Compare April 27, 2025 16:27

Support table-level checks, including for built-in

e905a7b

Signed-off-by: Deepyaman Datta <[email protected]>

deepyaman force-pushed the feat/ibis/more-builtin-checks branch from a7fddea to e905a7b Compare April 27, 2025 16:31

deepyaman force-pushed the feat/ibis/more-builtin-checks branch from 08ae457 to fd7f5e8 Compare April 28, 2025 05:11

deepyaman commented Apr 28, 2025

Implement is_in_range check for the Ibis backend

bec7731

Signed-off-by: Deepyaman Datta <[email protected]>

deepyaman force-pushed the feat/ibis/more-builtin-checks branch from fd7f5e8 to bec7731 Compare May 18, 2025 20:26
