-
-
Notifications
You must be signed in to change notification settings - Fork 336
Implement more Ibis data types and built-in checks #1906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: ibis-dev
Are you sure you want to change the base?
Implement more Ibis data types and built-in checks #1906
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## ibis-dev #1906 +/- ##
============================================
+ Coverage 93.13% 93.15% +0.02%
============================================
Files 134 134
Lines 9831 9939 +108
============================================
+ Hits 9156 9259 +103
- Misses 675 680 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
94739cd
to
5352be0
Compare
Update: Currently struggling with getting timestamp tests to work, but I just discovered |
d6e340c
to
7050f47
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
fe326db
to
8b67b07
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
5bf40bc
to
43b97ff
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces additional Ibis data types and updates built‐in check implementations and their documentation across multiple backends. Key changes include:
- Corrections and clarifications in docstrings for test functions and backend check functions.
- Addition of new Ibis engine data types (e.g. UInt8, UInt16, Date, DateTime, Time, Timedelta) and corresponding built‐in checks.
- Minor refactoring in error message text and section headers for improved clarity.
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
tests/polars/test_polars_builtin_checks.py | Updated docstrings in test functions for consistency and clarity. |
pandera/engines/polars_engine.py | Revised docstring to correctly describe conversion from polars.dtype. |
pandera/engines/pandas_engine.py | Changed section header from "time" to "temporal". |
pandera/engines/ibis_engine.py | Added several new Ibis data type classes and updated formatting. |
pandera/backends/pyspark/builtin_checks.py | Improved documentation of parameters in built‐in check functions. |
pandera/backends/polars/builtin_checks.py | Refinements in docstrings for parameter descriptions. |
pandera/backends/pandas/builtin_checks.py | Docstring adjustments for consistency in check functions. |
pandera/backends/ibis/builtin_checks.py | Added helper function for mixed-unit intervals and updated docs. |
Comments suppressed due to low confidence (2)
tests/polars/test_polars_builtin_checks.py:785
- The docstring for test_not_equal_to_check (and similarly for test_greater_than_check, test_less_than_check, etc.) incorrectly states 'equal to the defined value'. It should describe the actual condition (e.g. 'not equal to', 'greater than', 'less than') to accurately reflect the test intent.
def test_not_equal_to_check(self, check_fn, datatype, data) -> None:
pandera/backends/ibis/builtin_checks.py:15
- [nitpick] A new helper '_infer_interval_with_mixed_units' has been introduced; consider adding unit tests to verify its correct handling of timedelta objects with mixed units.
def _infer_interval_with_mixed_units(value: Any) -> Any:
:param value: This value must not occur in the checked data structure. | ||
""" | ||
value = _infer_interval_with_mixed_units(value) | ||
return data.table[data.key] != value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally, built-in checks support both validating a single column (via the data.key
attribute) and the entire table.
unrelated to this PR, but as I was running tests, I needed to install |
hey @deepyaman just added a commit to this PR: 8bc1970 Feel free to revert it / improve upon it. It basically makes sure the output of built in ibis checks support both checking single columns in a table or all of the columns in a table (in the case of table-level checks) |
It's a dependency of every backend: https://github.com/ibis-project/ibis/blob/main/pyproject.toml I think the problem is that the dev dependencies don't include |
@cosmicBboy Actually, it's done here: e1afe02 I recall struggling quite a bit getting the generated requirements to line up, and it seems like this is how ended up on it. Maybe can make a separate PR to |
Co-authored-by: Copilot <[email protected]> Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: cosmicBboy <[email protected]>
06b0325
to
a7fddea
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
a7fddea
to
e905a7b
Compare
Makes sense! I've updated it in 0ff0fd5 to fix the failing test. I also think it's better in that it doesn't require the user to be aware of For the checks that return tables, I've tried to be consistent with other backends in their definition, in that they return the subset of columns with boolean check results—not the full table with those columns appended. The one thing which was a bit challenging was being able to add these columns, but I've done it using a positional join, falling back to |
That makes me nervous because the check function could do all sorts of reordering. In general that won't be safe. If you want to put restrictions on the the check functions users provide then it could be fine, but that might be too restrictive for some users eg who want to do a window function. It also is going to be backend specific I think, eg I think duckdb tries to guarantee row order is maintained in a lot more places than a lot of other backends do. To be safe, I would add the |
Try writing a really nasty check function eg with joins, window functions, etc, and I bet you will find a case that breaks on some backends. |
@deepyaman the positional join makes sense. For backends that don't have it, what if we do the following: Add an The expectation should be that the check needs to return a table of boolean values for each applicable column and preserve the shape (number of rows) so that the check output is aligned 1-1 with the validated table. Just to be clear, the valid outputs of a check functions should be:
We can add some assert statements to make sure the original table and the check function output are joinable via the |
I agree the issue with this is what @NickCrews pointed out:
In a sense, I'd rather not support returning a table from a check—or, perhaps, raise a warning and recommend returning a I think it's quite fair to say we don't support/recommend returning a table, because the resulting order could get messed up, given that we're supporting a class of backends where ordering isn't guaranteed. This is quite similar to why a lot of pandas-on-X interfaces don't necessarily support methods that don't make sense. Is there any real functionality loss here, anything where the user couldn't just return a dict of columns? |
08ae457
to
fd7f5e8
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
fd7f5e8
to
bec7731
Compare
Moved to draft, as I can just work off of this to add types and checks.Update: I'm still working on more built-in checks (onbetween
now), but those can just as easily be additional PRs!Will probably rebase-merge instead of squash-merging, since I'll keep the commit history as clean as possible.