Support for logical data types #798
Conversation
* ENH: add duplicate detection to dataframeschema
* ENH: propagate duplicate colnames check to schemamodel
* Add getter setter property
* make schemamodel actually work, update __str__
* fix __repr__ as well
* fix incorrect default value
* black formatting has changed
* invert parameter naming convention
* address other PR comments
* fix doctests, comma in __str__
* maybe fix sphinx errors
* fix ci and mypy tests
* Update test_schemas.py
* fix lint

Co-authored-by: cosmicBboy <[email protected]>

* Make SchemaModel use class name, define own config
* fix
* fix
* fix
* fix tests
* fix lint and docs
* add test

Co-authored-by: cosmicBboy <[email protected]>

…unionai-oss#772)
* implement coercion-on-initialization
* pylint
* Update tests/core/test_model.py

Co-authored-by: Matt Richards <[email protected]>

* add documentation for pandas_engine.DateTime
* fix removed numpy_engine.Object doc

* Update filtering columns for performance reasons.
* Update pandera/schemas.py
* Update schemas.py
* Update schemas.py
* Bugfix in schemas.py

Co-authored-by: Niels Bantilan <[email protected]>

* make finding coerce failure cases faster
* fix tests
* remove unneeded import
* fix tests, coverage

* add support for pyspark.pandas, deprecate koalas
* update docs
* add type check in pandas generics
* update docs
* clean up ci
* fix mypy, generics
* fix generic hack
* improve coverage

* Add overloads to `to_yaml`
* Update schemas.py

Co-authored-by: Niels Bantilan <[email protected]>
Here is a copy/pastable example to play with:

```python
import pandas as pd
import pandera as pa
from decimal import Decimal
from pandera.engines import pandas_engine

data = pd.DataFrame({"col": [Decimal("999.99") for _ in range(3)] + ["foobar"]})
schema = pa.DataFrameSchema({"col": pa.Column(pandas_engine.Decimal(5, 2))})
schema.validate(data)
```
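As a rough illustration of what a `Decimal(5, 2)` check has to verify per value (this is a sketch, not pandera's actual implementation — the `fits_decimal` helper is invented for this example), each value must parse as a decimal with at most `precision - scale` integral digits and at most `scale` fractional digits:

```python
from decimal import Decimal, InvalidOperation


def fits_decimal(value, precision: int = 5, scale: int = 2) -> bool:
    """Check one value against a Decimal(precision, scale)-style constraint."""
    try:
        _sign, digits, exponent = Decimal(value).as_tuple()
    except (InvalidOperation, TypeError, ValueError):
        return False  # non-numeric values like "foobar" fail outright
    if not isinstance(exponent, int):
        return False  # NaN/Infinity carry non-integer exponents
    int_digits = len(digits) + exponent   # digits before the decimal point
    frac_digits = max(-exponent, 0)       # digits after the decimal point
    return int_digits <= precision - scale and frac_digits <= scale


results = [fits_decimal(v) for v in (Decimal("999.99"), "foobar", Decimal("1000.00"))]
# results == [True, False, False]
```

In the example above, the three `Decimal("999.99")` values would pass while `"foobar"` would be reported as a failure case.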
This looks good overall @jeffzi ! We might consider a design along these lines:

```python
# dtypes module
class DataType(ABC):
    ...
    # basically the same as the current check method
    def check_dtype(self, pandera_dtype): ...

class LogicalDataType(DataType):
    ...
    # logical data types define an additional method
    def check_value(self, data_container): ...
```
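A runnable sketch of what that split could look like; everything below (the `IPAddress` type, the `object`-dtype assumption, the method bodies) is illustrative and not pandera's actual API:

```python
from abc import ABC, abstractmethod
import ipaddress

import pandas as pd


class DataType(ABC):
    @abstractmethod
    def check_dtype(self, pandera_dtype) -> bool:
        """Compare physical dtypes, like the current check method."""


class LogicalDataType(DataType):
    @abstractmethod
    def check_value(self, data_container: pd.Series) -> pd.Series:
        """Return a boolean mask of valid values, like a check function."""


class IPAddress(LogicalDataType):
    """A logical type: physical ``object`` dtype plus a value-level check."""

    def check_dtype(self, pandera_dtype) -> bool:
        return str(pandera_dtype) == "object"

    def check_value(self, data_container: pd.Series) -> pd.Series:
        def is_ip(value) -> bool:
            try:
                ipaddress.ip_address(value)
                return True
            except ValueError:
                return False

        return data_container.map(is_ip)


mask = IPAddress().check_value(pd.Series(["127.0.0.1", "::1", "foobar"]))
# mask is a boolean Series: True, True, False
```

The mask-returning `check_value` is what makes failure-case reporting possible, since invalid rows can be recovered by indexing with the inverted mask.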
#807 also seems like a good use case for this... there seems to be a regression in the behavior reported there.
I wanted to avoid a test-only workaround. Another option is to go back to my initial idea. Some points to note:
@cosmicBboy Friendly ping :)
Agreed! Let's go for this option for now. When we learn more and have a better sense of how physical/logical dtypes work we can think about a cleaner (but breaking-change) interface... it's funny to think this project is getting mature enough that this kind of thing matters.
Thanks for your input. I agree not to break things now. Probably a decision to make before pandera hits 1.0 😎
* Adapt SchemaModel so that it can inherit from typing.Generic
* Extend SchemaModel to enable generic types in fields
* fix linter

Co-authored-by: Thomas Willems <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>

…nionai-oss#827)
* pyspark docs fixes
* fix koalas link to pyspark
* bump version 0.10.1
* fix pandas_engine.DateTime.coerce_value not consistent with coerce

Co-authored-by: cosmicBboy <[email protected]>
@cosmicBboy I'm having trouble with the modin-ray tests. They pass on my local machine but fail for python < 3.10 in the CI. Could you have a look please?

Other than that, I did not manage to make Decimal work with pyspark because pyarrow complains about Decimal in a series typed with object.

Let me know if the structure of the testing works for you and I'll add placeholders for logical type tests in tests/pyspark. In the future, I think a similar structure could work to test regular data types too.
Thanks @jeffzi, lemme take a look
just looking at the error messages: the failure seems to come from here, but why would this test be affected?

```python
@pytest.mark.parametrize("coerce", [True, False])
def test_dataframe_schema_case(coerce):
    """Test a simple schema case."""
    schema = pa.DataFrameSchema(
        {
            "int_column": pa.Column(int, pa.Check.ge(0)),
            "float_column": pa.Column(float, pa.Check.le(0)),
            "str_column": pa.Column(str, pa.Check.isin(list("abcde"))),
        },
        coerce=coerce,
    )
    mdf = mpd.DataFrame(
        {
            "int_column": range(10),
            "float_column": [float(-x) for x in range(10)],
            "str_column": list("aabbcceedd"),
        }
    )
    assert isinstance(schema.validate(mdf), mpd.DataFrame)  # <- failing assertion
```
I was able to reproduce the error in the CI.
looks like modin isn't happy when you do operations on empty series; case in point, that's where it's failing.
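One hedged workaround for this class of failure (plain pandas here; the `checked_output` helper is an invented name, not pandera's API) is to short-circuit value checks on empty inputs so the backend never runs an element-wise operation on an empty series:

```python
import pandas as pd


def checked_output(series: pd.Series, check) -> pd.Series:
    """Apply an element-wise check, skipping work entirely on empty input.

    Guarding like this keeps backends such as modin from executing
    operations on empty series, which is where the test above blows up.
    """
    if len(series) == 0:
        # Nothing to validate: return an empty, vacuously-valid boolean mask.
        return pd.Series([], dtype=bool)
    return check(series)


empty_mask = checked_output(pd.Series([], dtype=float), lambda s: s > 0)
full_mask = checked_output(pd.Series([1.0, -1.0]), lambda s: s > 0)
```

The guard changes nothing for non-empty data, so it is safe to apply uniformly across backends.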
hey @jeffzi is this PR ready for review? also, if you rebase onto …
Sorry for not following up earlier. Merging dev fixed the fastapi failures. The PR is ready for review if you don't mind the missing documentation. I will add it this weekend and we can finally wrap up the PR.
@cosmicBboy Added documentation. CI passes except for a slight codecov decrease.
Thanks @jeffzi! Lemme give it a final look
looking good! just updated the docs a little and left an inline comment re: tests for ValueError execution paths
Amazing @jeffzi ! 🚀
* add imports to fastapi docs
* Add option to disallow duplicate column names (#758)
* Make SchemaModel use class name, define own config (#761)
* implement coercion-on-initialization for DataFrame[SchemaModel] types (#772)
* update conda install instructions (#776)
* add documentation for pandas_engine.DateTime (#780)
* set default n_failure_cases to None (#784)
* Update filtering columns for performance reasons. (#777)
* implement pydantic model data type (#779)
* make finding coerce failure cases faster (#792)
* update docs for 0.10.0 (#795)
* add pyspark support, deprecate koalas (#793)
* Add overloads to `schema.to_yaml` (#790)
* add support for logical data types
* add initial support for decimal
* fix dtype check
* Feature: Add support for Generic to SchemaModel (#810)
* fix pandas_engine.DateTime.coerce_value not consistent with coerce (#827)
* Refactor logical type check method
* add logical types tests
* add back conftest
* fix test_invalid_annotations
* fix ray initialization in setup_modin_engine
* fix logical type validation when output is an iterable
* add Decimal data type to pandera.__init__
* remove DataType.is_logical
* add logical types documentation
* Update dtypes.rst
* increase coverage
* fix SchemaErrors.failure_cases with logical types
* fix modin compatibility for logical type validation
* fix prepare_series_check_output compatibility with pyspark
* fix mypy error

Co-authored-by: cosmicBboy <[email protected]>
Co-authored-by: Matt Richards <[email protected]>
Co-authored-by: Sean Mackesey <[email protected]>
Co-authored-by: Ferdinand Hahmann <[email protected]>
Co-authored-by: Robert Craigie <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: Thomas Willems <[email protected]>
Fixes #788
So far pandera only supports physical data types, i.e. types that have a 1-1 relationship with pandas/numpy dtypes. Logical data types represent an abstracted understanding of the data.

The use cases are:

* Logical types that consist of a pandas dtype plus a check on values: IP addresses, URLs, paths. These can currently be validated with a `Check`, but coercion is not possible.
* Dtypes unofficially supported by pandas: date, decimal, etc. Another example is the new `PydanticModel` introduced in #779.

This PR is a proof-of-concept. I added an attribute `DataType.is_logical`. When `True`, we expect `DataType.check` to return a mask of valid data, similar to the output of a check function. This is necessary to report failure cases, which was impossible in my initial proposal #788.

@cosmicBboy I tested this approach with the Decimal data type. I'd like to have your opinion before cleaning up the code and adding robust tests. I played with returning a `Check` or `CheckResult` class but did not find it very elegant.
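To make the mask-returning behavior concrete, here is a minimal sketch (the helper name and column labels are assumptions for illustration, not pandera's actual API) of how a validity mask from `DataType.check` can be turned into reportable failure cases:

```python
from decimal import Decimal

import pandas as pd


def collect_failure_cases(series: pd.Series, valid_mask: pd.Series) -> pd.DataFrame:
    """Turn a boolean validity mask into a table of failure cases."""
    # Rows where the mask is False are the failures; keep the original
    # index so the error report can point back at the offending rows.
    invalid = series[~valid_mask]
    return pd.DataFrame({"index": invalid.index, "failure_case": invalid.to_numpy()})


data = pd.Series([Decimal("999.99"), Decimal("999.99"), "foobar"])
mask = pd.Series([True, True, False])
fc = collect_failure_cases(data, mask)
# fc has one row: index 2, failure_case "foobar"
```

This is exactly what a plain boolean-returning check cannot provide: without the per-value mask, there is no way to say *which* rows failed.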