Support for logical data types #798


Merged: 39 commits, May 26, 2022
Conversation

@jeffzi (Collaborator) commented Mar 24, 2022

Fixes #788

So far, Pandera only supports physical data types, i.e. types that have a 1-to-1 relationship with pandas/numpy dtypes. Logical data types represent an abstracted understanding of the data.

The use-cases are:

  1. Logical types that consist of a pandas dtype + a check on values: IP addresses, URLs, paths. These can currently be validated with a Check, but coercion is not possible.

  2. Dtypes unofficially supported by pandas: date, decimal, etc. Another example is the new PydanticModel introduced in "implement pydantic model data type" #779.

This PR is a proof-of-concept. I added an attribute DataType.is_logical. When True, we expect DataType.check to return a mask of valid data, similar to the output of a check function. This is necessary to report failure cases, which was impossible in my initial proposal #788.

@cosmicBboy I tested this approach with the Decimal data type. I'd like to have your opinion before cleaning up the code and adding robust tests. I played with returning a Check or CheckResult class but did not find it very elegant.
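The proof-of-concept described above could be sketched roughly as follows. Only `DataType.is_logical` and the mask-returning `check` come from the PR description; the `DecimalLike` class and every other name here are hypothetical, for illustration only:

```python
import pandas as pd
from decimal import Decimal, InvalidOperation


class DataType:
    is_logical = False  # physical dtypes: 1-1 mapping to a pandas/numpy dtype

    def check(self, pandera_dtype, data_container=None):
        # physical check: compare dtypes only, return a scalar bool
        return self == pandera_dtype


class DecimalLike(DataType):
    is_logical = True  # when True, check returns a mask of valid values

    def check(self, pandera_dtype, data_container=None):
        # logical check: element-wise mask, so failure cases can be reported
        def is_decimal(value):
            try:
                Decimal(str(value))
                return True
            except InvalidOperation:
                return False

        return data_container.map(is_decimal)


mask = DecimalLike().check(None, pd.Series(["999.99", "foobar"]))
```

The element-wise mask is what allows the schema to report exactly which values failed, instead of a single pass/fail verdict.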

cosmicBboy and others added 15 commits February 10, 2022 08:41
* ENH: add duplicate detection to dataframeschema

* ENH: propagate duplicate colnames check to schemamodel

* Add getter setter property

* make schemamodel actually work, update __str__

* fix __repr__ as well

* fix incorrect default value

* black formatting has changed

* invert parameter naming convention

* address other PR comments

* fix doctests, comma in __str__

* maybe fix sphinx errors

* fix ci and mypy tests

* Update test_schemas.py

* fix lint

Co-authored-by: cosmicBboy <[email protected]>
* Make SchemaModel use class name, define own config

* fix

* fix

* fix

* fix tests

* fix lint and docs

* add test

Co-authored-by: cosmicBboy <[email protected]>
…unionai-oss#772)

* implement coercion-on-initialization

* pylint

* Update tests/core/test_model.py

Co-authored-by: Matt Richards <[email protected]>

Co-authored-by: Matt Richards <[email protected]>
* add documentation for pandas_engine.DateTime

* fix removed numpy_engine.Object doc
* Update filtering columns for performance reasons.

* Update pandera/schemas.py

* Update schemas.py

* Update schemas.py

* Bugfix in schemas.py

Co-authored-by: Niels Bantilan <[email protected]>
* make finding coerce failure cases faster

* fix tests

* remove unneeded import

* fix tests, coverage
* add support for pyspark.pandas, deprecate koalas

* update docs

* add type check in pandas generics

* update docs

* clean up ci

* fix mypy, generics

* fix generic hack

* improve coverage
* Add overloads to `to_yaml`

* Update schemas.py

Co-authored-by: Niels Bantilan <[email protected]>
@jeffzi (Collaborator, Author) commented Mar 24, 2022

Here is a copy/pastable example to play with:

import pandas as pd
import pandera as pa
from decimal import Decimal
from pandera.engines import pandas_engine


data = pd.DataFrame({"col": [Decimal("999.99") for _ in range(3)] + ["foobar"]})

schema = pa.DataFrameSchema({"col": pa.Column(pandas_engine.Decimal(5, 2))})
schema.validate(data)  # expected to flag "foobar" as a failure case

@cosmicBboy (Collaborator)

This looks good overall @jeffzi !

We might consider a DataType subclass for LogicalDataType... I think it might provide a cleaner interface for the check method. Also, perhaps two separate methods for clarity might make sense: check_dtype for the current check interface, and check_value for checking an actual value.

# dtypes module

class DataType(ABC):
    # basically the same as the current check method
    def check_dtype(self, pandera_dtype): ...

class LogicalDataType(DataType):
    # logical data types define an additional method
    def check_value(self, data_container): ...

@cosmicBboy (Collaborator)

#807 also seems like a good use case for this... there seems to be a regression in the str type-checking behavior.

Before, str was special-cased since it used the numpy object datatype to represent strings... so pandera would actually check the values of the object array to make sure all the values were actually strings and not other types.
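The old special-casing described here amounts to something like the following sketch (hypothetical, not pandera's actual implementation): for an object-dtype array, inspect each value rather than trusting the dtype alone.

```python
import pandas as pd


def check_str_column(series: pd.Series) -> pd.Series:
    """Return an element-wise mask: True where the value is actually a str."""
    if series.dtype != object:
        # this sketch only handles the numpy object-dtype representation
        return pd.Series(False, index=series.index)
    return series.map(lambda value: isinstance(value, str))


mask = check_str_column(pd.Series(["a", 1, "b"]))
```

This is exactly the "physical dtype + value check" pattern of a logical data type: the object dtype alone says nothing about whether every element is a string.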

@jeffzi (Collaborator, Author) commented Mar 31, 2022

We might consider a DataType subclass for LogicalDataType.

I wanted to avoid a test isinstance(self.dtype, LogicalDataType), which is less pythonic imho... yeah I know 😓. Another issue is that check_dtype is a prerequisite for check_value (e.g. Decimal requires the object dtype, IP requires str), but you'd still need to call check_value as well for complete validation. I would prefer DataFrameSchema to not know about the subtleties of data types.

Another option is to go back to my initial idea: def check(self, pandera_dtype: DataType, data_container: Optional[Any] = None) -> Union[bool, Sequence[bool]]. DataFrameSchema can pass the data container every time and produce an appropriate SchemaError depending on the results (wrapping failure cases and an appropriate error message).

Some points to note:

  1. This design does not make the data_container mandatory for logical types even though we know it is necessary. pandera_dtype can also be inferred from data_container (currently done prior to calling check). We could change the signature to check(self, data_container). It would require refactoring internal pandera tests. It's a breaking change to the public API but should only affect a minority of users. It would still require a deprecation warning one version ahead, just in case.

  2. It also does not force check to return a sequence of booleans for logical types. That said, it makes sense to allow returning a scalar bool if a data type is costly to validate and/or the volume of data is high and the user does not care about failure cases.
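The option discussed above could look like this minimal sketch (hypothetical simplification, not pandera's actual implementation):

```python
from typing import Any, Optional, Sequence, Union


class DataType:
    def check(
        self,
        pandera_dtype: "DataType",
        data_container: Optional[Any] = None,
    ) -> Union[bool, Sequence[bool]]:
        # Physical dtypes can ignore data_container and return a scalar bool;
        # logical dtypes may inspect the data and return an element-wise mask.
        return self == pandera_dtype


dt = DataType()
result = dt.check(dt)  # scalar bool: no value-level validation needed
```

The union return type captures both points of the list: the caller inspects the result and only builds failure cases when it receives a sequence.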

@jeffzi (Collaborator, Author) commented Apr 7, 2022

@cosmicBboy Friendly ping :)

@cosmicBboy (Collaborator)

I would prefer DataFrameSchema to not know about the subtleties of data types.

Agreed!

Another option is to go back to my initial idea def check(self, pandera_dtype: DataType, data_container: Optional[Any] = None) -> Union[bool, Sequence[bool]]

Let's go for this option for now. check(self, data_container) is cleaner in principle, but I'd prioritize no breaking changes over cleanliness, and I think it's fairly intuitive to have an optional arg for the data_container in the case that the physical datatype isn't enough to verify the logical datatype.

When we learn more and have a better sense of how physical/logical dtypes work we can think about a cleaner (but breaking-change) interface... it's funny to think this project is getting mature enough where 1.0.0 is something to think about in the medium term 😅.

@jeffzi (Collaborator, Author) commented Apr 7, 2022

Thanks for your input. I agree not to break things for now. Probably a decision to make before pandera hits 1.0 😎

tfwillems and others added 5 commits April 19, 2022 08:28
* Adapt SchemaModel so that it can inherit from typing.Generic

* Extend SchemaModel to enable generic types in fields

* fix linter

Co-authored-by: Thomas Willems <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>
…nionai-oss#827)

* pyspark docs fixes

* fix koalas link to pyspark

* bump version 0.10.1

* fix pandas_engine.DateTime.coerce_value not consistent with coerce

Co-authored-by: cosmicBboy <[email protected]>
@jeffzi (Collaborator, Author) commented May 8, 2022

@cosmicBboy I'm having trouble with the modin-ray tests.

They pass on my local machine but fail for Python < 3.10 in CI. I did move the setup_modin_engine fixture to tests/modin/conftest.py in order to share it with the logical dtype tests. I don't see how this would cause CI to fail. I'm probably missing something since I'm not very familiar with modin and ray.

Could you have a look please?

Other than that, I did not manage to make Decimal work with pyspark because pyarrow complains about Decimal in a series typed with object. As far as I can tell, we need a pandas UDF. I'd rather wait for the refactor of pandera's handling of data container libraries before going down that road.

Let me know if the testing structure works for you, and I'll add placeholders for logical type tests in tests/pyspark. In the future, I think a similar structure could work for testing regular data types too.

@cosmicBboy (Collaborator)

Thanks @jeffzi, lemme take a look

@cosmicBboy (Collaborator)

just looking at the error messages:

  File "/usr/share/miniconda/envs/pandera-dev/lib/python3.9/site-packages/pandas/core/apply.py", line 851, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/usr/share/miniconda/envs/pandera-dev/lib/python3.9/site-packages/pandas/core/apply.py", line 867, in apply_series_generator
    results[i] = self.f(v)
  File "/usr/share/miniconda/envs/pandera-dev/lib/python3.9/site-packages/pandas/core/frame.py", line 8922, in infer
    return lib.map_infer(x.astype(object)._values, func, ignore_na=ignore_na)
  File "pandas/_libs/lib.pyx", line 2870, in pandas._libs.lib.map_infer
  File "/usr/share/miniconda/envs/pandera-dev/lib/python3.9/site-packages/modin/pandas/series.py", line 1248, in <lambda>
    lambda s: arg(s)
  File "/home/runner/work/pandera/pandera/pandera/engines/pandas_engine.py", line 503, in coerce_value
    return dec.quantize(self._exp, context=self._ctx)
decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]

Seems like that dec.quantize call is the proximal cause of the error.
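The `InvalidOperation` above can be reproduced in isolation: `Decimal.quantize` raises it whenever the rounded result needs more significant digits than the context's precision allows. Assuming `coerce_value` quantizes with a fixed-precision context (as the traceback's `self._exp`/`self._ctx` suggests), an out-of-range value would trigger exactly this error:

```python
from decimal import Context, Decimal, InvalidOperation

ctx = Context(prec=5)  # e.g. Decimal(5, 2): at most 5 significant digits
raised = False
try:
    # 999999.99 needs 8 significant digits, so quantize cannot represent it
    Decimal("999999.99").quantize(Decimal("0.01"), context=ctx)
except InvalidOperation:
    raised = True
```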

But why would the Decimal dtype be used in the test?

    @pytest.mark.parametrize("coerce", [True, False])
    def test_dataframe_schema_case(coerce):
        """Test a simple schema case."""
        schema = pa.DataFrameSchema(
            {
                "int_column": pa.Column(int, pa.Check.ge(0)),
                "float_column": pa.Column(float, pa.Check.le(0)),
                "str_column": pa.Column(str, pa.Check.isin(list("abcde"))),
            },
            coerce=coerce,
        )
        mdf = mpd.DataFrame(
            {
                "int_column": range(10),
                "float_column": [float(-x) for x in range(10)],
                "str_column": list("aabbcceedd"),
            }
        )
>       assert isinstance(schema.validate(mdf), mpd.DataFrame)

@cosmicBboy (Collaborator)

I was able to reproduce the error in CI with modin==0.14.1

@cosmicBboy (Collaborator)

looks like modin isn't happy when you do operations on empty series, case in point:

pandera/schemas.py:1988: in validate
    reshaped_failure_cases = reshape_failure_cases(failure_cases)
pandera/error_formatters.py:135: in reshape_failure_cases
    if ignore_na

it's failing in error_formatters.py:

    return (
        reshaped_failure_cases.dropna()
        if ignore_na
        else reshaped_failure_cases
    )

@cosmicBboy (Collaborator)

hey @jeffzi is this PR ready for review? also, if you rebase onto dev some of those test failures will go away

@jeffzi (Collaborator, Author) commented May 18, 2022

Sorry for not following up earlier. Merging dev fixed the fastapi failures.

The PR is ready for review if you don't mind the missing documentation. I will add it this weekend, and we can finally wrap up the PR.

@jeffzi (Collaborator, Author) commented May 22, 2022

@cosmicBboy Added documentation, and CI passes except for a slight codecov decrease.

@cosmicBboy (Collaborator)

Thanks @jeffzi! Lemme give it a final look

@cosmicBboy (Collaborator) left a comment

looking good! just updated the docs a little and inline comment re: tests for ValueError execution paths

@cosmicBboy (Collaborator)

Amazing @jeffzi ! 🚀

@cosmicBboy cosmicBboy merged commit 07d25a9 into unionai-oss:dev May 26, 2022
cosmicBboy added a commit that referenced this pull request Aug 10, 2022
* add imports to fastapi docs

* Add option to disallow duplicate column names (#758)

* ENH: add duplicate detection to dataframeschema

* ENH: propagate duplicate colnames check to schemamodel

* Add getter setter property

* make schemamodel actually work, update __str__

* fix __repr__ as well

* fix incorrect default value

* black formatting has changed

* invert parameter naming convention

* address other PR comments

* fix doctests, comma in __str__

* maybe fix sphinx errors

* fix ci and mypy tests

* Update test_schemas.py

* fix lint

Co-authored-by: cosmicBboy <[email protected]>

* Make SchemaModel use class name, define own config (#761)

* Make SchemaModel use class name, define own config

* fix

* fix

* fix

* fix tests

* fix lint and docs

* add test

Co-authored-by: cosmicBboy <[email protected]>

* implement coercion-on-initialization for DataFrame[SchemaModel] types (#772)

* implement coercion-on-initialization

* pylint

* Update tests/core/test_model.py

Co-authored-by: Matt Richards <[email protected]>

Co-authored-by: Matt Richards <[email protected]>

* update conda install instructions (#776)

* add documentation for pandas_engine.DateTime (#780)

* add documentation for pandas_engine.DateTime

* fix removed numpy_engine.Object doc

* set default n_failure_cases to None (#784)

* Update filtering columns for performance reasons. (#777)

* Update filtering columns for performance reasons.

* Update pandera/schemas.py

* Update schemas.py

* Update schemas.py

* Bugfix in schemas.py

Co-authored-by: Niels Bantilan <[email protected]>

* implement pydantic model data type (#779)

* make finding coerce failure cases faster (#792)

* make finding coerce failure cases faster

* fix tests

* remove unneeded import

* fix tests, coverage

* update docs for 0.10.0 (#795)

* add pyspark support, deprecate koalas (#793)

* add support for pyspark.pandas, deprecate koalas

* update docs

* add type check in pandas generics

* update docs

* clean up ci

* fix mypy, generics

* fix generic hack

* improve coverage

* Add overloads to `schema.to_yaml` (#790)

* Add overloads to `to_yaml`

* Update schemas.py

Co-authored-by: Niels Bantilan <[email protected]>

* add support for logical data types

* add initial support for decimal

* fix dtype check

* Feature: Add support for Generic to SchemaModel (#810)

* Adapt SchemaModel so that it can inherit from typing.Generic

* Extend SchemaModel to enable generic types in fields

* fix linter

Co-authored-by: Thomas Willems <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>

* fix pandas_engine.DateTime.coerce_value not consistent with coerce (#827)

* pyspark docs fixes

* fix koalas link to pyspark

* bump version 0.10.1

* fix pandas_engine.DateTime.coerce_value not consistent with coerce

Co-authored-by: cosmicBboy <[email protected]>

* Refactor logical type check method

* add logical types tests

* add back conftest

* fix test_invalid_annotations

* fix ray initialization in setup_modin_engine

* fix logical type validation when output is an iterable

* add Decimal data type to pandera.__init__

* remove DataType.is_logical

* add logical types documentation

* Update dtypes.rst

* Update dtypes.rst

* increase coverage

* fix SchemaErrors.failure_cases with logical types

* fix modin compatibility for logical type validation

* fix prepare_series_check_output compatibility with pyspark

* fix mypy error

* Update dtypes.rst

Co-authored-by: cosmicBboy <[email protected]>
Co-authored-by: Matt Richards <[email protected]>
Co-authored-by: Sean Mackesey <[email protected]>
Co-authored-by: Ferdinand Hahmann <[email protected]>
Co-authored-by: Robert Craigie <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: Thomas Willems <[email protected]>
7 participants