Support for logical data types #798


Merged: 39 commits, May 26, 2022
Conversation

@jeffzi (Collaborator) commented Mar 24, 2022

Fixes #788

So far, Pandera only supports physical data types, i.e. types that have a 1-to-1 relationship with pandas/numpy dtypes. Logical data types represent an abstracted understanding of the data.

The use-cases are:

  1. Logical types that consist of a pandas dtype + a check on values: IP addresses, URLs, paths. These can currently be validated with a Check, but coercion is not possible.

  2. Dtypes unofficially supported by pandas: date, decimal, etc. Another example is the new PydanticModel introduced in "implement pydantic model data type" #779.

This PR is a proof-of-concept. I added an attribute DataType.is_logical. When True, we expect DataType.check to return a mask of valid data, similar to the output of a check function. This is necessary to report failure cases, which was impossible in my initial proposal #788.

@cosmicBboy I tested this approach with the Decimal data type. I'd like to have your opinion before cleaning up the code and adding robust tests. I played with returning a Check or CheckResult class but did not find it very elegant.
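The proof-of-concept described above could be sketched roughly as follows. Only `DataType.is_logical` and the mask-returning `check` come from the PR description; the `DecimalLike` class and every other name here are hypothetical, for illustration only:

```python
import pandas as pd
from decimal import Decimal, InvalidOperation


class DataType:
    is_logical = False  # physical dtypes: 1-1 mapping to a pandas/numpy dtype

    def check(self, pandera_dtype, data_container=None):
        # physical check: compare dtypes only, return a scalar bool
        return self == pandera_dtype


class DecimalLike(DataType):
    is_logical = True  # when True, check returns a mask of valid values

    def check(self, pandera_dtype, data_container=None):
        # logical check: element-wise mask, so failure cases can be reported
        def is_decimal(value):
            try:
                Decimal(str(value))
                return True
            except InvalidOperation:
                return False

        return data_container.map(is_decimal)


mask = DecimalLike().check(None, pd.Series(["999.99", "foobar"]))
```

The element-wise mask is what allows the schema to report exactly which values failed, instead of a single pass/fail verdict.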

cosmicBboy and others added 15 commits February 10, 2022 08:41
* ENH: add duplicate detection to dataframeschema

* ENH: propagate duplicate colnames check to schemamodel

* Add getter setter property

* make schemamodel actually work, update __str__

* fix __repr__ as well

* fix incorrect default value

* black formatting has changed

* invert parameter naming convention

* address other PR comments

* fix doctests, comma in __str__

* maybe fix sphinx errors

* fix ci and mypy tests

* Update test_schemas.py

* fix lint

Co-authored-by: cosmicBboy <[email protected]>
* Make SchemaModel use class name, define own config

* fix

* fix

* fix

* fix tests

* fix lint and docs

* add test

Co-authored-by: cosmicBboy <[email protected]>
…unionai-oss#772)

* implement coercion-on-initialization

* pylint

* Update tests/core/test_model.py

Co-authored-by: Matt Richards <[email protected]>

Co-authored-by: Matt Richards <[email protected]>
* add documentation for pandas_engine.DateTime

* fix removed numpy_engine.Object doc
* Update filtering columns for performance reasons.

* Update pandera/schemas.py

* Update schemas.py

* Update schemas.py

* Bugfix in schemas.py

Co-authored-by: Niels Bantilan <[email protected]>
* make finding coerce failure cases faster

* fix tests

* remove unneeded import

* fix tests, coverage
* add support for pyspark.pandas, deprecate koalas

* update docs

* add type check in pandas generics

* update docs

* clean up ci

* fix mypy, generics

* fix generic hack

* improve coverage
* Add overloads to `to_yaml`

* Update schemas.py

Co-authored-by: Niels Bantilan <[email protected]>
@jeffzi (Collaborator, Author) commented Mar 24, 2022

Here is a copy/pastable example to play with:

import pandas as pd
import pandera as pa
from decimal import Decimal
from pandera.engines import pandas_engine


data = pd.DataFrame({"col": [Decimal("999.99") for _ in range(3)] + ["foobar"]})

schema = pa.DataFrameSchema({"col": pa.Column(pandas_engine.Decimal(5, 2))})
schema.validate(data)  # expected to flag "foobar" as a failure case

@cosmicBboy (Collaborator)

This looks good overall @jeffzi !

We might consider a DataType subclass for LogicalDataType... I think it might provide a cleaner interface for the check method. Also, perhaps two separate methods for clarity might make sense: check_dtype for the current check interface, and check_value for checking an actual value.

# dtypes module

class DataType(ABC):
    # basically the same as the current check method
    def check_dtype(self, pandera_dtype): ...

class LogicalDataType(DataType):
    # logical data types define an additional method
    def check_value(self, data_container): ...

@cosmicBboy (Collaborator)

#807 also seems like a good use case for this... there seems to be a regression in the str type-checking behavior.

Before, str was special-cased since it used the numpy object datatype to represent strings... so pandera would actually check the values of the object array to make sure all the values were actually strings and not other types.
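The old special-casing described here amounts to something like the following sketch (hypothetical, not pandera's actual implementation): for an object-dtype array, inspect each value rather than trusting the dtype alone.

```python
import pandas as pd


def check_str_column(series: pd.Series) -> pd.Series:
    """Return an element-wise mask: True where the value is actually a str."""
    if series.dtype != object:
        # this sketch only handles the numpy object-dtype representation
        return pd.Series(False, index=series.index)
    return series.map(lambda value: isinstance(value, str))


mask = check_str_column(pd.Series(["a", 1, "b"]))
```

This is exactly the "physical dtype + value check" pattern of a logical data type: the object dtype alone says nothing about whether every element is a string.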

@jeffzi (Collaborator, Author) commented Mar 31, 2022

We might consider a DataType subclass for LogicalDataType.

I wanted to avoid a test isinstance(self.dtype, LogicalDataType), which is less pythonic imho... yeah I know 😓. Another issue is that check_dtype is a prerequisite for check_value (e.g. Decimal requires the object dtype, IP requires str), but you'd still need to call check_value as well for complete validation. I would prefer DataFrameSchema to not know about the subtleties of data types.

Another option is to go back to my initial idea: def check(self, pandera_dtype: DataType, data_container: Optional[Any] = None) -> Union[bool, Sequence[bool]]. DataFrameSchema can pass the data container every time and produce an appropriate SchemaError depending on the results (wrapping failure cases and an appropriate error message).

Some points to note:

  1. This design does not make the data_container mandatory for logical types even though we know it is necessary. pandera_dtype can also be inferred from data_container (currently done prior to calling check). We could change the signature to check(self, data_container). It would require refactoring internal pandera tests. It's a breaking change to the public API but should only affect a minority of users. It would still require a deprecation warning one version ahead, just in case.

  2. It also does not force check to return a sequence of booleans for logical types. That said, it makes sense to allow returning a scalar bool if a data type is costly to validate and/or the volume of data is high and the user does not care about failure cases.
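The option discussed above could look like this minimal sketch (hypothetical simplification, not pandera's actual implementation):

```python
from typing import Any, Optional, Sequence, Union


class DataType:
    def check(
        self,
        pandera_dtype: "DataType",
        data_container: Optional[Any] = None,
    ) -> Union[bool, Sequence[bool]]:
        # Physical dtypes can ignore data_container and return a scalar bool;
        # logical dtypes may inspect the data and return an element-wise mask.
        return self == pandera_dtype


dt = DataType()
result = dt.check(dt)  # scalar bool: no value-level validation needed
```

The union return type captures both points of the list: the caller inspects the result and only builds failure cases when it receives a sequence.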

@jeffzi (Collaborator, Author) commented Apr 7, 2022

@cosmicBboy Friendly ping :)

@cosmicBboy (Collaborator)

I would prefer DataFrameSchema to not know about the subtleties of data types.

Agreed!

Another option is to go back to my initial idea def check(self, pandera_dtype: DataType, data_container: Optional[Any] = None) -> Union[bool, Sequence[bool]]

Let's go for this option for now. check(self, data_container) is cleaner in principle, but I'd prioritize no breaking changes over cleanliness, and I think it's fairly intuitive to have an optional arg for the data_container in the case that the physical datatype isn't enough to verify the logical datatype.

When we learn more and have a better sense of how physical/logical dtypes work we can think about a cleaner (but breaking-change) interface... it's funny to think this project is getting mature enough where 1.0.0 is something to think about in the medium term 😅.

@jeffzi (Collaborator, Author) commented Apr 7, 2022

Thanks for your input. I agree not to break things for now. Probably a decision to make before pandera hits 1.0 😎

tfwillems and others added 5 commits April 19, 2022 08:28
* Adapt SchemaModel so that it can inherit from typing.Generic

* Extend SchemaModel to enable generic types in fields

* fix linter

Co-authored-by: Thomas Willems <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>
…nionai-oss#827)

* pyspark docs fixes

* fix koalas link to pyspark

* bump version 0.10.1

* fix pandas_engine.DateTime.coerce_value not consistent with coerce

Co-authored-by: cosmicBboy <[email protected]>
@jeffzi (Collaborator, Author) commented May 8, 2022

@cosmicBboy I'm having trouble with the modin-ray tests.

They pass on my local machine but fail for Python < 3.10 in CI. I did move the setup_modin_engine fixture to tests/modin/conftest.py in order to share it with the logical dtype tests. I don't see how this would cause CI to fail. I'm probably missing something since I'm not very familiar with modin and ray.

Could you have a look please?

Other than that, I did not manage to make Decimal work with pyspark because pyarrow complains about Decimal in a series typed with object. As far as I can tell, we need a pandas UDF. I'd rather wait for the refactor of pandera's handling of data container libraries before going down that road.

Let me know if the testing structure works for you, and I'll add placeholders for logical type tests in tests/pyspark. In the future, I think a similar structure could work for testing regular data types too.

@cosmicBboy (Collaborator)

Thanks @jeffzi, lemme take a look

@cosmicBboy (Collaborator)

just looking at the error messages:

  File "/usr/share/miniconda/envs/pandera-dev/lib/python3.9/site-packages/pandas/core/apply.py", line 851, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/usr/share/miniconda/envs/pandera-dev/lib/python3.9/site-packages/pandas/core/apply.py", line 867, in apply_series_generator
    results[i] = self.f(v)
  File "/usr/share/miniconda/envs/pandera-dev/lib/python3.9/site-packages/pandas/core/frame.py", line 8922, in infer
    return lib.map_infer(x.astype(object)._values, func, ignore_na=ignore_na)
  File "pandas/_libs/lib.pyx", line 2870, in pandas._libs.lib.map_infer
  File "/usr/share/miniconda/envs/pandera-dev/lib/python3.9/site-packages/modin/pandas/series.py", line 1248, in <lambda>
    lambda s: arg(s)
  File "/home/runner/work/pandera/pandera/pandera/engines/pandas_engine.py", line 503, in coerce_value
    return dec.quantize(self._exp, context=self._ctx)
decimal.InvalidOperation: [<class 'decimal.InvalidOperation'>]

Seems like that dec.quantize call is the proximal cause of the error.
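The `InvalidOperation` above can be reproduced in isolation: `Decimal.quantize` raises it whenever the rounded result needs more significant digits than the context's precision allows. Assuming `coerce_value` quantizes with a fixed-precision context (as the traceback's `self._exp`/`self._ctx` suggests), an out-of-range value would trigger exactly this error:

```python
from decimal import Context, Decimal, InvalidOperation

ctx = Context(prec=5)  # e.g. Decimal(5, 2): at most 5 significant digits
raised = False
try:
    # 999999.99 needs 8 significant digits, so quantize cannot represent it
    Decimal("999999.99").quantize(Decimal("0.01"), context=ctx)
except InvalidOperation:
    raised = True
```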

But why would the Decimal dtype be used in the test?

    @pytest.mark.parametrize("coerce", [True, False])
    def test_dataframe_schema_case(coerce):
        """Test a simple schema case."""
        schema = pa.DataFrameSchema(
            {
                "int_column": pa.Column(int, pa.Check.ge(0)),
                "float_column": pa.Column(float, pa.Check.le(0)),
                "str_column": pa.Column(str, pa.Check.isin(list("abcde"))),
            },
            coerce=coerce,
        )
        mdf = mpd.DataFrame(
            {
                "int_column": range(10),
                "float_column": [float(-x) for x in range(10)],
                "str_column": list("aabbcceedd"),
            }
        )
>       assert isinstance(schema.validate(mdf), mpd.DataFrame)

@cosmicBboy (Collaborator)

I was able to reproduce the error in CI with modin==0.14.1

@cosmicBboy (Collaborator)

looks like modin isn't happy when you do operations on empty series, case in point:

pandera/schemas.py:1988: in validate
    reshaped_failure_cases = reshape_failure_cases(failure_cases)
pandera/error_formatters.py:135: in reshape_failure_cases
    if ignore_na

it's failing in error_formatters.py:

    return (
        reshaped_failure_cases.dropna()
        if ignore_na
        else reshaped_failure_cases
    )

@cosmicBboy (Collaborator)

hey @jeffzi is this PR ready for review? also, if you rebase onto dev some of those test failures will go away

@jeffzi (Collaborator, Author) commented May 18, 2022

Sorry for not following up earlier. Merging dev fixed the fastapi failures.

The PR is ready for review if you don't mind the missing documentation. I will add it this weekend, and we can finally wrap up the PR.

@jeffzi (Collaborator, Author) commented May 22, 2022

@cosmicBboy Added documentation, and CI passes except for a slight codecov decrease.

@cosmicBboy (Collaborator)

Thanks @jeffzi! Lemme give it a final look

@cosmicBboy (Collaborator) left a comment

looking good! just updated the docs a little and inline comment re: tests for ValueError execution paths

@cosmicBboy (Collaborator)

Amazing @jeffzi ! 🚀

@cosmicBboy cosmicBboy merged commit 07d25a9 into unionai-oss:dev May 26, 2022
cosmicBboy added a commit that referenced this pull request Aug 10, 2022
* add imports to fastapi docs

* Add option to disallow duplicate column names (#758)

* ENH: add duplicate detection to dataframeschema

* ENH: propagate duplicate colnames check to schemamodel

* Add getter setter property

* make schemamodel actually work, update __str__

* fix __repr__ as well

* fix incorrect default value

* black formatting has changed

* invert parameter naming convention

* address other PR comments

* fix doctests, comma in __str__

* maybe fix sphinx errors

* fix ci and mypy tests

* Update test_schemas.py

* fix lint

Co-authored-by: cosmicBboy <[email protected]>

* Make SchemaModel use class name, define own config (#761)

* Make SchemaModel use class name, define own config

* fix

* fix

* fix

* fix tests

* fix lint and docs

* add test

Co-authored-by: cosmicBboy <[email protected]>

* implement coercion-on-initialization for DataFrame[SchemaModel] types (#772)

* implement coercion-on-initialization

* pylint

* Update tests/core/test_model.py

Co-authored-by: Matt Richards <[email protected]>

Co-authored-by: Matt Richards <[email protected]>

* update conda install instructions (#776)

* add documentation for pandas_engine.DateTime (#780)

* add documentation for pandas_engine.DateTime

* fix removed numpy_engine.Object doc

* set default n_failure_cases to None (#784)

* Update filtering columns for performance reasons. (#777)

* Update filtering columns for performance reasons.

* Update pandera/schemas.py

* Update schemas.py

* Update schemas.py

* Bugfix in schemas.py

Co-authored-by: Niels Bantilan <[email protected]>

* implement pydantic model data type (#779)

* make finding coerce failure cases faster (#792)

* make finding coerce failure cases faster

* fix tests

* remove unneeded import

* fix tests, coverage

* update docs for 0.10.0 (#795)

* add pyspark support, deprecate koalas (#793)

* add support for pyspark.pandas, deprecate koalas

* update docs

* add type check in pandas generics

* update docs

* clean up ci

* fix mypy, generics

* fix generic hack

* improve coverage

* Add overloads to `schema.to_yaml` (#790)

* Add overloads to `to_yaml`

* Update schemas.py

Co-authored-by: Niels Bantilan <[email protected]>

* add support for logical data types

* add initial support for decimal

* fix dtype check

* Feature: Add support for Generic to SchemaModel (#810)

* Adapt SchemaModel so that it can inherit from typing.Generic

* Extend SchemaModel to enable generic types in fields

* fix linter

Co-authored-by: Thomas Willems <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>

* fix pandas_engine.DateTime.coerce_value not consistent with coerce (#827)

* pyspark docs fixes

* fix koalas link to pyspark

* bump version 0.10.1

* fix pandas_engine.DateTime.coerce_value not consistent with coerce

Co-authored-by: cosmicBboy <[email protected]>

* Refactor logical type check method

* add logical types tests

* add back conftest

* fix test_invalid_annotations

* fix ray initialization in setup_modin_engine

* fix logical type validation when output is an iterable

* add Decimal data type to pandera.__init__

* remove DataType.is_logical

* add logical types documentation

* Update dtypes.rst

* Update dtypes.rst

* increase coverage

* fix SchemaErrors.failure_cases with logical types

* fix modin compatibility for logical type validation

* fix prepare_series_check_output compatibility with pyspark

* fix mypy error

* Update dtypes.rst

Co-authored-by: cosmicBboy <[email protected]>
Co-authored-by: Matt Richards <[email protected]>
Co-authored-by: Sean Mackesey <[email protected]>
Co-authored-by: Ferdinand Hahmann <[email protected]>
Co-authored-by: Robert Craigie <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: Thomas Willems <[email protected]>
7 participants