implement pydantic model data type #779


Merged
merged 10 commits into from
Mar 19, 2022

Conversation

cosmicBboy
Collaborator

fixes #764


codecov bot commented Mar 5, 2022

Codecov Report

Merging #779 (37acb98) into dev (ebfecc1) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #779      +/-   ##
==========================================
- Coverage   97.71%   97.63%   -0.08%     
==========================================
  Files          45       44       -1     
  Lines        4026     4059      +33     
==========================================
+ Hits         3934     3963      +29     
- Misses         92       96       +4     
Impacted Files                      Coverage Δ
pandera/checks.py                   98.51% <ø> (ø)
pandera/error_formatters.py         92.59% <ø> (-0.27%) ⬇️
pandera/model.py                    95.07% <ø> (ø)
pandera/engines/pandas_engine.py    97.45% <100.00%> (+0.20%) ⬆️
pandera/schemas.py                  99.24% <100.00%> (+<0.01%) ⬆️
pandera/typing/config.py            100.00% <100.00%> (ø)
pandera/check_utils.py              90.00% <0.00%> (-6.67%) ⬇️
pandera/constants.py

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Collaborator

@jeffzi jeffzi left a comment

Really cool use of DataType!

I have 2 remarks:

  1. If coerce is set to False, validation passes because the schema thinks there are no columns to validate. Pandera should warn if the global dtype is a PydanticModel and coerce is False.

  2. PydanticModel(Record).check(df) always returns False. It's not a big deal since most users won't call it. We can override check and raise a NotImplementedError for now.

To expand on 2., one problem with the current DataType.check() method is that it validates the data type without access to the data itself. Therefore you cannot reuse the coerce logic to validate with check. I ran into this problem when implementing decimal and date dtypes, which is why I've been holding back my PR. Those dtypes can be coerced but not validated without looking at the data. I will open a separate issue because it deserves its own discussion.

Minimal example:

import pandas as pd
from pydantic import BaseModel
import pandera as pa
from pandera.engines.pandas_engine import PydanticModel


class Record(BaseModel):
    """Pydantic record model."""

    age: int


class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = False # <------ 

df = pd.DataFrame({"age": [21, "foo"]})
PydanticSchema.validate(df) # passes
#>    age
#> 0   21
#> 1  foo

PydanticModel(Record).check(df) # unsupported, expects a data type
#> False
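
The limitation described in remark 2 can be illustrated without pandera. The sketch below uses toy stand-ins: `DataType`, `RowModelType`, and the error message are all hypothetical and are not pandera's classes. It assumes a `check` method that only ever receives another dtype, so a dtype that must read rows to be verified has nothing to work with and raises `NotImplementedError` rather than returning a misleading `False`.

```python
class DataType:
    """Toy stand-in for a dtype whose check() only sees another dtype."""

    def __init__(self, name: str) -> None:
        self.name = name

    def check(self, other: "DataType") -> bool:
        # Only type metadata is compared; the values are never inspected.
        return isinstance(other, DataType) and other.name == self.name


class RowModelType(DataType):
    """Toy stand-in for a dtype that can only be verified by reading rows."""

    def __init__(self) -> None:
        super().__init__("row_model")

    def check(self, other: "DataType") -> bool:
        # Hypothetical message: there is no data here to validate against.
        raise NotImplementedError(
            "row-wise model dtypes need the data itself; use coercion instead"
        )


int64 = DataType("int64")
same_dtype = int64.check(DataType("int64"))        # metadata matches
different_dtype = int64.check(DataType("object"))  # metadata differs

try:
    RowModelType().check(DataType("object"))
    check_error = None
except NotImplementedError as exc:
    check_error = str(exc)
```

Failing loudly here is a design choice: a `False` result suggests the data was inspected and rejected, when in fact it was never looked at.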

@cosmicBboy cosmicBboy requested a review from jeffzi March 6, 2022 03:50
@cosmicBboy
Collaborator Author

cosmicBboy commented Mar 9, 2022

@jeffzi I ended up adding the mypy nox session back; I'll tackle the pre-commit mypy issue in #786

also, please ✅ this PR if all looks good

@jeffzi
Collaborator

jeffzi commented Mar 9, 2022

Did you see my 2 remarks above?

If coerce is set to False, validation passes because the schema thinks there are no columns to validate. Pandera should warn if the global dtype is a PydanticModel and coerce is False.

^ I think this one is important to fix before merging.

@cosmicBboy
Collaborator Author

Ah, right, those slipped my mind. I think for (1) I'm gonna play around with adding an auto_coerce: bool class attribute to DataType. By default this will be False, but it'll be True for PydanticModel. Warnings are helpful, but I think it's better UX to make it easy to do the right thing: in this case, during the coerce step the schema will coerce the dtype if auto_coerce is True.
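
A rough sketch of how an `auto_coerce` class attribute could behave, using only the standard library. Everything here is an assumption for illustration: `DataType`, `IntType`, `RowModelType`, and `Schema` are toy stand-ins, not pandera's classes, and `int()` conversion stands in for row-wise model validation.

```python
from typing import List


class DataType:
    """Toy base dtype; auto_coerce is off unless a subclass opts in."""

    auto_coerce: bool = False

    def coerce(self, values: List[object]) -> List[object]:
        return list(values)


class IntType(DataType):
    """Coerces only when the schema explicitly asks for it."""

    def coerce(self, values: List[object]) -> List[object]:
        return [int(v) for v in values]


class RowModelType(DataType):
    """For this dtype, coercing is the validation step, so it always runs."""

    auto_coerce = True

    def coerce(self, values: List[object]) -> List[object]:
        # int() stands in for building a model instance from each row.
        return [int(v) for v in values]


class Schema:
    """Toy schema that coerces if either the schema or the dtype requests it."""

    def __init__(self, dtype: DataType, coerce: bool = False) -> None:
        self.dtype = dtype
        self.coerce = coerce

    def validate(self, values: List[object]) -> List[object]:
        if self.coerce or self.dtype.auto_coerce:
            return self.dtype.coerce(values)
        return values


passthrough = Schema(IntType(), coerce=False).validate(["1", "x"])
coerced = Schema(IntType(), coerce=True).validate(["1", "2"])

try:
    Schema(RowModelType(), coerce=False).validate([21, "foo"])
    auto_error = None
except ValueError as exc:
    auto_error = str(exc)
```

With the flag set on the dtype, the `[21, "foo"]` input is rejected even though the schema was created with `coerce=False`, which is the behavior the minimal example above was missing.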

@cosmicBboy
Collaborator Author

@jeffzi I decided to special-case the PydanticModel dtype in the DataFrameSchema constructor, raising a SchemaInitError if coerce=False and dtype is a PydanticModel... I thought it best to have pandera barf in this case instead of emitting a warning.
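
The constructor-time guard described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the merged implementation: `FrameSchema`, `RowModelType`, `Record`, and the error message are hypothetical stand-ins, and `SchemaInitError` is redefined locally so the example runs without pandera installed.

```python
from typing import Optional


class SchemaInitError(Exception):
    """Raised when a schema is constructed with inconsistent options."""


class RowModelType:
    """Toy stand-in for a dtype that wraps a row-wise model."""

    def __init__(self, model: type) -> None:
        self.model = model


class FrameSchema:
    """Toy stand-in for a dataframe schema constructor."""

    def __init__(self, dtype: Optional[object] = None, coerce: bool = False) -> None:
        # Fail at construction time instead of letting validation pass silently.
        if isinstance(dtype, RowModelType) and not coerce:
            raise SchemaInitError(
                "a row-wise model dtype requires coerce=True"  # hypothetical message
            )
        self.dtype = dtype
        self.coerce = coerce


class Record:
    """Placeholder for the row model."""


ok_schema = FrameSchema(dtype=RowModelType(Record), coerce=True)

try:
    FrameSchema(dtype=RowModelType(Record), coerce=False)
    init_error = None
except SchemaInitError as exc:
    init_error = str(exc)
```

Raising at construction surfaces the misconfiguration before any data is validated, whereas a warning could be missed while invalid frames continue to pass.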

@cosmicBboy cosmicBboy merged commit 9a43c14 into dev Mar 19, 2022
@cosmicBboy cosmicBboy deleted the pydantic-dtype branch March 19, 2022 17:55
cosmicBboy added a commit that referenced this pull request Apr 1, 2022
cosmicBboy added a commit that referenced this pull request May 26, 2022
* add imports to fastapi docs

* Add option to disallow duplicate column names (#758)

* ENH: add duplicate detection to dataframeschema

* ENH: propagate duplicate colnames check to schemamodel

* Add getter setter property

* make schemamodel actually work, update __str__

* fix __repr__ as well

* fix incorrect default value

* black formatting has changed

* invert parameter naming convention

* address other PR comments

* fix doctests, comma in __str__

* maybe fix sphinx errors

* fix ci and mypy tests

* Update test_schemas.py

* fix lint

Co-authored-by: cosmicBboy <[email protected]>

* Make SchemaModel use class name, define own config (#761)

* Make SchemaModel use class name, define own config

* fix

* fix

* fix

* fix tests

* fix lint and docs

* add test

Co-authored-by: cosmicBboy <[email protected]>

* implement coercion-on-initialization for DataFrame[SchemaModel] types (#772)

* implement coercion-on-initialization

* pylint

* Update tests/core/test_model.py

Co-authored-by: Matt Richards <[email protected]>

Co-authored-by: Matt Richards <[email protected]>

* update conda install instructions (#776)

* add documentation for pandas_engine.DateTime (#780)

* add documentation for pandas_engine.DateTime

* fix removed numpy_engine.Object doc

* set default n_failure_cases to None (#784)

* Update filtering columns for performance reasons. (#777)

* Update filtering columns for performance reasons.

* Update pandera/schemas.py

* Update schemas.py

* Update schemas.py

* Bugfix in schemas.py

Co-authored-by: Niels Bantilan <[email protected]>

* implement pydantic model data type (#779)

* make finding coerce failure cases faster (#792)

* make finding coerce failure cases faster

* fix tests

* remove unneeded import

* fix tests, coverage

* update docs for 0.10.0 (#795)

* add pyspark support, deprecate koalas (#793)

* add support for pyspark.pandas, deprecate koalas

* update docs

* add type check in pandas generics

* update docs

* clean up ci

* fix mypy, generics

* fix generic hack

* improve coverage

* Add overloads to `schema.to_yaml` (#790)

* Add overloads to `to_yaml`

* Update schemas.py

Co-authored-by: Niels Bantilan <[email protected]>

* add support for logical data types

* add initial support for decimal

* fix dtype check

* Feature: Add support for Generic to SchemaModel (#810)

* Adapt SchemaModel so that it can inherit from typing.Generic

* Extend SchemaModel to enable generic types in fields

* fix linter

Co-authored-by: Thomas Willems <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>

* fix pandas_engine.DateTime.coerce_value not consistent with coerce (#827)

* pyspark docs fixes

* fix koalas link to pyspark

* bump version 0.10.1

* fix pandas_engine.DateTime.coerce_value not consistent with coerce

Co-authored-by: cosmicBboy <[email protected]>

* Refactor logical type check method

* add logical types tests

* add back conftest

* fix test_invalid_annotations

* fix ray initialization in setup_modin_engine

* fix logical type validation when output is an iterable

* add Decimal data type to pandera.__init__

* remove DataType.is_logical

* add logical types documentation

* Update dtypes.rst

* Update dtypes.rst

* increase coverage

* fix SchemaErrors.failure_cases with logical types

* fix modin compatibility for logical type validation

* fix prepare_series_check_output compatibility with pyspark

* fix mypy error

* Update dtypes.rst

Co-authored-by: cosmicBboy <[email protected]>
Co-authored-by: Matt Richards <[email protected]>
Co-authored-by: Sean Mackesey <[email protected]>
Co-authored-by: Ferdinand Hahmann <[email protected]>
Co-authored-by: Robert Craigie <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: Thomas Willems <[email protected]>
cosmicBboy added a commit that referenced this pull request Aug 10, 2022