implement pydantic model data type #779

cosmicBboy · 2022-03-03T14:41:26Z

fixes #764

codecov · 2022-03-05T05:18:31Z

Codecov Report

Merging #779 (37acb98) into dev (ebfecc1) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #779      +/-   ##
==========================================
- Coverage   97.71%   97.63%   -0.08%     
==========================================
  Files          45       44       -1     
  Lines        4026     4059      +33     
==========================================
+ Hits         3934     3963      +29     
- Misses         92       96       +4
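The percentages in the diff follow directly from the hits and lines counts. The short check below recomputes them, using only the figures quoted in the report; the helper name `coverage` is illustrative and not part of Codecov. The exact difference is just under 0.08 points, so the 0.07% in the summary line is presumably truncated rather than rounded (an assumption, the report does not say).

```python
# Figures copied from the coverage diff above.
base_hits, base_lines = 3934, 4026  # dev (ebfecc1)
head_hits, head_lines = 3963, 4059  # #779 (37acb98)


def coverage(hits: int, lines: int) -> float:
    """Return coverage as a percentage of covered lines."""
    return 100 * hits / lines


base_cov = coverage(base_hits, base_lines)
head_cov = coverage(head_hits, head_lines)
delta = head_cov - base_cov

print(f"base   {base_cov:.2f}%")   # 97.71%
print(f"head   {head_cov:.2f}%")   # 97.63%
print(f"delta  {delta:+.2f}%")     # -0.08%
print(f"misses {base_lines - base_hits} -> {head_lines - head_hits}")  # 92 -> 96
```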

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandera/checks.py | `98.51% <ø> (ø)` | |
| pandera/error_formatters.py | `92.59% <ø> (-0.27%)` | ⬇️ |
| pandera/model.py | `95.07% <ø> (ø)` | |
| pandera/engines/pandas_engine.py | `97.45% <100.00%> (+0.20%)` | ⬆️ |
| pandera/schemas.py | `99.24% <100.00%> (+<0.01%)` | ⬆️ |
| pandera/typing/config.py | `100.00% <100.00%> (ø)` | |
| pandera/check_utils.py | `90.00% <0.00%> (-6.67%)` | ⬇️ |
| pandera/constants.py | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeffzi

Really cool use of DataType !

I have 2 remarks:

1. If coerce is set to False, validation passes since the schema thinks there are no columns to validate. Pandera should warn if the global dtype is a PydanticModel and coerce is False.
2. PydanticModel(Record).check(df) always returns False. It's not a big deal since most users won't call it. We can override check and raise a NotImplementedError for now.

To expand on 2., one problem with the current DataType.check() method is that it validates the data type without access to the data itself. Therefore you cannot reuse the coerce logic to validate with check. I ran into this problem when implementing the decimal and date dtypes, which is why I've been holding back my PR. Those dtypes can be coerced but not validated without looking at the data. I will open a separate issue because it deserves its own discussion.

Minimal example:

import pandas as pd
from pydantic import BaseModel
import pandera as pa
from pandera.engines.pandas_engine import PydanticModel


class Record(BaseModel):
    """Pydantic record model."""

    age: int


class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = False # <------ 

df = pd.DataFrame({"age": [21, "foo"]})
PydanticSchema.validate(df)  # passes, although "foo" is not a valid int
#>    age
#> 0   21
#> 1  foo

PydanticModel(Record).check(df)  # unsupported, expects a data type
#> False
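The limitation described in remark 2 can be sketched without pandera at all. In the snippet below, every name (`SketchDataType`, `RowModelDtype`, `validate_record`) is hypothetical and only mirrors the shape of the problem: `check` receives a dtype, never the data, so a dtype whose validity depends on row values can only validate inside `coerce`. Raising `NotImplementedError` from `check` is the stopgap proposed above, not necessarily what was merged.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class SketchDataType:
    """Hypothetical stand-in for a dtype interface; not pandera's DataType."""

    def coerce(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError

    def check(self, other_dtype: Any) -> bool:
        # Only a dtype is passed in, never the data itself.
        return isinstance(other_dtype, type(self))


class RowModelDtype(SketchDataType):
    """Dtype whose validity depends on the values in each row."""

    def __init__(self, row_validator: Callable[[Row], Row]) -> None:
        self.row_validator = row_validator

    def coerce(self, rows: List[Row]) -> List[Row]:
        # Validation happens here because this is the only hook that sees data.
        return [self.row_validator(row) for row in rows]

    def check(self, other_dtype: Any) -> bool:
        # Fail loudly instead of silently returning False.
        raise NotImplementedError(
            "row-level dtypes can only be validated through coerce()"
        )


def validate_record(row: Row) -> Row:
    """Toy replacement for a pydantic model with a single `age: int` field."""
    return {"age": int(row["age"])}


dtype = RowModelDtype(validate_record)
print(dtype.coerce([{"age": "21"}]))  # [{'age': 21}]
```

Passing a row such as `{"age": "foo"}` to `coerce` raises a `ValueError` in this toy version, which is the kind of failure that the `coerce = False` schema above never gets a chance to report.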

cosmicBboy · 2022-03-09T02:49:34Z

@jeffzi I ended up adding the mypy nox session back; I'll tackle the pre-commit mypy issue in #786

also, please ✅ this PR if all looks good

jeffzi · 2022-03-09T23:02:14Z

Did you see my 2 remarks above?

If coerce is set to False, validation passes since the schema thinks there are no columns to validate. Pandera should warn if the global dtype is a PydanticModel and coerce is False.

^ I think this one is important to fix before merging.

cosmicBboy · 2022-03-10T14:37:52Z

Ah, right, those slipped my mind. For (1), I'm going to play around with adding an auto_coerce: bool class attribute to DataType. By default it will be False, but it'll be True for PydanticModel. Warnings are helpful, but I think it's better UX to make it easy to do the right thing: during the coerce step, the schema will coerce the dtype if auto_coerce is True.
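A rough sketch of the auto_coerce idea, assuming a class-level flag that the schema consults during its coerce step. The classes and the `maybe_coerce` helper are hypothetical stand-ins written for illustration; they are not pandera's implementation, and plain lists stand in for dataframes.

```python
from typing import Any, List


class SketchDataType:
    """Hypothetical base dtype: coercion is opt-in by default."""

    auto_coerce: bool = False

    def coerce(self, values: List[Any]) -> List[Any]:
        return values


class SketchModelDtype(SketchDataType):
    """Hypothetical dtype that can only validate by coercing."""

    auto_coerce = True

    def coerce(self, values: List[Any]) -> List[Any]:
        # int() stands in for model validation; it raises on bad values.
        return [int(value) for value in values]


def maybe_coerce(values: List[Any], dtype: SketchDataType, coerce: bool = False) -> List[Any]:
    """Coerce when the schema asks for it or when the dtype requires it."""
    if coerce or dtype.auto_coerce:
        return dtype.coerce(values)
    return values


print(maybe_coerce(["1", "2"], SketchDataType()))    # ['1', '2']
print(maybe_coerce(["1", "2"], SketchModelDtype()))  # [1, 2]
```

With this design the user never has to remember to set coerce for such a dtype, at the cost of coercion happening even when coerce=False was passed explicitly.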

cosmicBboy · 2022-03-18T02:21:00Z

@jeffzi I decided to special-case the PydanticModel dtype in the DataFrameSchema constructor, raising a SchemaInitError if coerce=False and dtype is a PydanticModel. I thought it best to have pandera barf in this case instead of emitting a warning.
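The constructor-time guard could look roughly like the sketch below. The `Sketch*` classes are hypothetical stand-ins for the real ones, and the error message is invented for illustration; only the rule itself (reject coerce=False together with a PydanticModel dtype) comes from the comment above.

```python
from typing import Any, Optional


class SketchSchemaInitError(Exception):
    """Hypothetical stand-in for a schema initialization error."""


class SketchPydanticModel:
    """Hypothetical stand-in for a dtype that wraps a record model."""

    def __init__(self, model: Any) -> None:
        self.model = model


class SketchDataFrameSchema:
    """Hypothetical schema that rejects an inconsistent configuration early."""

    def __init__(self, dtype: Optional[Any] = None, coerce: bool = False) -> None:
        # A model-backed dtype validates only while coercing, so coerce=False
        # would silently skip validation. Fail at construction time instead.
        if isinstance(dtype, SketchPydanticModel) and not coerce:
            raise SketchSchemaInitError(
                "a model-backed dtype requires coerce=True"
            )
        self.dtype = dtype
        self.coerce = coerce


schema = SketchDataFrameSchema(dtype=SketchPydanticModel(dict), coerce=True)
print(schema.coerce)  # True

try:
    SketchDataFrameSchema(dtype=SketchPydanticModel(dict), coerce=False)
except SketchSchemaInitError as exc:
    print(f"rejected: {exc}")
```

Failing in the constructor surfaces the mistake when the schema is defined, before any dataframe is validated.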

* add imports to fastapi docs
* Add option to disallow duplicate column names (#758)
* ENH: add duplicate detection to dataframeschema
* ENH: propagate duplicate colnames check to schemamodel
* Add getter setter property
* make schemamodel actually work, update __str__
* fix __repr__ as well
* fix incorrect default value
* black formatting has changed
* invert parameter naming convention
* address other PR comments
* fix doctests, comma in __str__
* maybe fix sphinx errors
* fix ci and mypy tests
* Update test_schemas.py
* fix lint
Co-authored-by: cosmicBboy
* Make SchemaModel use class name, define own config (#761)
* Make SchemaModel use class name, define own config
* fix
* fix
* fix
* fix tests
* fix lint and docs
* add test
Co-authored-by: cosmicBboy
* implement coercion-on-initialization for DataFrame[SchemaModel] types (#772)
* implement coercion-on-initialization
* pylint
* Update tests/core/test_model.py
Co-authored-by: Matt Richards
Co-authored-by: Matt Richards
* update conda install instructions (#776)
* add documentation for pandas_engine.DateTime (#780)
* add documentation for pandas_engine.DateTime
* fix removed numpy_engine.Object doc
* set default n_failure_cases to None (#784)
* Update filtering columns for performance reasons. (#777)
* Update filtering columns for performance reasons.
* Update pandera/schemas.py
* Update schemas.py
* Update schemas.py
* Bugfix in schemas.py
Co-authored-by: Niels Bantilan
* implement pydantic model data type (#779)
* make finding coerce failure cases faster (#792)
* make finding coerce failure cases faster
* fix tests
* remove unneeded import
* fix tests, coverage
* update docs for 0.10.0 (#795)
* add pyspark support, deprecate koalas (#793)
* add support for pyspark.pandas, deprecate koalas
* update docs
* add type check in pandas generics
* update docs
* clean up ci
* fix mypy, generics
* fix generic hack
* improve coverage
* Add overloads to `schema.to_yaml` (#790)
* Add overloads to `to_yaml`
* Update schemas.py
Co-authored-by: Niels Bantilan
* add support for logical data types
* add initial support for decimal
* fix dtype check
* Feature: Add support for Generic to SchemaModel (#810)
* Adapt SchemaModel so that it can inherit from typing.Generic
* Extend SchemaModel to enable generic types in fields
* fix linter
Co-authored-by: Thomas Willems
Co-authored-by: cosmicBboy
* fix pandas_engine.DateTime.coerce_value not consistent with coerce (#827)
* pyspark docs fixes
* fix koalas link to pyspark
* bump version 0.10.1
* fix pandas_engine.DateTime.coerce_value not consistent with coerce
Co-authored-by: cosmicBboy
* Refactor logical type check method
* add logical types tests
* add back conftest
* fix test_invalid_annotations
* fix ray initialization in setup_modin_engine
* fix logical type validation when output is an iterable
* add Decimal data type to pandera.__init__
* remove DataType.is_logical
* add logical types documentation
* Update dtypes.rst
* Update dtypes.rst
* increase coverage
* fix SchemaErrors.failure_cases with logical types
* fix modin compatibility for logical type validation
* fix prepare_series_check_output compatibility with pyspark
* fix mypy error
* Update dtypes.rst
Co-authored-by: cosmicBboy
Co-authored-by: Matt Richards
Co-authored-by: Sean Mackesey
Co-authored-by: Ferdinand Hahmann
Co-authored-by: Robert Craigie
Co-authored-by: tfwillems
Co-authored-by: Thomas Willems

implement pydantic model data type

06c1e37

cosmicBboy mentioned this pull request Mar 3, 2022

Dataframe schema from Pydantic record model #764

Closed

cosmicBboy requested a review from jeffzi March 3, 2022 14:44

cosmicBboy added 5 commits March 3, 2022 09:46

fix lint

940da66

fixes

9790c97

update pydantic type

0f3b7a0

update linting ci

ba25769

ignore pydantic model in strategies

7592dcd

jeffzi requested changes Mar 5, 2022

.github/workflows/ci-tests.yml (outdated, resolved)

pandera/engines/pandas_engine.py (outdated, resolved)

add back mypy nox, use np.nan

c0d783b

cosmicBboy requested a review from jeffzi March 6, 2022 03:50

jeffzi mentioned this pull request Mar 9, 2022

Add support for logical data types #788

Open

jeffzi reviewed Mar 13, 2022

pandera/engines/pandas_engine.py (resolved)

fix PydanticModel dtype

ab125db

jeffzi approved these changes Mar 18, 2022

cosmicBboy added 2 commits March 18, 2022 09:06

fix types

3771c06

fix type

37acb98

cosmicBboy merged commit 9a43c14 into dev Mar 19, 2022

cosmicBboy deleted the pydantic-dtype branch March 19, 2022 17:55

cosmicBboy mentioned this pull request Mar 20, 2022

Benchmark: PydanticModel validation vs equivalent Pandera schema #794

Open

jeffzi mentioned this pull request Mar 24, 2022

Support for logical data types #798

Merged

cosmicBboy added a commit that referenced this pull request Apr 1, 2022

implement pydantic model data type (#779)

3ea4f2c
