Feature: Add support for Generic to SchemaModel #810

Merged

tfwillems
Contributor

Currently, SchemaModel is not compatible with typing.Generic. I see two straightforward applications in which Generic would be useful:

  1. SchemaModel subclasses whose methods involve generic types, e.g.:
import pandera as pa
from typing import TypeVar, Generic

T = TypeVar("T")

class Foo(pa.SchemaModel, Generic[T]):
    @classmethod
    def bar(cls) -> T:
        raise NotImplementedError

class Bar1(Foo[int]):
    @classmethod
    def bar(cls) -> int:
        return 1

class Bar2(Foo[str]):
    @classmethod
    def bar(cls) -> str:
        return "1"

Currently, this won't work as it results in the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cluster/home/willems/.local/lib/python3.9/site-packages/pandera/model.py", line 169, in __init_subclass__
    cls.__config__, cls.__extras__ = cls._collect_config_and_extras()
  File "/cluster/home/willems/.local/lib/python3.9/site-packages/pandera/model.py", line 407, in _collect_config_and_extras
    options, extras = _extract_config_options_and_extras(root_model.Config)
AttributeError: type object 'Generic' has no attribute 'Config'
  2. A far more interesting application is when the types of fields are generic:
from typing import Generic, TypeVar

import pandas as pd
import pandera as pa
import pytest
from pandera.errors import SchemaError
from pandera.typing import Series

T = TypeVar("T")

class GenericModel(pa.SchemaModel, Generic[T]):
    x: Series[int]
    y: Series[T]

class IntModel(GenericModel[int]):
    ...

class FloatModel(GenericModel[float]):
    ...

IntModel.validate(pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}))
with pytest.raises(SchemaError):
    FloatModel.validate(pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}))

with pytest.raises(SchemaError):
    IntModel.validate(pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5, 6]}))
FloatModel.validate(pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5, 6]}))

This application is also not currently supported.
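
The AttributeError in the first case indicates that config collection touches typing.Generic, which has no Config attribute. One way to tolerate such bases is to skip any class in the MRO that does not define its own Config. The following is a minimal, self-contained sketch of that idea; Model, BaseConfig, and _collect_config are hypothetical stand-ins and are not pandera's actual code.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class BaseConfig:
    """Hypothetical default options (stand-in for a base Config)."""

    name = None
    strict = False


class Model:
    """Hypothetical stand-in for SchemaModel; not pandera's implementation."""

    Config = BaseConfig

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.__config_options__ = cls._collect_config()

    @classmethod
    def _collect_config(cls):
        options = {}
        # Walk from the most distant ancestor down to cls so that
        # options set on subclasses override inherited ones.
        for base in reversed(cls.__mro__):
            # Only look at a Config defined directly on this class.
            # typing.Generic and object have none, so they are skipped
            # instead of raising AttributeError.
            config = vars(base).get("Config")
            if config is None:
                continue
            for key, value in vars(config).items():
                if not key.startswith("_"):
                    options[key] = value
        return options


class Foo(Model, Generic[T]):
    class Config:
        strict = True

    @classmethod
    def bar(cls) -> T:
        raise NotImplementedError


class Bar1(Foo[int]):
    @classmethod
    def bar(cls) -> int:
        return 1


print(Bar1.__config_options__)
```

With this lookup, Generic can appear anywhere in the MRO without affecting which options are collected.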

This PR makes a few minor modifications to SchemaModel to enable both applications of Generic. I've also added a set of unit tests that suggest this is working as intended. The code required to enable Generic types in fields was heavily inspired by the approach used in pydantic.
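
As a rough illustration of how generic field types can be resolved, the sketch below substitutes TypeVars in field annotations when a model is parameterized, in the general spirit of pydantic-style generic models. It only rewrites annotations and performs no validation. Series, Model, resolve, and _PARAMETRIZED_CACHE are hypothetical stand-ins introduced for this example and should not be read as the code in this PR.

```python
from typing import Generic, TypeVar, get_origin, get_type_hints

T = TypeVar("T")

# Hypothetical cache so that Model[int] always returns the same class.
_PARAMETRIZED_CACHE = {}


class Series(Generic[T]):
    """Hypothetical stand-in for pandera.typing.Series."""


def resolve(annotation, mapping):
    """Replace TypeVars inside an annotation using mapping."""
    if isinstance(annotation, TypeVar):
        return mapping.get(annotation, annotation)
    params = getattr(annotation, "__parameters__", ())
    if get_origin(annotation) is None or not params:
        return annotation  # nothing generic left to substitute
    # Subscripting an alias such as Series[T] fills in its free TypeVars.
    return annotation[tuple(mapping.get(p, p) for p in params)]


class Model:
    """Hypothetical stand-in for SchemaModel."""

    def __class_getitem__(cls, params):
        if not isinstance(params, tuple):
            params = (params,)
        key = (cls, params)
        if key in _PARAMETRIZED_CACHE:
            return _PARAMETRIZED_CACHE[key]
        type_vars = getattr(cls, "__parameters__", ())
        if len(type_vars) != len(params):
            raise TypeError(f"{cls.__name__} expects {len(type_vars)} parameter(s)")
        mapping = dict(zip(type_vars, params))
        names = ", ".join(getattr(p, "__name__", repr(p)) for p in params)
        # Create a concrete subclass whose annotations have T filled in.
        new_cls = type(f"{cls.__name__}[{names}]", (cls,), {})
        new_cls.__annotations__ = {
            field: resolve(annotation, mapping)
            for field, annotation in get_type_hints(cls).items()
        }
        _PARAMETRIZED_CACHE[key] = new_cls
        return new_cls


class GenericModel(Model, Generic[T]):
    x: Series[int]
    y: Series[T]


class IntModel(GenericModel[int]):
    ...


class FloatModel(GenericModel[float]):
    ...


print(get_type_hints(IntModel)["y"], get_type_hints(FloatModel)["y"])
```

Returning a real subclass from `__class_getitem__` keeps the concrete annotations available to whatever later builds the schema from the class, and caching avoids creating a new class on every subscription.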

I wasn't sure whether it'd be ideal for SchemaModel to natively support generic arguments, or whether it'd be better to create a sub-class like GenericSchemaModel that enables this functionality. Ultimately, I opted for the former, as the changes seemed fairly minimal.

I tried to follow the best-practice dev docs, but ran into a few issues:

  1. The FastAPI unit tests seem to fail
  2. The black formatting run by nox seemed to abort with an exception

Would love to get your thoughts on the utility of this and the prototype I've provided.

@codecov
codecov bot commented Mar 30, 2022

Codecov Report

Merging #810 (45e3b6f) into dev (9a463e1) will decrease coverage by 0.05%.
The diff coverage is 90.90%.

@@            Coverage Diff             @@
##              dev     #810      +/-   ##
==========================================
- Coverage   97.34%   97.28%   -0.06%     
==========================================
  Files          43       43              
  Lines        3990     4020      +30     
==========================================
+ Hits         3884     3911      +27     
- Misses        106      109       +3     
Impacted Files Coverage Δ
pandera/model.py 94.55% <90.90%> (-0.52%) ⬇️

@cosmicBboy
Collaborator

Thanks @tfwillems, this is awesome!

I'll review it over the next few days, but I'm on-board with the use case and solution overall. @jeffzi let me know if you have any thoughts here.

> The FastAPI unit tests seem to fail

CI seems to be happy. Can you run pip freeze in your environment? I can try to reproduce.

> The black formatting run by nox seemed to abort in an exception

In CI it seems to be isort that's the problem. You're seeing something different in your pre-commit runs?

@tfwillems
Contributor Author

The FastAPI tests may be failing because I'm running this on a server that often restricts reads/writes to other URLs, so that may be what's driving the failures of the 3 FastAPI tests.

When running pre-commit on the original dev branch of pandera, I obtained the following stack trace:

black....................................................................Failed
- hook id: black
- exit code: 1

Traceback (most recent call last):
  File "/cluster/home/willems/.cache/pre-commit/repob39yzst3/py_env-python3.9/bin/black", line 8, in <module>
    sys.exit(patched_main())
  File "/cluster/home/willems/.cache/pre-commit/repob39yzst3/py_env-python3.9/lib/python3.9/site-packages/black/__init__.py", line 1423, in patched_main
    patch_click()
  File "/cluster/home/willems/.cache/pre-commit/repob39yzst3/py_env-python3.9/lib/python3.9/site-packages/black/__init__.py", line 1409, in patch_click
    from click import _unicodefun
ImportError: cannot import name '_unicodefun' from 'click' (/cluster/home/willems/.cache/pre-commit/repob39yzst3/py_env-python3.9/lib/python3.9/site-packages/click/__init__.py)

Here's the result of pip freeze:

alabaster==0.7.12
anyio==3.5.0
argcomplete==1.12.3
asgiref==3.5.0
astroid==2.9.3
asv==0.5.1
async-timeout==4.0.2
attrs==21.4.0
Babel==2.9.1
beautifulsoup4==4.10.0
black==22.3.0
bleach==4.1.0
certifi==2021.10.8
cffi==1.15.0
cfgv==3.3.1
chardet==4.0.0
charset-normalizer==2.0.12
click==8.0.4
click-plugins==1.1.1
cligj==0.7.2
cloudpickle==2.0.0
codecov==2.1.12
colorama==0.4.4
colorlog==6.6.0
commonmark==0.9.1
coverage==6.3.2
cryptography==36.0.2
dask==2022.3.0
decorator==5.1.1
Deprecated==1.2.13
distlib==0.3.4
distributed==2022.3.0
docutils==0.17.1
execnet==1.9.0
fastapi==0.75.0
filelock==3.6.0
Fiona==1.8.21
frictionless==4.28.1
fsspec==2022.2.0
furo==2021.10.9
geopandas==0.10.2
grpcio==1.44.0
h11==0.13.0
HeapDict==1.0.1
hypothesis==6.40.0
identify==2.4.12
idna==3.3
imagesize==1.3.0
importlib-metadata==4.11.3
iniconfig==1.1.1
isodate==0.6.1
isort==5.10.1
jeepney==0.7.1
Jinja2==3.1.1
jsonschema==4.4.0
keyring==23.5.0
lazy-object-proxy==1.7.1
locket==0.2.1
marko==1.2.0
MarkupSafe==2.1.1
mccabe==0.6.1
modin==0.14.0
msgpack==1.0.3
munch==2.5.0
mypy==0.921
mypy-extensions==0.4.3
nodeenv==1.6.0
nox==2022.1.7
numpy==1.22.3
packaging==21.3
pandas==1.4.1
pandas-stubs==1.2.0.54
-e git+https://github.com/tfwillems/pandera.git@cd1861ca4b12ab83cc18db7266f3cccf5f20dbc9#egg=pandera
partd==1.2.0
pathspec==0.9.0
petl==1.7.8
pkginfo==1.8.2
platformdirs==2.5.1
pluggy==1.0.0
pre-commit==2.17.0
protobuf==3.19.4
psutil==5.9.0
py==1.11.0
py4j==0.10.9.3
pyarrow==7.0.0
pycparser==2.21
pydantic==1.9.0
Pygments==2.11.2
pylint==2.12.2
pyparsing==3.0.7
pyproj==3.3.0
pyrsistent==0.18.1
pyspark==3.2.1
pytest==7.1.1
pytest-asyncio==0.18.3
pytest-cov==3.0.0
pytest-forked==1.4.0
pytest-xdist==2.5.0
python-dateutil==2.8.2
python-multipart==0.0.5
python-slugify==6.1.1
pytz==2022.1
PyYAML==6.0
ray==1.7.0
readme-renderer==34.0
recommonmark==0.7.1
redis==4.2.0
requests==2.27.1
requests-toolbelt==0.9.1
rfc3986==2.0.0
scipy==1.8.0
SecretStorage==3.3.1
Shapely==1.8.1.post1
shellingham==1.4.0
simpleeval==0.9.12
six==1.16.0
sniffio==1.2.0
snowballstemmer==2.2.0
sortedcontainers==2.4.0
soupsieve==2.3.1
Sphinx==4.5.0
sphinx-autodoc-typehints==1.14.1
sphinx-copybutton==0.5.0
sphinx-panels==0.6.0
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
starlette==0.17.1
stringcase==1.2.0
tblib==1.7.0
text-unidecode==1.3
toml==0.10.2
tomli==2.0.1
toolz==0.11.2
tornado==6.1
tqdm==4.63.1
twine==3.8.0
typer==0.4.0
types-click==7.1.8
types-pkg-resources==0.1.3
types-PyYAML==6.0.5
types-requests==2.27.15
types-urllib3==1.26.11
typing-extensions==4.1.1
typing-inspect==0.7.1
urllib3==1.26.9
uvicorn==0.17.6
validators==0.18.2
virtualenv==20.14.0
webencodings==0.5.1
wrapt==1.13.3
xdoctest==1.0.0
zict==2.1.0
zipp==3.7.0

@jeffzi
Collaborator

jeffzi commented Apr 1, 2022

Thanks for your contribution @tfwillems. I agree it could open the doors to new applications, especially to create reusable intermediate models.

cosmicBboy force-pushed the feature/schema-model-generic-support branch from cd1861c to 7a36331 on April 19, 2022 02:04
@cosmicBboy
Collaborator

Hi @tfwillems thanks for the contribution!

It looks like a few areas in the new code aren't covered by tests, see here.

Merging this PR to dev now, but if you're down for it please add unit tests for those 3 uncovered cases!

cosmicBboy merged commit b438e15 into unionai-oss:dev on Apr 19, 2022
cosmicBboy added a commit that referenced this pull request Apr 29, 2022
* Adapt SchemaModel so that it can inherit from typing.Generic

* Extend SchemaModel to enable generic types in fields

* fix linter

Co-authored-by: Thomas Willems <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>
cosmicBboy added a commit that referenced this pull request May 26, 2022
* add imports to fastapi docs

* Add option to disallow duplicate column names (#758)

* ENH: add duplicate detection to dataframeschema

* ENH: propagate duplicate colnames check to schemamodel

* Add getter setter property

* make schemamodel actually work, update __str__

* fix __repr__ as well

* fix incorrect default value

* black formatting has changed

* invert parameter naming convention

* address other PR comments

* fix doctests, comma in __str__

* maybe fix sphinx errors

* fix ci and mypy tests

* Update test_schemas.py

* fix lint

Co-authored-by: cosmicBboy <[email protected]>

* Make SchemaModel use class name, define own config (#761)

* Make SchemaModel use class name, define own config

* fix

* fix

* fix

* fix tests

* fix lint and docs

* add test

Co-authored-by: cosmicBboy <[email protected]>

* implement coercion-on-initialization for DataFrame[SchemaModel] types (#772)

* implement coercion-on-initialization

* pylint

* Update tests/core/test_model.py

Co-authored-by: Matt Richards <[email protected]>

Co-authored-by: Matt Richards <[email protected]>

* update conda install instructions (#776)

* add documentation for pandas_engine.DateTime (#780)

* add documentation for pandas_engine.DateTime

* fix removed numpy_engine.Object doc

* set default n_failure_cases to None (#784)

* Update filtering columns for performance reasons. (#777)

* Update filtering columns for performance reasons.

* Update pandera/schemas.py

* Update schemas.py

* Update schemas.py

* Bugfix in schemas.py

Co-authored-by: Niels Bantilan <[email protected]>

* implement pydantic model data type (#779)

* make finding coerce failure cases faster (#792)

* make finding coerce failure cases faster

* fix tests

* remove unneeded import

* fix tests, coverage

* update docs for 0.10.0 (#795)

* add pyspark support, deprecate koalas (#793)

* add support for pyspark.pandas, deprecate koalas

* update docs

* add type check in pandas generics

* update docs

* clean up ci

* fix mypy, generics

* fix generic hack

* improve coverage

* Add overloads to `schema.to_yaml` (#790)

* Add overloads to `to_yaml`

* Update schemas.py

Co-authored-by: Niels Bantilan <[email protected]>

* add support for logical data types

* add initial support for decimal

* fix dtype check

* Feature: Add support for Generic to SchemaModel (#810)

* Adapt SchemaModel so that it can inherit from typing.Generic

* Extend SchemaModel to enable generic types in fields

* fix linter

Co-authored-by: Thomas Willems <[email protected]>
Co-authored-by: cosmicBboy <[email protected]>

* fix pandas_engine.DateTime.coerce_value not consistent with coerce (#827)

* pyspark docs fixes

* fix koalas link to pyspark

* bump version 0.10.1

* fix pandas_engine.DateTime.coerce_value not consistent with coerce

Co-authored-by: cosmicBboy <[email protected]>

* Refactor logical type check method

* add logical types tests

* add back conftest

* fix test_invalid_annotations

* fix ray initialization in setup_modin_engine

* fix logical type validation when output is an iterable

* add Decimal data type to pandera.__init__

* remove DataType.is_logical

* add logical types documentation

* Update dtypes.rst

* Update dtypes.rst

* increase coverage

* fix SchemaErrors.failure_cases with logical types

* fix modin compatibility for logical type validation

* fix prepare_series_check_output compatibility with pyspark

* fix mypy error

* Update dtypes.rst

Co-authored-by: cosmicBboy <[email protected]>
Co-authored-by: Matt Richards <[email protected]>
Co-authored-by: Sean Mackesey <[email protected]>
Co-authored-by: Ferdinand Hahmann <[email protected]>
Co-authored-by: Robert Craigie <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: Thomas Willems <[email protected]>
cosmicBboy added a commit that referenced this pull request Aug 10, 2022