feature/koalas-beta #651

Merged
merged 9 commits into from
Oct 15, 2021

Conversation

cosmicBboy
Collaborator

@cosmicBboy cosmicBboy commented Oct 13, 2021

Addresses one part of #601

This PR introduces support for koalas object validation (DataFrame, Series, and Index), so pandera can be used like so:

Install

pip install pandera[koalas]

Then validate away!

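import databricks.koalas as ks
import pandera as pa

# schema with per-column type and value checks; coerce=True casts columns to the declared types
schema = pa.DataFrameSchema(
    {
        "int_column": pa.Column(int, pa.Check.ge(0)),
        "float_column": pa.Column(float, pa.Check.le(0)),
        "str_column": pa.Column(str, pa.Check.isin(list("abcde"))),
    },
    coerce=True,
)
kdf = ks.DataFrame(
    {
        "int_column": range(10),
        "float_column": [float(-x) for x in range(10)],
        "str_column": list("aabbcceedd"),
    }
)
# validate the koalas dataframe against the schema
schema.validate(kdf)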

Notes:

  • The check_utils module adds several functions for checking the type of an object to validate. These should probably be moved somewhere else.
  • Checks for koalas objects are littered about the core schema validation logic, e.g. here and here. This tech debt will be cleaned up as part of #381 (Abstract out validation logic to support non-pandas dataframes, e.g. spark, dask, etc).
  • A few places use a footgun, i.e. they enable koalas features that negatively impact performance, e.g. here, which enables the koalas compute.ops_on_diff_frames config. This option allows computations across dataframes, which can involve expensive join operations. Clearing this tech debt would require someone more familiar with the koalas project.
  • This conditional checks for the presence of a pandera accessor class. Need to find a way around this for modin, which doesn't currently support the accessor extension utility. Koalas does, though, so that extension will need to be implemented in a future PR.
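
One way the type checks mentioned in the first two notes could work is to identify a koalas object from the module that defines its class, which avoids importing koalas when it is not installed. The sketch below is an assumption for illustration, not pandera's actual check_utils code: the helper name `is_koalas_object` is hypothetical, and `FakeKoalasFrame` is a stand-in so the snippet runs without koalas.

```python
def is_koalas_object(obj) -> bool:
    """Hypothetical helper: True if obj's class is defined under databricks.koalas."""
    module = type(obj).__module__
    return module == "databricks.koalas" or module.startswith("databricks.koalas.")


class FakeKoalasFrame:
    """Stand-in for a koalas DataFrame so the example runs without koalas."""

    # Pretend this class was defined inside the koalas package.
    __module__ = "databricks.koalas.frame"


koalas_like = is_koalas_object(FakeKoalasFrame())  # class claims to live in koalas
plain_list = is_koalas_object([1, 2, 3])           # builtins.list does not
```

A check of this kind keeps the optional dependency optional, at the cost of relying on the package's module layout.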

This feature is in beta, so many bugs are expected.

cosmicBboy and others added 6 commits September 22, 2021 08:41
* Strategies should not rely on pandas dtype aliases (#620)

* add test for strategy with pandas.DatetimeTZDtype using a datetime.tzinfo

* avoid coercing with string alias in strategies

* support timedelta in data synthesis strats (#621)

* fix multiindex error reporting (#622)

* Pin pylint (#629)

* bump pre-commit pylint version

* pin pylint

* remove setuptools pins

* setup.py setuptools

* add back setuptools dep

* update ci build

* update build

* update nox build

* update nox build

* exclude np.float128 type registration in MacM1 (#624)

* exclude np.float128 type registration in MacM1

* replace windows/mac m1 checks with float128 check

* fix numpy_pandas_coercible bug dealing with single element (#626)

* fix numpy_pandas_coercible bug dealing with single element

* add test

* remove empty case

* update pylint (#630)

* unpin pylint, remove setuptools constraint

* bump cache

* install simpleeval in noxfile

* re-pin pylint

* fix lint

* nox uses setuptools < 58.0.0

Co-authored-by: Jean-Francois Zinque
* add test for all pandas-compatible numpy dtypes

* add support for np.bytes_

* add support for rare object aliases

* add support for platform-specific numpy dtypes
* bugfix: support nullable empty strategies

fix #634

* update black, mypy

* hypothesis health check

* fix
fixes #640. This PR improves the performance of schema strategies that
involve nullable fields. Instead of a 10x performance hit it's a
2x performance hit for specifying a nullable column.
* reuse coerce logic in engines.utils

* add test_coerce_error

* rename coerce to try_coerce and _coerce to coerce
@cosmicBboy cosmicBboy changed the base branch from master to dev October 13, 2021 05:12
@codecov

codecov bot commented Oct 13, 2021

Codecov Report

Merging #651 (9ed4496) into dev (c786f67) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #651      +/-   ##
==========================================
+ Coverage   98.85%   98.94%   +0.08%     
==========================================
  Files          30       31       +1     
  Lines        3398     3497      +99     
==========================================
+ Hits         3359     3460     +101     
+ Misses         39       37       -2     
Impacted Files                      Coverage Δ
pandera/model.py                    100.00% <ø> (ø)
pandera/__init__.py                 100.00% <100.00%> (ø)
pandera/check_utils.py              100.00% <100.00%> (ø)
pandera/checks.py                   98.50% <100.00%> (ø)
pandera/engines/pandas_engine.py    99.32% <100.00%> (+0.02%) ⬆️
pandera/engines/utils.py            100.00% <100.00%> (ø)
pandera/error_formatters.py         92.00% <100.00%> (-3.46%) ⬇️
pandera/errors.py                   100.00% <100.00%> (ø)
pandera/external_config.py          100.00% <100.00%> (ø)
pandera/schema_components.py        99.54% <100.00%> (+0.02%) ⬆️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
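
The percentages in the coverage diff follow from hits divided by lines. A quick check using only the numbers in the report (the function name `coverage_pct` is introduced here for illustration):

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


# Figures taken from the coverage diff: (hits, lines) before and after the PR.
before = coverage_pct(3359, 3398)
after = coverage_pct(3460, 3497)
delta = after - before

# Sanity check from the same table: lines = hits + misses.
assert 3359 + 39 == 3398
assert 3460 + 37 == 3497

print(f"{before:.2f}% -> {after:.2f}% (change: {delta:+.4f} percentage points)")
```

The computed change falls between 0.08 and 0.09 percentage points; the +0.08% in the report presumably reflects how the bot rounds or truncates the figure.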

* improve lazy validation performance for nullable cases

fixes #652

This PR fixes an issue where setting `lazy=True` with a schema
where `nullable=False` and there are a lot of null values causes
severe performance issues in the ~500,000 row dataframe case.

The fix is to drop duplicates when aggregating failure cases
and to remove unnecessary data processing of lazily collected
failure cases.
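
The idea behind the fix can be illustrated without pandas: when many rows fail the same check with the same value (for example, a large number of nulls in a non-nullable column), collapsing identical failure cases early keeps the aggregated report small. This is a simplified sketch, not the code in this PR; the `FailureCase` record and `aggregate_failure_cases` function are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional


class FailureCase(NamedTuple):
    """Hypothetical record of one failed check (illustration only)."""

    column: str
    check: str
    failure_value: Optional[object]


def aggregate_failure_cases(cases: Iterable[FailureCase]) -> List[FailureCase]:
    """Drop duplicate failure cases, keeping first-seen order."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(cases))


# Many identical null failures plus one distinct failure.
cases = [FailureCase("int_column", "not_nullable", None)] * 500_000
cases.append(FailureCase("int_column", "greater_than_or_equal_to(0)", -1))

unique_cases = aggregate_failure_cases(cases)
```

Here the 500,001 collected cases collapse to two distinct entries before any further processing takes place.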

* reintroduce sorting/dropping of duplicates
add tests for koalas

fix type issues with koalas patch to pd.Series, DataFrame

add datatype koalas tests

finish writing initial test suite for koalas

fix regressions

configure koalas

fix regressions

update pylint dep

update deps

update black

fix lint

use context manager for koalas ops_on_diff_frames

updates

update pre-commit mypy

typing ignore

fix docs

install hypothesis for koalas ci

don't cover modin import check

better handling of timestamp

fix koalas

wip

wip

wip

coverage

hypothesis health check
@cosmicBboy cosmicBboy changed the base branch from dev to release/0.8.0 October 15, 2021 12:50
@cosmicBboy cosmicBboy merged this pull request into release/0.8.0 Oct 15, 2021
@cosmicBboy cosmicBboy deleted the feature/koalas-beta branch October 15, 2021 12:57