How to perform data checking without errors? #249

Closed
UGuntupalli opened this issue Jul 22, 2020 · 5 comments
Labels
question Further information is requested

Comments

@UGuntupalli

Question about pandera

Thank you for your work developing and maintaining the package. I was doing some research about data cleaning/checking and found the pyjanitor package very helpful. As I started to discuss an issue with the owner of that package, he suggested checking out pandera (pyjanitor-devs/pyjanitor#703).

I think pandera has a great structure and I would love to use it. Could you kindly help answer some simple questions for me? The code sample pasted below is from the package docs. My questions are:

  1. Is there a way to generate a boolean instead of an error when the logical condition being validated fails? I see that there is a way to operate with nulls in the data, and I am thinking along the same lines. E.g., if I wanted to validate whether the values in column 1 are <= 9, currently an error is generated. Can I instead get a column of booleans?
  2. Do you have any functionality that specifically targets time-series data, e.g. checking for duplicate or missing timestamps in the index, or whether the timestamps are monotonic?
  3. Do you have any functionality that would enable flagging stuck data or arbitrary jumps in the data?
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(pa.Int, checks=pa.Check.less_than_or_equal_to(10)),
    "column2": pa.Column(pa.Float, checks=pa.Check.less_than(-1.2)),
    "column3": pa.Column(pa.String, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema.validate(df)
print(validated_df)
@UGuntupalli added the question label on Jul 22, 2020
@cosmicBboy
Collaborator

cosmicBboy commented Jul 23, 2020

Hi @UGuntupalli, thanks for your interest in this project!

Is there a way to generate a boolean instead of an error when the logical condition being validated fails?

Currently, pandera only reports on failure cases, i.e. where a pandera.Check returns a False scalar or pd.Series vector.

One thing you may find useful is the try...except pattern to catch the error and access the failure cases:

try:
    schema(df)
except pa.errors.SchemaError as exc:
    exc.data  # contains the invalid data
    exc.failure_cases  # contains a dataframe of failure cases

You can even do this with lazy validation to obtain the errors from all the Checks at once.

I'm curious what your use case is for obtaining the boolean result of data validation: is it for visualization or further analysis? I'm currently considering adding an interface for obtaining boolean/boolean-vector validation results directly instead of catching errors, and it would be helpful to understand your use case as I'm fleshing that idea out.

Do you have any functionality that specifically targets time-series data, e.g. checking for duplicate or missing timestamps in the index, or whether the timestamps are monotonic?

Not yet, but you're the first person to bring this up and I think it's a good idea :) (feel free to open up a feature request)

  • In terms of checking for duplicates, Column(..., allow_duplicates=False) and Index(..., allow_duplicates=False) should have you covered
  • For missing timestamps in the index, you can use Index(..., nullable=False) (I think it's False by default already)
  • For monotonicity, the Check interface is quite flexible and you can express these checks fairly easily, I think:
import numpy as np

# wrap each comparison in .all() so the check returns a boolean scalar
is_strictly_increasing = pa.Check(lambda s: (np.diff(s) > 0).all())   # monotonically increasing (strict)
is_increasing = pa.Check(lambda s: (np.diff(s) >= 0).all())           # monotonically increasing (non-strict)
is_strictly_decreasing = pa.Check(lambda s: (np.diff(s) < 0).all())   # monotonically decreasing (strict)
is_decreasing = pa.Check(lambda s: (np.diff(s) <= 0).all())           # monotonically decreasing (non-strict)
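Note that np.diff returns an array one element shorter than its input, so wrapping the comparison in .all() yields the single boolean a Check can consume; pandas also exposes the non-strict variants directly, e.g.:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 3])

assert not (np.diff(s) > 0).all()  # not strictly increasing (the 2 repeats)
assert (np.diff(s) >= 0).all()     # non-strictly increasing
assert s.is_monotonic_increasing   # pandas built-in, equivalent to the non-strict check
```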

Do you have any functionality that would enable flagging stuck data or arbitrary jumps in the data?

What do stuck data and arbitrary jumps in the data mean? (Small example data would help a lot.) If one can express it in terms of a function that takes a pd.Series or pd.DataFrame as input and outputs a boolean scalar, Series, or DataFrame, pandera.Check can support it :). Once we have good definitions, it's just a matter of adding these into the official built-in Check interface.

Edit: just read your descriptions of this in pyjanitor-devs/pyjanitor#703, and it seems like the stuck and jump checks should be fairly straightforward to implement
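For concreteness, here is one rough sketch of stuck/jump detection in plain pandas (the run_length and max_step values are illustrative assumptions, not part of pandera); either function returns a boolean Series the same length as the input, so it could be wrapped in a pa.Check:

```python
import pandas as pd

def flag_stuck(s: pd.Series, run_length: int = 3) -> pd.Series:
    """True at points where the value has not changed for `run_length` samples."""
    same_as_prev = s.diff().eq(0).astype(int)
    return same_as_prev.rolling(run_length - 1).sum().ge(run_length - 1)

def flag_jumps(s: pd.Series, max_step: float = 5.0) -> pd.Series:
    """True where the absolute step from the previous sample exceeds `max_step`."""
    return s.diff().abs().gt(max_step)

s = pd.Series([1.0, 5.0, 5.0, 5.0, 30.0])
print(flag_stuck(s).tolist())   # [False, False, False, True, False]
print(flag_jumps(s).tolist())   # [False, False, False, False, True]
```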

@UGuntupalli
Author

@cosmicBboy,
The primary use case for my needs is to use the filters for visualization. These customizable, automated filters would enable easy visualization of obvious issues in time-series data, leaving manual/human intervention for issues that are subtle or not so obvious. The request to return a boolean, the same size as the input in most cases, is to enable easy visualization as well, but also to allow for custom treatment based on the filters rather than directly modifying the raw data. E.g., for certain data sets I would prefer to just filter out bad data and replace it with np.nan, but in other cases I would try to fill in the data by interpolating from the existing data. Let me know if you want to show me the ropes; I am glad to contribute to the package and help advance this functionality.

@cosmicBboy
Collaborator

cosmicBboy commented Jul 24, 2020

Thanks for the description! That makes sense.

Based on your descriptions, would it be accurate to call your use case "filtering, masking, and imputation"?

Currently, pandera is designed to handle data validation. The only parsing it does is via the coerce keyword argument, which simply coerces the datatype of each column. One thing I'd want to understand better is your workflow for filtering the data: do you think you could put together a minimal, reproducible code example of how your workflow would go using pandas only? I think that would shed light on whether pandera can be extended in functionality.

I actually think there may be a nice way that pandera and pyjanitor can play well together to fulfill your use case with the current pandera interface. Since this is a big departure from the current scope of the package, I'd like to make sure the abstractions are appropriate to do this without any major additions in functionality, and if not, we'd then have to consider what an extended API for this would look like.

@cosmicBboy
Collaborator

cosmicBboy commented Jul 24, 2020

The pandera.Check interface only needs a slight modification to return the boolean scalar, Series, or DataFrame that results from a particular check, which I think would fulfill the visualization use case that you have. For example, something like:

# check_result is a namedtuple containing various metadata,
# including the boolean Series/DataFrame, where False
# indicates elements that did not pass the check
check_result = pa.Check.is_monotonic(increasing=True, strict=True)(df.index)

# do something with the boolean Series
check_result.output.mean()    # fraction of passing elements
check_result.output.sum()     # number of passing elements
(~check_result.output).sum()  # number of failing elements

I've been thinking about this for some time, but I think a more functional interface for checks would better support your use case of getting boolean outputs, where the user is responsible for using them to visualize/filter/impute the data.

from pandera import checks

bool_series = checks.is_monotonic(df["time_column"], increasing=True, strict=True)

One thing that I think would be cool, but would be a fairly significant undertaking, would be a pandera.Parser class that modifies the data as a function of the check_result.output boolean data. This bleeds a little into data cleaning, but I'm wondering if a clear and consistent API would fit in here; I'd be curious if you have any thoughts.
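As a plain-pandas sketch of what such a parsing step might do with a boolean check result (the `bad` mask here is a stand-in for a hypothetical check_result.output):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 50.0, 4.0, 5.0])
bad = s > 10  # stand-in for a check's boolean output (True = failing)

# option 1: mask failing values with NaN
masked = s.mask(bad)                 # [1.0, 2.0, NaN, 4.0, 5.0]

# option 2: mask, then fill by interpolating from neighboring values
imputed = s.mask(bad).interpolate()  # [1.0, 2.0, 3.0, 4.0, 5.0]
```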

@UGuntupalli
Author

UGuntupalli commented Jul 25, 2020

@cosmicBboy,
Please refer to pyjanitor-devs/pyjanitor#703 for a proposed workflow. I think it is better to hash it out with both you and @ericmjl in this issue, and then get going on the work.
