How to perform data checking without errors? #249
Hi @UGuntupalli, thanks for your interest in this project!

One thing you may find useful is the `try`/`except` pattern:

```python
try:
    schema(df)
except pa.errors.SchemaError as exc:
    exc.data  # contains the invalid data
    exc.failure_cases  # contains a dataframe of failure cases
```

You can even do this with lazy validation to obtain the errors from all the checks. I'm curious what your use case is for obtaining the boolean result of data validation: is it for visualization/further analysis purposes? I'm currently considering adding an interface for obtaining boolean/boolean-vector validation results directly instead of catching errors, and it would be helpful to understand your use case as I'm fleshing that idea out.
Not yet, but you're the first person to bring this up and I think it's a good idea :) (feel free to open up a feature request)
```python
import pandera as pa
import numpy as np

is_strictly_increasing = pa.Check(lambda s: np.diff(s) > 0)   # monotonically increasing (strict)
is_increasing = pa.Check(lambda s: np.diff(s) >= 0)           # monotonically increasing (non-strict)
is_strictly_decreasing = pa.Check(lambda s: np.diff(s) < 0)   # monotonically decreasing (strict)
is_decreasing = pa.Check(lambda s: np.diff(s) <= 0)           # monotonically decreasing (non-strict)
```
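To show the `np.diff`-based logic behind those checks in isolation, here is a plain-numpy illustration (the helper name `monotonic_mask` is made up for this sketch, not pandera API):

```python
import numpy as np

def monotonic_mask(values, increasing=True, strict=True):
    """Element-wise monotonicity flags based on np.diff.

    The first element has no predecessor, so it is treated as passing.
    """
    diffs = np.diff(np.asarray(values))
    if increasing:
        ok = diffs > 0 if strict else diffs >= 0
    else:
        ok = diffs < 0 if strict else diffs <= 0
    return np.concatenate([[True], ok])

print(monotonic_mask([1, 2, 2, 3], strict=False))  # non-strict: repeated value passes
print(monotonic_mask([1, 2, 2, 3], strict=True))   # strict: repeated value fails
```

Note that `np.diff` returns an array one element shorter than the input, which is why the helper pads the first position.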
What do "stuck" or arbitrary jumps in the data mean? (Small example data would help a lot.) If one can express it in terms of a function that takes a …

Edit: just read your descriptions of this in pyjanitor-devs/pyjanitor#703 and it seems like the …
@cosmicBboy,
Thanks for the description! That makes sense. Based on your descriptions, would it be accurate to call your use case "filtering, masking, and imputation"? Currently, I actually think there may be a nice way that …
The …

```python
# check_result is a namedtuple containing various metadata,
# including the boolean Series/DataFrame, where False
# indicates elements that did not pass the check
check_result = pa.Check.is_monotonic(increasing=True, strict=True)(df.index)

# do something with the boolean Series
check_result.output.mean()
check_result.output.sum()
(~check_result.output).sum()  # count of elements failing the check
```

I've been thinking about this for some time, but I think a more functional interface for checks would look like:

```python
from pandera import checks

bool_series = checks.is_monotonic(df["time_column"], increasing=True, strict=True)
```

One thing that I think would be cool, but would be a fairly significant undertaking, would be to have a …
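To make the "filtering, masking, and imputation" use case concrete, here is a small pandas-only sketch of what one might do with such a boolean validation result (the jump check here is invented for illustration and is not pandera API):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 100.0, 3.0, 4.0])  # 100.0 is an arbitrary jump

# boolean result of a hypothetical "no large jumps" check:
# True where the absolute step from the previous value is < 10
passed = s.diff().abs().fillna(0) < 10

filtered = s[passed]                     # filtering: drop failing elements
masked = s.where(passed)                 # masking: failing elements become NaN
imputed = s.where(passed).interpolate()  # imputation: fill masked values linearly

print(imputed.tolist())
```

Note that the jump flags two elements (the spike itself and the step back down), so both get masked and re-imputed; a real check would likely be more careful about which element to blame.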
@cosmicBboy,
Question about pandera

Thank you for the work in developing and maintaining the package. I was doing some research about data cleaning/checking (https://stackoverflow.com/questions/63023783/are-there-any-python-packages-for-time-series-cleaning). I found the pyjanitor package very helpful. As I started to discuss an issue with the owner of the package, he suggested checking pandera (pyjanitor-devs/pyjanitor#703). I think pandera has a great structure and I would love to use it. Could you kindly help answer some simple questions for me? The code sample pasted below is from the package docs. My questions are: …