Add ability to generate boolean flags about data quality #703
Hi @UGuntupalli! Thanks for chiming in. I'm super glad you like the package 😄. For data checks, I only recently found the package pandera. In "checking" data, we are asserting that properties of the data are correct, which I think is what you're trying to accomplish with the code above. Thinking in deeper theoretical/probabilistic terms, there are parallels to evaluating the likelihood of data against an assumed probability distribution model: if the data fall outside the support of the assumed model, we error out. In "cleaning" data, we are not evaluating data against an assumed model of how the data ought to be, but rather changing the distribution and/or indexing of the data. In the R world, "mutate" is used as the overarching notion; in pyjanitor, we've adopted the Pythonic philosophy of being explicit, and we provide a library of functions for that.

Early on, I was tempted to add data checks to pyjanitor, but once I saw pandera, gave it a test drive, and thought about the distinction for a few more days, I could see why I would instead defer the checks to pandera. That said, this doesn't preclude some of the good ideas you've raised from making it into the library.

I have a few function ideas inspired by what you wrote. Perhaps you can critique them, and if you're up for contributing, I'd be happy to help onboard you through the development process to get them in. (It's designed to be simultaneously beginner-friendly to get started, but also beginner-educational, in that we do adhere to "good software development practices" that most beginners won't have encountered. If you're seasoned with code, you'll be right at home!) And if you have other ideas for the cleaning part of data analysis, please feel free to suggest them!

Idea 1:
@ericmjl,

```python
import pandas as pd
import numpy as np

date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=len(date_rng))
print(df)
```

The above code would generate something like the dataframe shown below. Now, before the user uses the function:
```
                   date  data
0   2018-01-01 00:00:00    67
1   2018-01-01 01:00:00    11
2   2018-01-01 02:00:00    75
3   2018-01-01 03:00:00    78
4   2018-01-01 04:00:00    35
..                  ...   ...
164 2018-01-07 20:00:00    19
165 2018-01-07 21:00:00    45
166 2018-01-07 22:00:00    14
167 2018-01-07 23:00:00    67
168 2018-01-08 00:00:00    96
```

Based on your explanation, I would think leaning towards (1) or (2) would make sense, but I am curious to hear what your thoughts are. Lastly, as for contributing to the code base: while I can't make a firm time commitment, if you can show me the ropes, I am happy to start slow with one function at a time. Curious to hear your thoughts on the above. Cheers
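One simple boolean quality check for a timestamped frame like the one above, in the spirit of this discussion, is detecting missing hourly timestamps. This is an illustrative sketch, not part of pyjanitor; the helper name `find_missing_timestamps` is made up here.

```python
import pandas as pd

def find_missing_timestamps(df, column="date", freq="H"):
    # Build the full expected range between the observed min and max,
    # then subtract the timestamps actually present
    expected = pd.date_range(df[column].min(), df[column].max(), freq=freq)
    return expected.difference(pd.DatetimeIndex(df[column]))

rng = pd.date_range("2018-01-01", periods=6, freq="H")
df = pd.DataFrame({"date": rng, "data": range(6)})
df_gappy = df.drop(index=3)  # simulate a missing hour (03:00)
missing = find_missing_timestamps(df_gappy)
```

`missing` is a `DatetimeIndex` of the expected-but-absent timestamps, which could back a boolean "has gaps" flag via `len(missing) > 0`.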
@ericmjl:

```python
import pandas as pd
from random import random

# Build a random data set
ts_index = pd.date_range('1/1/2000', periods=1000, freq='T')
v1 = [random() for i in range(1000)]
v2 = [random() for i in range(1000)]
v3 = [random() for i in range(1000)]
ts_df = pd.DataFrame({'v1': v1, 'v2': v2, 'v3': v3}, index=ts_index)
```
```python
# Test for timestamps
def test_for_monotonicity(input_data, column_name=None, direction='increasing'):
    """
    Tests whether the pd.DataFrame's datetime index (or a datetime column) is
    monotonically increasing or decreasing, depending on the requested direction.

    :param pd.DataFrame input_data: dataframe to be tested
    :param str column_name: needs to be specified if and only if the datetime is not in the index; defaults to None
    :param str direction: direction in which monotonicity is tested; defaults to 'increasing'
    :return: single boolean flag indicating whether the test has passed or not
    :rtype: bool
    """
    # Test if the input is a dataframe
    assert isinstance(input_data, pd.DataFrame), "input_data must be a pandas DataFrame"
    # Standardize so that the timestamps live in a column named "idx"
    if column_name is None:
        assert isinstance(input_data.index, pd.DatetimeIndex), "index should be a pandas DatetimeIndex"
        input_data = input_data.rename_axis("idx").reset_index()
    else:
        assert pd.api.types.is_datetime64_any_dtype(input_data[column_name]), \
            "column_name should refer to a datetime column"
        input_data = input_data.rename(columns={column_name: "idx"})
    # Test for monotonicity; note these are properties, not methods
    if direction == 'increasing':
        return input_data["idx"].is_monotonic_increasing
    elif direction == 'decreasing':
        return input_data["idx"].is_monotonic_decreasing
    else:
        raise ValueError("Argument direction accepts only 'increasing' or 'decreasing'")
```
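For reference, the monotonicity checks themselves are one-liners in pandas; a minimal standalone illustration of the built-in properties (attributes, not methods):

```python
import pandas as pd

idx = pd.date_range("2000-01-01", periods=5, freq="T")
s = pd.Series(range(5), index=idx)

# A freshly built date_range is already sorted ascending
increasing = s.index.is_monotonic_increasing
decreasing = s.index.is_monotonic_decreasing

# Reorder the rows explicitly to break monotonicity
shuffled = s.iloc[[2, 0, 4, 1, 3]]
shuffled_increasing = shuffled.index.is_monotonic_increasing
```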
```python
def sort_monotonically(input_data, column_name=None, direction='increasing'):
    """
    Sorts the input dataframe monotonically in the desired direction. It assumes
    the dataframe has a pd.DatetimeIndex, unless column_name is given.

    :param pd.DataFrame input_data: dataframe to be sorted
    :param str column_name: needs to be specified if and only if the datetime is not in the index; defaults to None
    :param str direction: direction in which monotonicity is desired; defaults to 'increasing'
    :return: dataframe with its index sorted
    :rtype: pd.DataFrame
    """
    already_monotonic = test_for_monotonicity(
        input_data=input_data,
        column_name=column_name,
        direction=direction,
    )
    if already_monotonic:
        return input_data
    # Standardize the dataframe: move the datetime column into the index
    if column_name:
        input_data = input_data.set_index(column_name)
    # sort_index returns a new frame; ascending=False gives decreasing order
    return input_data.sort_index(ascending=(direction == "increasing"))
```
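At the core of the sorting function above is plain pandas; a small self-contained sketch of the same idea, using an explicitly scrambled frame (the row order below is invented for demonstration):

```python
import pandas as pd
from random import random, seed

seed(0)
ts_index = pd.date_range("1/1/2000", periods=10, freq="T")
ts_df = pd.DataFrame({"v1": [random() for _ in range(10)]}, index=ts_index)

# Scramble the rows, then restore monotonic order in either direction
scrambled = ts_df.iloc[[5, 1, 9, 0, 3, 7, 2, 8, 4, 6]]
restored_inc = scrambled.sort_index()                 # increasing
restored_dec = scrambled.sort_index(ascending=False)  # decreasing
```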
Nice, @UGuntupalli! Thanks for pasting in the "strawman" pieces. Sorry for the delay in getting back, I suddenly found myself with 5 talks in 8 days, with no idea how I got into that situation ^_^", but my OSS Fridays are back in order, so let me try to address your questions.
Now that I've seen the strawman pieces, here's roughly how I'd structure things. First, schemas live in their own module:

```python
# project_source/data/schemas.py
import pandera as pa

some_schema = pa.DataFrameSchema(...)
```

Then, we have a "data loading function", in which all of the data processing steps are also encoded. This provides a nice shortcut to our dataset.

```python
# project_source/data/loaders.py
import pandas as pd
import janitor
from .schemas import some_schema
from pandera import check_output

@check_output(some_schema)
def load_some_data():
    data = (
        pd.read_csv(...)
        .some_data_cleaning_funcs(...)
        .some_other_data_cleaning_func(...)  # keep chaining!
    )
    return data
```

With pandera checking our data loading functions, there are runtime checks on every load. (If the checks get too slow, one can make them optional rather than automatic by removing the decorator.)

Now, in reality, our workflow isn't so linear. The reality is that we might prototype the data loading function as we go. So I think what might actually happen is that a user loads some data, finds that it's dirty, cleans it up a bit using some mix of pandas and pyjanitor functions, and then sticks those steps in the data loading function... and rinse and repeat. At least, this has been the workflow I find myself in. That's where the time series data cleaning functions can help, in the interactive cleaning phase; when the cleaning feels lightweight and satisfying, those functions can also become part of the data loading function. Meanwhile, I like your proposed functions.

As for onboarding: firstly, fork the repository to your GitHub username. At the moment, only core contributors with commit rights are given access to the main repo, so newcomer contributors use forks as a sandbox isolated from the main repo. There are setup instructions at https://pyjanitor.readthedocs.io/contributing.html. A lot of first-time contributors helped make that page, so I have a good prior that it'll help a first-time contributor, but if you find warts in there, I'd encourage you to note them down and sneak fixes into the code PR alongside. Others on the team should be able to chime in here if you have issues getting set up. (I'm particularly bad at Windows, for example, as I don't have access to a Windows machine. Old Windows laptops that I get my hands on very quickly become Linux laptops.)

We'll end up doing a review of the code on the PR tracker. @samukweku has been the most recent recipient of very in-depth and detailed reviews, and he knows how much we care about code quality 😄, so just know that you might end up with a bunch of review comments from the rest of the dev team.
There should be sufficient tooling to automate away some of the picky stuff, like code formatting. Reviews might take a bit of time, as we're all working on this on a volunteer basis, but I generally try to get my review in within 2-3 days of the PR being opened, and I ask for at least 2 reviews on the code, since the library has grown quite a bit. The library is intentionally developed slowly, btw, so that we can let some ideas simmer before making a decision. And if you feel like you want to hash out the API a bit more here before working on the PR, that's not a problem either, though I'd encourage you to open the PR and work it out on the PR thread. Having something "lingering" and "open" increases the impetus for us to get things in 😄. Alrighty, ball back in your court, @UGuntupalli! I appreciate you chiming in. Let's make something good happen!
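To illustrate the decorated-loader pattern described above without assuming pandera is installed, here is a hand-rolled stand-in: a decorator that runs a validation function over a loader's output. The names `validated_by` and `load_some_data`, and the inline CSV, are invented for this sketch; in the real workflow `pandera.check_output` plays this role.

```python
import functools
import io
import pandas as pd

def validated_by(check):
    """Decorator: run check(df) on a loader's returned dataframe; raise on failure."""
    def decorator(loader):
        @functools.wraps(loader)
        def wrapper(*args, **kwargs):
            df = loader(*args, **kwargs)
            if not check(df):
                raise ValueError(f"{loader.__name__} returned data failing validation")
            return df
        return wrapper
    return decorator

CSV = "date,data\n2018-01-01,5\n2018-01-02,7\n"

@validated_by(lambda df: (df["data"] >= 0).all())
def load_some_data():
    # The sort is a stand-in for a longer method chain of cleaning steps
    return (
        pd.read_csv(io.StringIO(CSV), parse_dates=["date"])
        .sort_values("date")
    )

df = load_some_data()
```

Every call to `load_some_data` now validates its output at runtime, which mirrors the "runtime checks on every load" point above.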
@ericmjl & @cosmicBboy: a brief description of the workflow as steps, in case they are not clear in the above screenshot:

However, since I don't have prior experience in delivering packages, I would lean on @ericmjl and @cosmicBboy to provide guidance to refine the proposed workflow, so I can initiate a PR and get started. Thoughts?
Hi @UGuntupalli! For the workflow, I read through it, and I found steps 3 and 4 to be unclear. In particular, "capture" is used in an imprecise fashion, and I'm unsure what action gets taken after "capturing". Additionally, what "ability to apply and populate the filtered data frame in pyjanitor" means is also unclear.

Because I was left scratching my head over the workflow, I instead took a bit more time to read through your discussion with @cosmicBboy on unionai-oss/pandera#249, and I think I'm starting to understand where you're coming from. The core of what you're interested in is being able to report data errors, rather than simply validating data or actively cleaning data, is that right? Reporting data errors means scientists/engineers can do some interesting things for our "clients", such as giving them HTML reports of where things went wrong. Reporting data errors requires a known schema defined beforehand, which is something pandera provides.

Thinking about this, I definitely think you have a series of PRs that could be made. I'd suggest a PR to pyjanitor that starts with one of the flagging functions. As an example:

```python
def _flag_non_monotonic(series, increasing=True, strict=False, complement=False):
    """Utility function to flag rows that are not monotonically
    increasing/decreasing relative to the previous row."""
    # Use .to_numpy() so the comparison is positional, not index-aligned
    t_0 = series.iloc[0:-1].to_numpy()
    t_1 = series.iloc[1:].to_numpy()
    if increasing and strict:
        check = t_1 <= t_0  # violates strictly increasing
    elif increasing and not strict:
        check = t_1 < t_0   # violates non-strictly increasing
    elif not increasing and strict:
        check = t_1 >= t_0  # violates strictly decreasing
    else:
        check = t_1 > t_0   # violates non-strictly decreasing
    if complement:
        check = ~check
    return pd.Series(check, index=series.index[1:])
```
```python
@pf.register_dataframe_method
def flag_non_monotonic(df, column_to_check, new_column_name, increasing=True, strict=False, complement=False):
    """Add a column that flags which rows are non-monotonic w.r.t. the previous row."""
    check = _flag_non_monotonic(
        df[column_to_check], increasing=increasing, strict=strict, complement=complement
    )
    # The first row has no predecessor, so its flag is NaN;
    # Series.append is deprecated, so build the flags with pd.concat
    flags = pd.concat([pd.Series([np.nan]), check], ignore_index=True)
    flags.index = df.index
    return df.add_column(new_column_name, flags)
```

As another example, flagging "stuck" values:

```python
def _flag_stuck(series):
    """Flag rows whose value equals the previous row's value."""
    t_0 = series.iloc[0:-1].to_numpy()
    t_1 = series.iloc[1:].to_numpy()
    stuck = t_0 == t_1
    return pd.Series(stuck, index=series.index[1:])


@pf.register_dataframe_method
def flag_stuck(df, column_name, new_column_name):
    stuck = _flag_stuck(df[column_name])
    flags = pd.concat([pd.Series([np.nan]), stuck], ignore_index=True)
    flags.index = df.index
    return df.add_column(new_column_name, flags)
```

Quoting from @cosmicBboy in unionai-oss/pandera#249, the key interface that matters is this:
The utility functions can help where pandas does not provide them; otherwise I would defer to pandas' built-ins inside pandera checks:

```python
schema = pa.DataFrameSchema(
    columns={
        "monotonic_column": pa.Column(..., checks=pa.Check(lambda s: s.is_monotonic_increasing)),
        "non_stuck_column": pa.Column(..., checks=pa.Check(lambda s: ~_flag_stuck(s))),
    }
)
```

In that way, both data validation and bad data reporting can be satisfied. Depending on whether @cosmicBboy wants something that validates "stuck" values in pandera or not (there could be a consideration to keeping the package extensible rather than "god-like"), you could make a PR there as well.

You'll notice I mentioned nothing about doing the visualization part. As I've learned, viz is hard to standardize; everybody wants their own take :). It's best to delegate this to the end-user. I think you've got a good core of ideas for a timeseries data cleaning module, with appropriate utility functions that can be easily slotted in.
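As a pandas-only illustration of the two checks discussed here (no pandera required): `is_monotonic_increasing` covers the monotonicity check, and `diff() == 0` is a compact way to flag "stuck" repeats. The sample data below is invented.

```python
import pandas as pd

monotonic = pd.Series([1, 2, 3, 4])
sensor = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0])

# Scalar pass/fail for the whole column
mono_ok = monotonic.is_monotonic_increasing

# Row-wise mask: True wherever a value repeats its predecessor;
# the first element has no predecessor and is never "stuck"
stuck_mask = sensor.diff() == 0
stuck_count = int(stuck_mask.sum())
```

A validation check can consume the scalar (`mono_ok`), while an error report can consume the row-wise mask (`stuck_mask`), which is exactly the validation-vs-reporting split above.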
Hey @UGuntupalli and @ericmjl, thanks for developing this idea further! One question I have, @ericmjl, is whether it's common/idiomatic in pyjanitor to keep private utility functions like `_flag_stuck` alongside the public, method-chained ones.
Let us see if I can take one last stab at understanding everyone's recommendations, so I don't open multiple PRs in the wrong places. First, to clarify @ericmjl's concerns around clarity, let me try to offer a better explanation of steps 3 and 4, even though we may not really use them anymore:

As for @cosmicBboy's comments:

If you both agree, here is what I am going to do:

Are we on the same page? As soon as you confirm, I will start opening PRs unless there is a disagreement.
Wonderful discussion, everybody 😄. This is why I like dabbling in the OSS world!
Yes, the private functions are idiomatic! There are some floating around in the library. I believe our contributors have placed them in logical places, such as right below the method-chained function, so they should be easily discoverable. There may be others that could logically be factored out, and I'd be happy to look at PRs that refactor them out, and to quickly cut releases once merged. I was concerned about whether you might want to add pyjanitor as a dependency or not. Keeping packages nicely isolated is a good thing, especially if there are only a few functions that you need and not the whole library. That said, we all import pandas and numpy all the time anyway. :)
@UGuntupalli that sounds great to me! To make things simple, I'd advise doing one function from start to finish. The sense of accomplishment you'll have might be greater that way, seeing something through to completion. Please feel free to pick the one you sense is most urgent, and run with it! Once the pattern of development is clear, the subsequent functions should be much easier to contribute. Looking forward to seeing what you've got!
Cool! I'll hold off on deciding whether to add pyjanitor as a dependency for now.
@UGuntupalli sounds good. One last thing I'd like to add to this discussion, after thinking about it a little more: I may have stumbled on a nice interface for this in pandera.
Consider a decorator called something like `pa.parser`:

```python
import numpy as np
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "col1": pa.Column(checks=pa.Check.greater_than_or_equal_to(0, name="positive")),
    "col2": pa.Column(checks=pa.Check.isin(["a", "b", "c"], name="category_abc")),
})

@pa.parser(schema)
def clean_data(df, failed):
    """
    :param df: dataframe to clean
    :param failed: passed in by the `pa.parser` decorator. A boolean dataframe
        with the same index as df, where columns are check names. True indicates
        failure cases.
    """
    clean_df = (
        # replace negative values with nans
        df.update_where(failed["positive"], "col1", np.nan)
        # filter out records with unknown categories
        .filter_on(failed["category_abc"], complement=True)
    )
    return clean_df

def load_data(file):
    return (
        pd.read_csv(file)
        .pipe(clean_data)
        # visualization, modeling, etc.
        ...
    )
```
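The `pa.parser` decorator is only proposed at this point, but the `failed` boolean frame it would pass to `clean_data` can be emulated directly with plain pandas. The check names (`positive`, `category_abc`) follow the sketch above; the data and cleaning steps here are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1.0, -2.0, 3.0], "col2": ["a", "z", "b"]})

# Boolean frame of failure cases: one column per "check", True = failed
failed = pd.DataFrame({
    "positive": ~(df["col1"] >= 0),
    "category_abc": ~df["col2"].isin(["a", "b", "c"]),
})

clean_df = df.copy()
# nan-out values that failed the "positive" check
clean_df.loc[failed["positive"], "col1"] = np.nan
# drop records that failed the category check
clean_df = clean_df.loc[~failed["category_abc"]]
```

This separates *detecting* bad data (the `failed` frame, which could feed an error report) from *fixing* it (the two cleaning steps), which is the split the proposed interface formalizes.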
It's possible that this may be a little over-engineered 😉 but the more I consider this example, the more I realize it would be really quite helpful in my own work. There would have to be a few changes to pandera to support it.
@cosmicBboy,
I just came across pyjanitor as I was trying to do some research on publicly available data-cleaning libraries in Python. First of all, thanks a lot for all the effort that has gone into developing this package and the continued effort that goes into maintaining it.

I would like to propose the addition of a new segment of methods or functions, let us call it time_series_data_cleaning, that adds some metadata about the quality of the data and makes performing data cleaning very easy. If by any chance you don't feel that these belong here, could you suggest any similar packages you might have come across which offer this functionality?

Example API

Here are some basic functions that I think would be valuable to have in this section:

This package (https://pecos.readthedocs.io/en/latest/installation.html) offers some of the functionality I am talking about; however, some of the filters are slow and it doesn't have the great extensibility that pyjanitor already has.
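To make the "boolean flags about data quality" idea from the title concrete, a tiny pandas-only sketch: each quality check adds a flag column rather than mutating the data. The column names, sample values, and thresholds below are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"temp": [20.5, 21.0, 21.0, 99.9, 19.8]})

# Each quality check becomes its own boolean column; the data stay untouched
df["flag_out_of_range"] = ~df["temp"].between(-40, 60)
df["flag_stuck"] = df["temp"].diff() == 0

# Downstream code can then filter, report, or visualize the suspect rows
n_suspect = int((df["flag_out_of_range"] | df["flag_stuck"]).sum())
```

Because the flags live alongside the data, the cleaning decision (drop, impute, or keep) is deferred to the user, which matches the metadata-first proposal above.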