
Add ability to generate boolean flags about data quality #703


Closed
UGuntupalli opened this issue Jul 21, 2020 · 11 comments

@UGuntupalli
Contributor

UGuntupalli commented Jul 21, 2020

I just came across pyjanitor as I was trying to do some research around publicly available data-cleaning libraries in Python. First of all, thanks a lot for all the effort that has been made in developing this package and the continued effort that goes into maintaining it.

I would like to propose a new group of methods or functions, let us call it time_series_data_cleaning, that adds metadata about the quality of the data and makes data cleaning easy. If by any chance you feel these don't belong here, could you point me to any similar packages you have come across that may offer this functionality?

Example API

Here are some basic functions that I think would be valuable to have in this section:

  1. Missing Timestamps - given the resolution of the dataframe, returns a boolean flagging any timestamps that are found to be missing
  2. Monotonic Timestamps - returns a boolean indicating whether the dates in the data frame are monotonic
  3. Duplicate Timestamps - returns a boolean flagging duplicates in the data frame, with an optional argument to reindex the dataframe
  4. Range - returns a boolean flagging values in the data frame that are outside a defined range
  5. Stuck - returns a boolean flagging values in the data frame that are stuck for "n" timestamps, where "n" is a user input
  6. Jump - returns a boolean flagging values in the data frame that show arbitrary jumps from timestamp to timestamp, where an 'acceptable' range is provided by the user
# Let us say we want to check for missing timestamps
df1 = df.check_missing_timestamps(column_name='')
# By default the function checks the index of the input data frame. If column_name is provided,
# it looks at the defined column for timestamps instead of the index.
# df1 would have a single column (assuming df is not a multi-index dataframe), with True indicating a missing timestamp.

df1 = df.check_monotonic_timestamps(column_name='')  # Check for monotonicity

df1 = df.check_duplicate_timestamps(reindex=True, keep="first")
# Drops duplicates if reindex is True; keep determines whether the first or the last occurrence is retained.

# More examples below
df1 = df.check_if_data_in_range(bound=[lower_limit, upper_limit])
# df1 will have the same number of columns as df, except that they will be booleans.
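To make the proposal concrete, here is a minimal sketch of how check_missing_timestamps could be built on top of pandas. The function name, the expected_freq argument, and the return shape are all part of this proposal, not an existing API:

import pandas as pd

def check_missing_timestamps(df, expected_freq="H"):
    # Build the full expected range at the given resolution from the observed start/end
    full_range = pd.date_range(df.index.min(), df.index.max(), freq=expected_freq)
    missing = full_range.difference(df.index)
    # A single boolean column indexed by the expected timestamps; True marks a gap
    return pd.DataFrame({"is_missing": full_range.isin(missing)}, index=full_range)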

This package (https://pecos.readthedocs.io/en/latest/installation.html) offers some of the functionality I am talking about; however, some of the filters are slow, and it lacks the extensibility that pyjanitor already has.

@ghost ghost added the triage label Jul 21, 2020
@ericmjl
Member

ericmjl commented Jul 22, 2020

Hi @UGuntupalli!

Thanks for chiming in. I'm super glad you like the package 😄.

For data checks, I only recently found the package pandera, which I think is the better home for data checking as opposed to data cleaning. The distinction may feel a bit arbitrary, but please hear me out; it makes sense once you think through the acts of checking vs. cleaning.

In "checking" data, we are asserting that properties of the data are correct. I think this is what you're trying to accomplish with the code above. Thinking in deeper theoretical/probabilistic terms, I think there are parallels to evaluating the likelihood of data against an assumed probability distribution model. If the data fall outside of the support of the assumed probability model, we error out.

In "cleaning" data, we are not evaluating data against an assumed model of how the data ought to be, but rather changing the distribution and/or indexing of the data. In the R world, the "mutate" term is used as the overarching notion; in pyjanitor, we've sort of adopted the Pythonic philosophy of being explicit and provide a library of functions for that.

Early on, I was tempted to add data checks into pyjanitor, but after I saw pandera, gave it a test drive, and thought about the distinction for a few more days, I decided to defer the checks to pandera instead.

That said, this doesn't preclude some good ideas you've raised becoming a janitor.timeseries suite of data cleaning functions.

I have a few function ideas inspired by what you wrote. Perhaps you can critique them, and if you're up for contributing, I'd be happy to help onboard you through the development process to get them in. (It's designed to be simultaneously beginner-friendly to get started with and beginner-educational, in that we adhere to "good software development practices" that most beginners won't have seen. If you're seasoned with code, you'll be right at home!) And if you have other ideas for the cleaning part of data analysis, please feel free to suggest them!

Idea 1: .sort_monotonically()

Basically a much more descriptive alias for .sort_values() or .sort_index(). Is it idiomatic for timeseries data to have the timestamps as the index? If so, it's worthwhile just delegating by default to .sort_index().
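A minimal sketch of that delegation, assuming the timestamps are in the index (names and defaults are placeholders, not a settled API):

import pandas as pd

def sort_monotonically(df: pd.DataFrame, increasing: bool = True) -> pd.DataFrame:
    # Assumes a DatetimeIndex; simply delegates to sort_index
    return df.sort_index(ascending=increasing)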

Idea 2: .get_jumps()

Return the rows where there are jumps between the timestamps, a bit like .get_dupes().
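A minimal sketch of what this could look like, assuming a DatetimeIndex and a user-supplied maximum acceptable gap (both assumptions, not a settled API):

import pandas as pd

def get_jumps(df: pd.DataFrame, max_gap: str = "1H") -> pd.DataFrame:
    # A "jump" here is a gap from the previous timestamp larger than max_gap
    gaps = df.index.to_series().diff() > pd.Timedelta(max_gap)
    return df.loc[gaps]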

Idea 3: An extension of get_jumps() to get_()

Basically letting you inspect the rows that are problematic.

What do you think? (I had to cut my thoughts to join a meeting, so please chime in with your ideas!)

Timeseries stuff is on my radar, as I've never had to deal with data cleaning issues there before. I'd love to share and learn from you too!

@UGuntupalli
Contributor Author

@ericmjl,
Thank you for your detailed response. I completely agree with the distinction that you have made between data_cleaning (an action) vs data_checking (a question/ logical test / inspection). I did take a brief look at pandera and it does look like a good home for the functionality I have suggested.
As for the ideas you have, could you clarify whether you expect the user to manually check if it is worth performing an action? Let me try to explain this with an example.

import pandas as pd
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
print(df)

The above code would generate something like the dataframe shown below. Now, before the user uses the function .sort_monotonically(),

  1. Do we expect the user to check if it is worth calling the function ?
  2. Do we lean on pandera like package to test that when the function is called ?
  3. Do we implement that logic ourselves ?
      date               data
0   2018-01-01 00:00:00    67
1   2018-01-01 01:00:00    11
2   2018-01-01 02:00:00    75
3   2018-01-01 03:00:00    78
4   2018-01-01 04:00:00    35
..                  ...   ...
164 2018-01-07 20:00:00    19
165 2018-01-07 21:00:00    45
166 2018-01-07 22:00:00    14
167 2018-01-07 23:00:00    67
168 2018-01-08 00:00:00    96

Based on your explanation, I would lean towards (1) or (2), but I am curious to hear your thoughts. As for .get_jumps(), my question would be: how will the user be able to use the result? The user gets much more power if the result of .get_jumps() is a boolean Series or DataFrame, typically the size of the input data, to enable visualization of the filter's impact. Just returning the rows may not be completely helpful unless the user also receives a dataframe of booleans to enable boolean indexing (see the sketch below). This leads back to the previous question: do we want to outsource that functionality to a different package, or assume the user is responsible for it?
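To illustrate the boolean-mask usage I am arguing for, here is a generic pandas sketch (not an existing pyjanitor API):

import pandas as pd

df = pd.DataFrame({"data": [5, 50, 95]})
mask = df["data"].between(10, 90)  # boolean Series, same index as df
out_of_range = df[~mask]           # rows to inspect or visualize
cleaned = df[mask]                 # rows to keep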

Lastly, as for contributing to the code base: while I can't make a firm time commitment, if you can show me the ropes, I am happy to start slow with one function at a time. But I am curious to hear your thoughts on the above.

Cheers
Uday

@UGuntupalli
Contributor Author

UGuntupalli commented Jul 22, 2020

@ericmjl :
Here is an example for .sort_monotonically()

import pandas as pd
from random import random

# Build a random data set
ts_index = pd.date_range('1/1/2000', periods=1000, freq='T')
v1 = [random() for i in range(1000)]
v2 = [random() for i in range(1000)]
v3 = [random() for i in range(1000)]
ts_df = pd.DataFrame({'v1':v1,'v2':v2,'v3':v3},index=ts_index)

# Test for timestamps
def test_for_monotonicity(input_data, column_name=None, direction='increasing'):
    """
    Tests whether the pd.DataFrame is monotonically increasing or decreasing,
    depending on the direction for which the test needs to be performed.

    :param pd.DataFrame input_data: dataframe to be tested
    :param str column_name: needs to be specified if and only if the datetime is not in the index. Defaults to None
    :param str direction: specifies the direction in which monotonicity is tested. Defaults to 'increasing'
    :return: single boolean flag indicating whether the test has passed or not
    :rtype: bool
    """
    # Test if the input is a data frame
    assert isinstance(input_data, pd.DataFrame), "input_data must be a pandas dataframe"

    # Pull out the timestamp series without mutating the input
    if column_name is None:
        assert isinstance(input_data.index, pd.DatetimeIndex), "index must be a pd.DatetimeIndex"
        timestamps = input_data.index.to_series()
    else:
        assert pd.api.types.is_datetime64_any_dtype(input_data[column_name]), "column_name must hold timestamps"
        timestamps = input_data[column_name]

    # Test for monotonicity (is_monotonic_increasing/decreasing are properties, not methods)
    if direction == 'increasing':
        return timestamps.is_monotonic_increasing
    elif direction == 'decreasing':
        return timestamps.is_monotonic_decreasing
    else:
        raise ValueError("Argument direction accepts only 'increasing' or 'decreasing'")


def sort_monotonically(input_data, column_name=None, direction='increasing'):
    """
    Sorts the input dataframe monotonically in the desired direction. It assumes that the dataframe
    has a pd.DatetimeIndex as its index, unless column_name is given.

    :param pd.DataFrame input_data: dataframe to be sorted
    :param str column_name: needs to be specified if and only if the datetime is not in the index. Defaults to None
    :param str direction: specifies the direction in which to sort. Defaults to 'increasing'
    :return: dataframe sorted by its timestamps
    :rtype: pd.DataFrame
    """
    test_monotonicity = test_for_monotonicity(
        input_data=input_data,
        column_name=column_name,
        direction=direction
    )

    # Already sorted in the requested direction; nothing to do
    if test_monotonicity:
        return input_data

    # Sort in the requested direction (ascending for 'increasing', descending otherwise)
    ascending = direction == 'increasing'
    if column_name:
        return input_data.sort_values(column_name, ascending=ascending)
    return input_data.sort_index(ascending=ascending)

@ericmjl
Member

ericmjl commented Jul 24, 2020

Nice, @UGuntupalli! Thanks for pasting in the "strawman" pieces.

Sorry for the delay in getting back, I suddenly found myself with 5 talks in 8 days, with no idea how I got into that situation ^_^", but my OSS Fridays are back in order, so let me try to address your questions.

Do we expect the user to check if it is worth calling the function ?
Do we lean on pandera like package to test that when the function is called ?
Do we implement that logic ourselves ?

Now that I've seen pandera, I see it in the context of "good workflow practices". Here's an example. In "logical" order, we start with a pandera.DataFrameSchema:

# project_source/data/schemas.py
import pandera as pa
some_schema = pa.DataFrameSchema(...)

Then, we have a "data loading function", in which all of the data processing steps are also encoded. This provides a nice shortcut to our dataset.

# project_source/data/loaders.py

import pandas as pd
import janitor
from .schemas import some_schema
from pandera import check_output

@check_output(some_schema)
def load_some_data():
    data = (
        pd.read_csv(...)
        .some_data_cleaning_funcs(...)
        .some_other_data_cleaning_func(...) # keep chaining!
    )
    return data

With pandera checking our data loading functions, there are runtime checks on every load. (If the checks get too slow, one can make them optional rather than automatic by removing the decorator.)

Now, in reality, our workflow isn't so linear. The reality is that we might prototype the data loading function load_some_data() inside a notebook, try to get it right, and then stick it in our data loading function library so that we have a suite of "shortcuts" to the exact data file we need.

So I think what might actually happen is that a user loads some data, finds that it's dirty, cleans it up a bit using some mix of pandas and pyjanitor functions, and then sticks the result in the data loading function... and rinse and repeat. At least, this has been the workflow I find myself in. That's where the time series data cleaning functions can help, in the interactive cleaning phase; when the cleaning feels lightweight and satisfying, those functions can also become part of the data loading function.


Meanwhile, I like your sort_monotonically() implementation. There are pieces that could be changed to better fit the coding idioms in the pyjanitor function library, such as adding type annotations and removing types from docstrings. We should also get consensus on the exact names used for args/kwargs and their options, as being thoughtful and considerate to non-native English speakers in our naming conventions is part of pyjanitor's API design philosophy. (Also, once those decisions are made, they tend to be sticky and hard to change without breaking people's code.) Would you like to work through getting it in? If so, here are the starting points.

Firstly, fork the repository to your GitHub username. At the moment, only core contributors with commit rights are given access to the main repo, so newcomer contributors use forks as a sandbox isolated from the main repo.

There are setup instructions at https://pyjanitor.readthedocs.io/contributing.html. A lot of first-time contributors helped make that page, so I have a good prior that it'll help another first-time contributor, but if you find warts in there, I'd encourage you to note them down and sneak the fixes into the code PR alongside.

Others on the team should be able to chime in here if you have issues getting set up. (I'm particularly bad at Windows, for example, as I don't have access to a Windows machine. Old Windows laptops that I get my hands on very quickly become Linux laptops.)

We'll end up doing a review of the code on the PR tracker. @samukweku has been the most recent recipient of very in-depth and detailed reviews, and he knows how much we care about the code quality 😄, so just know that you might end up with a bunch of review comments from the rest of the dev team. There should be sufficient tooling to automate away some of the picky stuff, like code formatting. Reviews might take a bit of time, as we're all working on this on a volunteer basis, but I generally try to get in my review within 2-3 days of the PR being opened, and I ask for at least 2 reviews on the code, since the library has grown quite a bit. The library is intentionally developed slowly, btw, so that we can let some ideas simmer before making a decision.

And if you feel like you want to hash out the API a bit more here before working on the PR, that's not a problem either - though I'd encourage you to open the PR and work it out on the PR thread. Having something "lingering" and "open" increases the impetus for us to get things in 😄.

Alrighty, ball back in your court, @UGuntupalli! I appreciate you chiming in. Let's make something good happen!

@UGuntupalli
Contributor Author

UGuntupalli commented Jul 25, 2020

@ericmjl & @cosmicBboy:
So, I have given this some thought, and based on both of your comments, I wanted to propose a collaborative workflow and hear your thoughts and suggestions. For the proposed workflow, I draw inspiration from the pecos package (https://pecos.readthedocs.io/en/latest/installation.html), which is where I started my journey with this problem.

[Screenshot: proposed workflow diagram]

Brief description of the workflow as steps if they are not clear in the above screenshot:

  1. Step 1: Import the data into a pandas data frame using native pandas parsers like pd.read_csv, pd.read_excel, pd.read_sql etc.
  2. Step 2: Define a pandera schema for the data frame that has been imported
  3. Step 3: Using a pandera.Parser class like @cosmicBboy recommended, or a simpler collections.namedtuple, we capture the following information:
    - Raw data frame
    - Boolean masks (can be multiple and of different sizes, but will most typically be the size of the raw data, depending on the pandera schema defined), e.g.:
      - timestamp_mask
      - missing_data_mask
      - range_mask
    - Filtered data frame (placeholder empty dataframes)
  4. Step 4: Use pyjanitor methods to apply the masks and thus populate the filtered data frame. I am actually debating whether this overcomplicates things, or whether step 4 should also be contained within pandera.

However, since I don't have prior experience in delivering packages, I would lean on @ericmjl and @cosmicBboy to provide guidance to refine the proposed workflow, so I can initiate a PR and get started. Thoughts?

@ericmjl
Member

ericmjl commented Jul 26, 2020

Hi @UGuntupalli!

For the workflow, I read through it, and I found steps 3 and 4 to be unclear. In particular, "capture" is used in an imprecise fashion, and I'm unsure what action gets taken after "capturing". Additionally, what "ability to apply and populate the filtered data frame in pyjanitor" means is also unclear, as pyjanitor simply provides the API to actively modify a dataframe to clean it up. Workflow matters tend to be quite personalized to the problem, so I think I should refrain from commenting further on it.

Because I was left scratching my head over the workflow, I instead took a bit more time to read through your discussion with @cosmicBboy on unionai-oss/pandera#249, and I think I'm starting to understand where you're coming from.

The core that you're interested in is being able to report data errors, rather than simply validating data or actively cleaning data, is that right?

Reporting data errors means scientists/engineers can do some interesting things for our "clients", such as giving them HTML reports of where things went wrong. This is what pecos tries to do.

Reporting data errors requires a schema defined beforehand, which is something pandera excels at and is the appropriate place for. Data error reporting then needs a way of "flagging" the errors. I'm not yet sure how to build fully automatic functionality against a schema, but I can think of the foundational pieces that are necessary.

Thinking about this, I definitely think you have a series of PRs that could be made.

There's precedent for flag_X inside pyjanitor: we have a flag_nulls function, which adds a new column of 1/0s to the DataFrame that one can query against.
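For reference, a quick usage sketch of flag_nulls (check the pyjanitor docs for the exact current signature):

import pandas as pd
import janitor  # noqa: F401  (registers the DataFrame methods)

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", None, "z"]})
df = df.flag_nulls(column_name="any_null")  # adds a 1/0 column flagging rows with any null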

I'd suggest a PR to pyjanitor that starts with one of flag_non_monotonic, flag_jumps or flag_stuck (because you're new to OSS contribution, it's easier to start small). The interface should follow that of flag_nulls, in that it takes in a DataFrame, and returns a modified DataFrame (by default) that has an additional column of booleans.

As an example, flag_non_monotonic could look something like this:

import janitor  # noqa: F401  (provides df.add_column)
import pandas_flavor as pf


def _flag_non_monotonic(series, increasing=True, strict=False, complement=False):
    """Utility function to flag rows that are not monotonically increasing/decreasing relative to the previous row."""
    previous = series.shift(1)  # aligns each value with its predecessor

    # True where the monotonic relation holds w.r.t. the previous row
    if increasing and strict:
        check = series > previous
    elif increasing and not strict:
        check = series >= previous
    elif not increasing and strict:
        check = series < previous
    else:
        check = series <= previous

    check = ~check  # flag the rows that break monotonicity
    # NB: the first row compares against NaN, so it is always flagged;
    # callers may want to mask it out.
    if complement:
        check = ~check
    return check


@pf.register_dataframe_method
def flag_non_monotonic(df, column_to_check, new_column_name, increasing=True, strict=False, complement=False):
    """Add a boolean column that flags which rows are non-monotonic w.r.t. the previous row."""
    flags = _flag_non_monotonic(df[column_to_check], increasing=increasing, strict=strict, complement=complement)
    return df.add_column(new_column_name, flags)

As another example, flag_stuck might look like this:

def _flag_stuck(series):
    """Flag values that are identical to the previous value."""
    # shift(1) aligns each value with its predecessor; the first row
    # compares against NaN and is therefore False
    return series == series.shift(1)


@pf.register_dataframe_method
def flag_stuck(df, column_name, new_column_name):
    """Add a boolean column flagging values that are "stuck" at the previous value."""
    stuck = _flag_stuck(df[column_name])
    return df.add_column(new_column_name, stuck)

Quoting from @cosmicBboy in unionai-oss/pandera#249, the key interface that matters is this:

If one can express it in terms of a function that takes a pd.Series or pd.DataFrame as an input and outputs a boolean scalar, Series, or DataFrame, pandera.Check can support it :)

The utility functions can help where pandas does not provide the functionality; where pandas does, I would defer to it, e.g. pd.Series.is_monotonic inside a pandera check. So the end use might look like this:

schema = pa.DataFrameSchema(
    columns={
        # a Check passes where the returned boolean is True, hence the negated "stuck" flag
        "monotonic_column": pa.Column(checks=pa.Check(lambda s: s.is_monotonic)),
        "non_stuck_column": pa.Column(checks=pa.Check(lambda s: ~_flag_stuck(s))),
    }
)

In that way, both data validation and bad-data reporting can be satisfied. Depending on whether @cosmicBboy wants something that validates "stuck" values built into pandera or not (there is a consideration of keeping the package extensible rather than "god-like"), you could make a PR into pandera as well, as appropriate!

You'll notice I mentioned nothing about doing the visualization part - as I've learned, viz is hard to standardize - everybody wants their own take :). It's best to delegate this to the end-user.

I think you've got a good core of ideas for a timeseries data cleaning module, with appropriate utility functions that can be easily slotted in as a pandera.Check via lambda expressions. What do you think, are you ready to make a PR?

@cosmicBboy

cosmicBboy commented Jul 26, 2020

Hey @UGuntupalli and @ericmjl thanks for developing this idea further! I do think the strength of pandera is its flexibility with how one expresses checks, and since the pandas API already has a lot of boolean flagging functionality, most checks can be expressed really concisely with lambda functions. The main reason to add built-in Check methods is to support schema inference and yaml/python script serialization. As you can see here the built-in checks are thin wrappers around the underlying pandas functionality.

One question I have, @ericmjl, is whether it's common/idiomatic in pyjanitor to use private functions like _flag_stuck? It would be helpful in pandera to call those implementations directly without having to re-implement them.

Using a pandera.Parser class like @cosmicBboy recommended or a simpler collections.namedtuple

As for the Parser idea, I do think that it's sort of out of scope for pandera. data loading [pandas] -> cleaning [pyjanitor] -> validation [pandera] is the cleanest way to separate concerns. We can view parsing, imputation, and filtering as special cases of data cleaning, and I'm not sure pandera should be responsible for that.

One thing that pandera could provide is an interface to access the boolean results from all the Checks in the schema, which would enable the user to use the element-level check results for their own purpose.
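For what it's worth, lazy validation already gets part of the way there. A sketch, assuming a pandera version that supports lazy=True and the SchemaErrors.failure_cases attribute:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "col1": pa.Column(checks=pa.Check.greater_than_or_equal_to(0)),
})

df = pd.DataFrame({"col1": [1.0, -2.0, 3.0]})
try:
    schema.validate(df, lazy=True)  # collect all failures instead of raising on the first
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # element-level failure cases as a DataFrame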

@UGuntupalli
Contributor Author

UGuntupalli commented Jul 26, 2020

Let us see if I can take one last stab at understanding everyone's recommendations, so I don't open multiple PR's in the wrong places.

First, to clarify @ericmjl's concerns around clarity, let me try to offer a better explanation of steps 3 and 4, even though we may not really use them anymore:

  • The meaning of "capture" that I intended to convey in step 3 is three-fold:
  1. Loading/assigning raw data into an object-like structure
  2. Running data validation against the raw data, generating the booleans, and saving them to the object structure
  3. Initializing a filtered data frame which would contain the result of applying the booleans computed on the raw data, without necessarily applying them yet
  • The intent of step 4 ("ability to apply and populate the filtered data frame in pyjanitor") was to perform data filtering using the booleans on the raw data, which I think pyjanitor is well suited for, as you highlight, because of its ability to extend the pandas API to modify the data frame.

  • Additionally, the core of what I am interested in is data_checking logic through a good interface, which pyjanitor offers as an extension of the pandas API. Once the data_checking logic is established, I think data_cleaning and data_visualization can be performed with the tools that pandas and plotly offer.

  • Lastly, I appreciate and completely agree that we should not attempt to standardize the data visualization part because it is very dependent on the end-user

As for @cosmicBboy's comments: if the Parser idea feels out of scope, I am curious where to get started. Thank you for providing this concise format of expression; I wish I had thought of it. Having initially thought pandera could be the better home for the data-checking functionality, this is what I am thinking after seeing both your responses:

data loading [pandas] -> data checking [pyjanitor] -> data cleaning [pyjanitor] -> data validation [pandera]

If you both agree, here is what I am going to do:

  1. Create PRs in pyjanitor for data checking first, once I get the hang of it
  2. Then expand to data cleaning PRs in pyjanitor
  3. Finally, we can then look at one or more PRs in pandera to round off any missing pieces in the validation

Are we on the same page? As soon as you confirm, I will start opening PRs, unless there is a disagreement.

@ericmjl
Member

ericmjl commented Jul 26, 2020

Wonderful discussion, everybody 😄. This is why I like dabbling in the OSS world!

One question I have, @ericmjl, is whether it's common/idiomatic in pyjanitor to use private functions like _flag_stuck? It would be helpful in pandera to call those implementations directly without having to re-implement them.

Yes, the private functions are idiomatic! There are some floating around in the library. I believe our contributors have placed them in logical places, such as right below the method-chained function, so they should be easily discoverable. There may be others that could logically be factored out, and I'd be happy to look at PRs that refactor them out and to quickly cut releases once merged.

I was concerned about whether you might want to add pyjanitor as a dependency or not. Keeping packages isolated is a nice thing, especially if there are only a few functions that you need rather than the whole library. That said, we all import pandas and numpy all the time anyway. :)

If you both agree, here is what I am going to do:

  1. Create PRs in pyjanitor for data checking first, once I get the hang of it
  2. Then expand to data cleaning PRs in pyjanitor
  3. Finally, we can then look at one or more PRs in pandera to round off any missing pieces in the validation

@UGuntupalli that sounds great to me! To make things simple, I'd advise doing one function from start to finish. The sense of accomplishment you'll have might be greater that way, seeing something through to completion. Please feel free to pick the one you sense is most urgent, and run with it! Once the pattern of development is clear, the subsequent functions should be much easier to contribute. Looking forward to seeing what you've got!

@cosmicBboy

cosmicBboy commented Jul 26, 2020

Yes, the private functions are idiomatic!

Cool! I'll hold off on deciding whether or not to add pyjanitor to pandera's deps, depending on how things play out with the current plan. If anything, a user who has both pyjanitor and pandera in their env can create Checks using those private functions anyway.
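As a sketch of that, assuming the _flag_stuck helper proposed above lands in pyjanitor (the import path below is purely illustrative):

import pandera as pa
# Hypothetical import: _flag_stuck is the private helper proposed earlier in this
# thread; it is not (yet) part of pyjanitor, and the module path is a placeholder.
from janitor.timeseries import _flag_stuck

# A Check passes where the boolean is True, so negate the "stuck" flag
not_stuck = pa.Check(lambda s: ~_flag_stuck(s))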

If you both agree, here is what I am going to do:

  1. Create PRs in pyjanitor for data checking first, once I get the hang of it
  2. Then expand to data cleaning PRs in pyjanitor
  3. Finally, we can then look at one or more PRs in pandera to round off any missing pieces in the validation

@UGuntupalli sounds good.

One last thing I'd like to add to this discussion, after thinking about it a little more, is that I may have stumbled on a nice interface for pandera to fulfill this use case:

Running Data Validation against the raw data and generating the booleans and saving them to the object structure

Consider a decorator called something like parser:

import janitor  # noqa: F401  (registers update_where / filter_on)
import numpy as np
import pandas as pd
import pandera as pa


schema = pa.DataFrameSchema({
    "col1": pa.Column(checks=pa.Check.greater_than_or_equal_to(0, name="positive")),
    "col2": pa.Column(checks=pa.Check.isin(["a", "b", "c"], name="category_abc"))
})


@pa.parser(schema)
def clean_data(df, failed):
    """
    :param df: dataframe to clean
    :param failed: passed in by `pa.parser` decorator. A boolean dataframe with
        the same index as df, where columns are check names. True indicates
        failure cases.
    """
    clean_df = (
        # replace negative values with nans
        df.update_where(failed["positive"], "col1", np.nan)
        # filter out records with unknown categories
        .filter_on(failed["category_abc"], complement=True)
    )
    return clean_df


def load_data(file):
    return (
        pd.read_csv(file)
        .pipe(clean_data)
        # visualization, modeling, etc.
        ...
    )

What pa.parser does is basically combine check_input and check_output with some extra semantics:

  1. the schema is validated on the decorated function's input, producing the boolean vectors that @UGuntupalli needs to implement steps 3 and 4
  2. the parser decorator then passes in the results in a dataframe called failed in this example
  3. the function body clean_data is responsible for cleaning the data so that those failure cases are amended somehow
  4. the parser decorator then re-executes data validation to make sure the function implements the correct data cleaning logic.

It's possible that this may be a little over-engineered 😉 but the more I consider this example the more I realize that this would be really quite helpful in my own work. There would have to be a few changes to pandera to fulfill this functionality, let me know if y'all have any thoughts on this idea!

@UGuntupalli
Contributor Author

@cosmicBboy,
I am going to start with some PRs to implement the basic functionality. You and @ericmjl are welcome to guide me to either adopt the pandera Parser structure you are proposing, or stick with the PRs after they pass code review in pyjanitor. I don't have any further suggestions on this; if neither of you do either, I will close the issue.
