
Schema validation takes much longer when the dataset contains many NULL values #652


Closed
Lavi2015 opened this issue Oct 13, 2021 · 12 comments
Labels
question Further information is requested

Comments

@Lavi2015

Lavi2015 commented Oct 13, 2021

Hi,
I am trying to use pandera for schema validation and it works fine and completes within seconds most of the time.
My raw dataset is around 250 MB, the dataframe shape is (594384, 42), and I am testing with nullable=False on a few columns to check performance.

When I try to run schema validation on a column where 594338 out of 594384 records are null (essentially only 46 rows have data), and hence 594338 rows need to be written to rejected.csv, my script never completes. I have observed similar behaviour on a few other columns during testing: the more NULL rows there are, the longer the schema check takes. I don't have a big-data platform to develop against, so I may not be able to try Fugue.

As another example from the same dataset: when the number of NULL values in one particular column is around 169337 (out of 594384 rows), it took almost 17 minutes for pandera to complete the schema validation. I am basically doing lazy validation and segregating all failed rows into a dataframe for further evaluation.

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Schema errors and failure cases:")
    err.failure_cases.to_csv("rejected.csv", index=False)

I am trying to automate the whole process in an AWS Lambda function: download the data from S3, run the schema validation, and write rejected.csv (if any) back to S3. Hence the time taken to complete pandera schema validation is crucial.
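For reference, roughly the kind of Lambda handler I have in mind (just a sketch; the bucket names, keys, and the placeholder schema below are made up):

import boto3
import pandas as pd
import pandera as pa

s3 = boto3.client("s3")

# Placeholder schema; the real one covers all 42 columns.
schema = pa.DataFrameSchema({"some_column": pa.Column(float, nullable=False)})

def handler(event, context):
    # Bucket and key names are placeholders; the real ones would come from the event payload.
    s3.download_file("my-input-bucket", "raw/data.csv", "/tmp/data.csv")
    df = pd.read_csv("/tmp/data.csv", low_memory=False)

    try:
        schema.validate(df, lazy=True)
        return {"status": "valid"}
    except pa.errors.SchemaErrors as err:
        # Write the failure cases back to S3 for further evaluation.
        err.failure_cases.to_csv("/tmp/rejected.csv", index=False)
        s3.upload_file("/tmp/rejected.csv", "my-output-bucket", "rejected/rejected.csv")
        return {"status": "rejected", "n_failures": len(err.failure_cases)}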

Do you have any suggestions to optimize the schema validation? In my testing it only takes a long time when the dataset contains many null values, so this could be a significant issue to resolve.

Thanks and Regards

@Lavi2015 Lavi2015 added the question Further information is requested label Oct 13, 2021
@cosmicBboy
Collaborator

Hi @Lavi2015, thanks for the question! This looks like a pretty nasty performance issue. I'm trying to reproduce it, and it would help if you could provide me with:

  1. an equivalent example of the schema that you're using; ideally it would have all the data types and checks that you're trying to apply.
  2. a representative toy dataset (written as a pandas.DataFrame literal or csv string) that I would be able to scale up with pd.concat([df] * n) (a rough sketch of what I mean is below).
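For example, something along these lines (a hypothetical sketch; the column names and values here are just placeholders):

import numpy as np
import pandas as pd

# Small representative frame with the same dtypes and null pattern as the real data.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "amount": [1.5, np.nan, np.nan],                     # mostly-null float column
    "count": pd.array([1, None, None], dtype="Int64"),   # nullable integer column
})

# Scaled up to stress-test validation runtimes.
big = pd.concat([toy] * 200_000, ignore_index=True)
print(big.shape)  # (600000, 3)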

@cosmicBboy
Collaborator

I was able to reproduce your issue with the following script:

import time

import pandas as pd
import pandera as pa
from matplotlib import pyplot as plt


schema = pa.DataFrameSchema({"foo": pa.Column(float, nullable=False)})
times = {}

n_datapoints = [10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]
for n in n_datapoints:
    df = pd.DataFrame({"foo": [None] * n + [1.0] * 10}).astype(float)
    start = time.time()
    try:
        print(f"validating df with {n} datapoints")
        print(df.head())
        print(df.dtypes)
        print(schema.validate(df, lazy=True))
    except Exception as e:
        print(e.failure_cases.shape)
    finally:
        runtime = time.time() - start
        print(f"time: {runtime}\n")
        times[n] = runtime

fig, ax = plt.subplots()
series = pd.Series(times)

linear_scaling = []
for n in n_datapoints:
    scaling_factor = n / n_datapoints[0]
    linear_scaling.append(series[n_datapoints[0]] * scaling_factor)
linear_scaling = pd.Series(linear_scaling, index=n_datapoints)
print(series)
print(linear_scaling)
series.plot(ax=ax, logx=True, logy=True, label="pandera validation runtime")
linear_scaling.plot(ax=ax, logx=True, logy=True, label="linear scaling")
ax.set_xlabel("n datapoints")
ax.set_ylabel("seconds")
plt.legend()
plt.savefig("foo_runtimes_2.png")
plt.close(fig)

The problem occurred right here: https://github.com/pandera-dev/pandera/blob/master/pandera/errors.py#L107-L108

Since NA values are not equal to each other, the failure_cases aggregation produced a list of all the failure cases instead of collapsing them into a single NA value.
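To illustrate the underlying behavior (a toy demonstration, not pandera internals):

import numpy as np
import pandas as pd

# NaN compares unequal to itself, so equality-based aggregation sees every
# null failure case as a distinct value...
print(np.nan == np.nan)  # False

# ...whereas pandas' drop_duplicates treats NaNs as duplicates of each other
# and collapses them into a single failure case.
print(pd.Series([np.nan] * 5 + [1.0]).drop_duplicates())  # one NaN row, one 1.0 row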

I'll push up a fix for this, but for posterity, I did some basic profiling:

Before fix: Awful scaling for cases with a lot of null values

[plot: foo_runtimes_before]

After fix: Sublinear scaling

[plot: foo_runtimes_after]

As you can see, runtime grows with the number of datapoints in both cases, but before the fix it blows up much more quickly than after.

To give a better sense of runtimes at really large numbers of datapoints (which would take too long for the before case) here's a plot of the runtime after the fix from 10 to 10^8 datapoints:

[plot: foo_runtimes_after_scaling]

cosmicBboy added a commit that referenced this issue Oct 14, 2021
fixes #652

This PR fixes an issue where setting `lazy=True` with a schema
where `nullable=False` and there are a lot of null values causes
severe performance issues in the ~500,000-row dataframe case.

The fix is to drop duplicates when aggregating failure cases
and to remove unnecessary data processing of lazily collected
failure cases.
cosmicBboy added a commit that referenced this issue Oct 14, 2021
* improve lazy validation performance for nullable cases

fixes #652

This PR fixes an issue where setting `lazy=True` with a schema
where `nullable=False` and there are a lot of null values causes
severe performance issues in the ~500,000-row dataframe case.

The fix is to drop duplicates when aggregating failure cases
and to remove unnecessary data processing of lazily collected
failure cases.

* reintroduce sorting/dropping of duplicates
@Lavi2015
Author

Lavi2015 commented Oct 14, 2021

Hi @cosmicBboy, thanks for your response; I appreciate the details.
Please let me know when this fix is pushed. I will test and update here. Thanks again!

@cosmicBboy
Collaborator

cosmicBboy commented Oct 14, 2021

Hi @Lavi2015, the fix #655 should be available now on the dev branch. Please check it out via a development installation (make sure to check out the dev branch) and let me know if that fixes your issue!

@Lavi2015
Author

Lavi2015 commented Oct 20, 2021

Hi @cosmicBboy, thank you so much for the release.
I installed the dev version and completed testing on the same dataset mentioned earlier:

As another example from the same dataset: when the number of NULL values in one particular column is around 169337 (out of 594384 rows), it took almost 17 minutes for pandera to complete the schema validation. I am basically doing lazy validation and segregating all failed rows into a dataframe for further evaluation.

With the latest dev version, it's taking around 15 mins to complete the schema check. Please let me know if you need any further details. Thanks again!

@cosmicBboy
Collaborator

To see if there's any way to speed things up, can you please provide:

  1. an equivalent example of the schema that you're using; ideally it would have all the data types and checks that you're trying to apply.
  2. a representative toy dataset (written as a pandas.DataFrame literal or csv string) that I would be able to scale up with pd.concat([df] * n) and that produces the same runtime behavior.

On your side, just to make sure we have similar runtimes, could you run the code below and copy-paste the output here?

import time

import pandas as pd
import pandera as pa
from matplotlib import pyplot as plt


schema = pa.DataFrameSchema({
    "\d+": pa.Column(float, nullable=False, regex=True)
})
times = {}
df_allocation_times = {}

n_datapoints = [10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]
for n in n_datapoints:
    start = time.time()
    df = pd.DataFrame(
        {
            i: [None] * n + [1.0] * 10
            # 50 columns, n + 10 rows
            for i in range(50)
        }
    ).astype(float)
    df_allocation_times[n] = time.time() - start

    start = time.time()
    try:
        print(f"validating df with {n} x 50 datapoints")
        print(schema.validate(df, lazy=True))
    except Exception as exc:
        print(exc.failure_cases.shape)
    finally:
        runtime = time.time() - start
        print(f"time: {runtime}\n")
        times[n] = runtime

fig, ax = plt.subplots()
series = pd.Series(times)

linear_scaling = []
for n in n_datapoints:
    scaling_factor = n / n_datapoints[0]
    linear_scaling.append(series[n_datapoints[0]] * scaling_factor)
linear_scaling = pd.Series(linear_scaling, index=n_datapoints)
print("df allocation time")
print(pd.Series(df_allocation_times))
print("pandera scaling")
print(series)
print("linear scaling")
print(linear_scaling)

series.plot(ax=ax, logx=True, logy=True, label="pandera validation runtime")
linear_scaling.plot(ax=ax, logx=True, logy=True, label="linear scaling")
ax.set_xlabel("n datapoints")
ax.set_ylabel("seconds")
plt.legend()
plt.savefig("foo_runtimes_2.png")
plt.close(fig)

What I'm getting is:

df allocation time
10            0.001907
100           0.003104
1000          0.017631
10000         0.152901
100000        1.441880
1000000      15.270333
10000000    185.323959
dtype: float64
pandera scaling
10           0.094389
100          0.019288
1000         0.021592
10000        0.038878
100000       0.227554
1000000      3.134760
10000000    36.452018
dtype: float64
linear scaling
10              0.094389
100             0.943890
1000            9.438896
10000          94.388962
100000        943.889618
1000000      9438.896179
10000000    94388.961792
dtype: float64

@Lavi2015
Author

Lavi2015 commented Oct 22, 2021

Hi @cosmicBboy, thank you so much for your response, and sorry for the delay.

Please find the schema and test data below.

import pandas as pd
import pandera as pa
import os
from pandera import Check, Column, DataFrameSchema

df = pd.read_csv("test_data.csv", index_col=None, low_memory=False)
schema = pa.DataFrameSchema(
    columns={
		"col1" : Column(str, nullable=True),
		"col2" : Column(str),
		"col3" : Column(str),
		"col4" :  Column(str),
		"col5" :  Column(str),
		"col6" : Column(str, nullable=True),
		"col7" : Column(int, nullable=True),
		"col8" : Column(float, nullable=True),
		"col9" :  Column(float, nullable=True),
		"col10" : Column(str, nullable=True),
		"col11" :  Column(int, nullable=True),
		"col12" :  Column(int, nullable=True),
		"col13" : Column(str, nullable=True),
		"col14" :  Column(str, nullable=True),
		"col15" :  Column(str, nullable=True),
		"col16" :  Column(str, nullable=True),
		"col17" : Column(str, nullable=True),
		"col18" : Column(float, nullable=True),
		"col19" : Column(float, nullable=True),
		"col20" : Column(str, nullable=True),
		"col21" :  Column(float, nullable=True),
		"col22" :  Column(float, nullable=True),
		"col23" :  Column(str, nullable=True),
		"col24" : Column(float, nullable=True),
		"col25" : Column(float, nullable=True),
		"col26" : Column(float, nullable=True),
		"col27" :  Column(float, nullable=True),
		"col28" :  Column(float, nullable=True),
		"col29" : Column(str, nullable=True),
		"col30" : Column(str, nullable=True),
		"col31" : Column(float, nullable=True),
		"col32" : Column(float, nullable=True),
		"col33" : Column(float, nullable=True),
		"col34" : Column(float, nullable=True),
		"col35" :  Column(float, nullable=True),
		"col36" :  Column(float, nullable=True),
		"col37" :  Column(float, nullable=True),
		"col38" :  Column(float, nullable=True),
		"col39" :  Column(str, nullable=True),
		"col40" : Column(str, nullable=True),
		"col41" : Column(str, nullable=True),
		"col42" :  Column(float, nullable=True)				    
    strict=True,
)

try:
    schema.validate(df, lazy=True)
    print("Schema validation is completed successfully")
except pa.errors.SchemaErrors as err:
    err.failure_cases.to_csv("rejected.csv", index=False)

I also ran your script on our server, but the program always hangs after a while and only partial results were generated, as shown below. If I manage to complete it, I will send all the results soon.

validating df with 10 x 50 datapoints
(10, 6)
time: 0.025631189346313477

validating df with 100 x 50 datapoints
(100, 6)
time: 0.02646636962890625

validating df with 1000 x 50 datapoints
(1000, 6)
time: 0.05078434944152832

validating df with 10000 x 50 datapoints
(10000, 6)
time: 2.2145442962646484

validating df with 100000 x 50 datapoints
(100000, 6)
time: 287.906635761261

validating df with 1000000 x 50 datapoints
(attached: test_data.csv)

@cosmicBboy
Collaborator

Hi @Lavi2015, thanks for the data and info! I'll see if I can reproduce your runtimes on my end.

From the partially completed output, it appears as though these are numbers from before the dev fix:

validating df with 100000 x 50 datapoints
(100000, 6)
time: 287.906635761261 (seconds)

vs.

100000       0.227554 (seconds)

Are you certain you have the right version installed? Can you try installing with:

pip install git+git://github.com/pandera-dev/pandera@dev
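You can also double-check which pandera build is actually being imported with something like this (a quick sketch):

import pandera as pa

# Print the installed version and the module path to confirm the dev
# checkout (rather than the released package) is the one being imported.
print(pa.__version__)
print(pa.__file__)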

@Lavi2015
Author

@cosmicBboy , Thanks for your response.

Apologies, please ignore my earlier stats; I ran them with the current release.
I am not able to install using pip install git+git://github.com/pandera-dev/pandera@dev due to a timeout error.

I have installed it as described under development installation.

The script hangs after producing partial results, as shown below.
validating df with 10 x 50 datapoints
(10, 6)
time: 0.02638387680053711

validating df with 100 x 50 datapoints
(100, 6)
time: 0.02531576156616211

validating df with 1000 x 50 datapoints
(1000, 6)
time: 0.05164813995361328

validating df with 10000 x 50 datapoints
(10000, 6)
time: 2.192143440246582

validating df with 100000 x 50 datapoints
(100000, 6)
time: 260.25472021102905

validating df with 1000000 x 50 datapoints

There is only a slight change in time, from 287.906635761261 seconds to 260.25472021102905 seconds.

@cosmicBboy
Collaborator

cosmicBboy commented Oct 23, 2021

Hi @Lavi2015, I'll try to run your example data and schema, though I still suspect something's up with your pandera installation.

Just to triple check, did you do:

git clone https://github.com/pandera-dev/pandera.git
cd pandera
git checkout dev  # << this is important
pip install -r requirements-dev.txt
pip install -e .

@Lavi2015
Author

Lavi2015 commented Oct 25, 2021

Hi @cosmicBboy ,
Thank you so much for providing the installation steps.
As I mentioned earlier, I had followed the steps under development installation.

Now I have reinstalled following the steps you provided, and all my test cases complete in under 25 seconds. I tested with up to 1 million null values in the dataset.

git clone https://github.com/pandera-dev/pandera.git
cd pandera
git checkout dev  # << this is important
pip install -r requirements-dev.txt
pip install -e .

It would be great if you could let me know whether this fix will be available in the next release, and if so, a tentative timeline for it.

Thanks again for your timely help!

@cosmicBboy
Collaborator

Hi @Lavi2015, great! I'm planning on cutting a new minor release, 0.8.0, in the next week or two, so this fix should be available with that release.

cosmicBboy added a commit that referenced this issue Nov 11, 2021
* improve lazy validation performance for nullable cases (fixes #652)
cosmicBboy added a commit that referenced this issue Nov 11, 2021
* improve lazy validation performance for nullable cases (fixes #652)