
Schema validation takes much longer when the dataset contains many NULL values #652


Closed
Lavi2015 opened this issue Oct 13, 2021 · 12 comments
Labels
question Further information is requested

Comments

@Lavi2015

Lavi2015 commented Oct 13, 2021

Hi,
I am trying to use pandera for schema validation and it works fine and completes within seconds most of the time.
My raw dataset is around 250 MB, the dataframe shape is (594384, 42), and I am testing with nullable=False on a few columns to check performance.

When I try to run schema validation on a column where 594338 out of 594384 records are null (essentially only 46 rows have data), and hence 594338 rows need to be written to rejected.csv, my script never completes. I have observed similar behaviour on a few other columns during testing: the more NULL rows there are, the longer the schema check takes. I don't have a big-data platform to develop against, so I may not be able to try Fugue.

As another example from the same dataset: when the number of NULL values in one particular column is around 169337 (out of 594384 rows), it took almost 17 minutes for pandera to complete the schema validation. I am basically doing lazy validation and segregating all failed rows into a dataframe for further evaluation.

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Schema errors and failure cases:")
    err.failure_cases.to_csv("rejected.csv", index=False)

I am trying to automate the whole process in an AWS Lambda function: download the data from S3, run the schema validation, and write rejected.csv (if any) back to S3. Hence the time taken to complete pandera schema validation is crucial.
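For reference, roughly the kind of Lambda handler I have in mind (just a sketch; the bucket names, keys, and the placeholder schema below are made up):

import boto3
import pandas as pd
import pandera as pa

s3 = boto3.client("s3")

# Placeholder schema; the real one covers all 42 columns.
schema = pa.DataFrameSchema({"some_column": pa.Column(float, nullable=False)})

def handler(event, context):
    # Bucket and key names are placeholders; the real ones would come from the event payload.
    s3.download_file("my-input-bucket", "raw/data.csv", "/tmp/data.csv")
    df = pd.read_csv("/tmp/data.csv", low_memory=False)

    try:
        schema.validate(df, lazy=True)
        return {"status": "valid"}
    except pa.errors.SchemaErrors as err:
        # Write the failure cases back to S3 for further evaluation.
        err.failure_cases.to_csv("/tmp/rejected.csv", index=False)
        s3.upload_file("/tmp/rejected.csv", "my-output-bucket", "rejected/rejected.csv")
        return {"status": "rejected", "n_failures": len(err.failure_cases)}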

Do you have any suggestions to optimize the schema validation? In my testing it only takes a long time when the dataset contains many null values, so this could be a significant issue to resolve.

Thanks and Regards

@Lavi2015 Lavi2015 added the question Further information is requested label Oct 13, 2021
@cosmicBboy
Collaborator

Hi @Lavi2015, thanks for the question! This looks like a pretty nasty performance issue. I'm trying to reproduce it, and it would help if you could provide me with:

  1. an equivalent example of the schema that you're using; ideally it would have all the data types and checks that you're trying to apply.
  2. a representative toy dataset (written as a pandas.DataFrame literal or csv string) that I would be able to scale up with pd.concat([df] * n) (a rough sketch of what I mean is below).
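For example, something along these lines (a hypothetical sketch; the column names and values here are just placeholders):

import numpy as np
import pandas as pd

# Small representative frame with the same dtypes and null pattern as the real data.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "amount": [1.5, np.nan, np.nan],                     # mostly-null float column
    "count": pd.array([1, None, None], dtype="Int64"),   # nullable integer column
})

# Scaled up to stress-test validation runtimes.
big = pd.concat([toy] * 200_000, ignore_index=True)
print(big.shape)  # (600000, 3)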

@cosmicBboy
Collaborator

I was able to reproduce your issue with the following script:

import time

import pandas as pd
import pandera as pa
from matplotlib import pyplot as plt


schema = pa.DataFrameSchema({"foo": pa.Column(float, nullable=False)})
times = {}

n_datapoints = [10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]
for n in n_datapoints:
    df = pd.DataFrame({"foo": [None] * n + [1.0] * 10}).astype(float)
    start = time.time()
    try:
        print(f"validating df with {n} datapoints")
        print(df.head())
        print(df.dtypes)
        print(schema.validate(df, lazy=True))
    except Exception as e:
        print(e.failure_cases.shape)
    finally:
        runtime = time.time() - start
        print(f"time: {runtime}\n")
        times[n] = runtime

fig, ax = plt.subplots()
series = pd.Series(times)

linear_scaling = []
for n in n_datapoints:
    scaling_factor = n / n_datapoints[0]
    linear_scaling.append(series[n_datapoints[0]] * scaling_factor)
linear_scaling = pd.Series(linear_scaling, index=n_datapoints)
print(series)
print(linear_scaling)
series.plot(ax=ax, logx=True, logy=True, label="pandera validation runtime")
linear_scaling.plot(ax=ax, logx=True, logy=True, label="linear scaling")
ax.set_xlabel("n datapoints")
ax.set_ylabel("seconds")
plt.legend()
plt.savefig("foo_runtimes_2.png")
plt.close(fig)

The problem occurred right here: https://github.com/pandera-dev/pandera/blob/master/pandera/errors.py#L107-L108

Since NA values are not equal to each other, the failure_cases aggregation produced a list of all the failure cases instead of collapsing them into a single NA value.
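To illustrate the underlying behavior (a toy demonstration, not pandera internals):

import numpy as np
import pandas as pd

# NaN compares unequal to itself, so equality-based aggregation sees every
# null failure case as a distinct value...
print(np.nan == np.nan)  # False

# ...whereas pandas' drop_duplicates treats NaNs as duplicates of each other
# and collapses them into a single failure case.
print(pd.Series([np.nan] * 5 + [1.0]).drop_duplicates())  # one NaN row, one 1.0 row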

I'll push up a fix for this, but for posterity, I did some basic profiling:

Before fix: Awful scaling for cases with a lot of null values

[plot: foo_runtimes_before]

After fix: Sublinear scaling

[plot: foo_runtimes_after]

As you can see, runtime grows with the number of datapoints in both cases, but before the fix it blows up much more quickly than after.

To give a better sense of runtimes at really large numbers of datapoints (which would take too long for the before case) here's a plot of the runtime after the fix from 10 to 10^8 datapoints:

[plot: foo_runtimes_after_scaling]

cosmicBboy added a commit that referenced this issue Oct 14, 2021
fixes #652

This PR fixes an issue where setting `lazy=True` with a schema
where `nullable=False` and there are a lot of null values causes
severe performance issues in the ~500,000-row dataframe case.

The fix is to drop duplicates when aggregating failure cases
and to remove unnecessary data processing of lazily collected
failure cases.
cosmicBboy added a commit that referenced this issue Oct 14, 2021
* improve lazy validation performance for nullable cases

fixes #652

This PR fixes an issue where setting `lazy=True` with a schema
where `nullable=False` and there are a lot of null values causes
severe performance issues in the ~500,000-row dataframe case.

The fix is to drop duplicates when aggregating failure cases
and to remove unnecessary data processing of lazily collected
failure cases.

* reintroduce sorting/dropping of duplicates
@Lavi2015
Author

Lavi2015 commented Oct 14, 2021

Hi @cosmicBboy, thanks for your response; I appreciate the details.
Please let me know when this fix is pushed. I will test and update here. Thanks again!

@cosmicBboy
Collaborator

cosmicBboy commented Oct 14, 2021

Hi @Lavi2015, the fix #655 should be available now on the dev branch. Please check it out via a development installation (make sure to check out the dev branch) and let me know if that fixes your issue!

@Lavi2015
Author

Lavi2015 commented Oct 20, 2021

Hi @cosmicBboy, thank you so much for the release.
I installed the dev version and completed testing on the same dataset mentioned earlier:

As another example from the same dataset: when the number of NULL values in one particular column is around 169337 (out of 594384 rows), it took almost 17 minutes for pandera to complete the schema validation. I am basically doing lazy validation and segregating all failed rows into a dataframe for further evaluation.

With the latest dev version, it's taking around 15 mins to complete the schema check. Please let me know if you need any further details. Thanks again!

@cosmicBboy
Collaborator

To see if there's any way to speed things up, can you please provide:

  1. an equivalent example of the schema that you're using; ideally it would have all the data types and checks that you're trying to apply.
  2. a representative toy dataset (written as a pandas.DataFrame literal or csv string) that I would be able to scale up with pd.concat([df] * n) and that produces the same runtime behavior.

On your side, just to make sure we have similar runtimes, could you run the code below and copy-paste the output here?

import time

import pandas as pd
import pandera as pa
from matplotlib import pyplot as plt


schema = pa.DataFrameSchema({
    "\d+": pa.Column(float, nullable=False, regex=True)
})
times = {}
df_allocation_times = {}

n_datapoints = [10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]
for n in n_datapoints:
    start = time.time()
    df = pd.DataFrame(
        {
            i: [None] * n + [1.0] * 10
            # 50 columns, n + 10 rows
            for i in range(50)
        }
    ).astype(float)
    df_allocation_times[n] = time.time() - start

    start = time.time()
    try:
        print(f"validating df with {n} x 50 datapoints")
        print(schema.validate(df, lazy=True))
    except Exception as exc:
        print(exc.failure_cases.shape)
    finally:
        runtime = time.time() - start
        print(f"time: {runtime}\n")
        times[n] = runtime

fig, ax = plt.subplots()
series = pd.Series(times)

linear_scaling = []
for n in n_datapoints:
    scaling_factor = n / n_datapoints[0]
    linear_scaling.append(series[n_datapoints[0]] * scaling_factor)
linear_scaling = pd.Series(linear_scaling, index=n_datapoints)
print("df allocation time")
print(pd.Series(df_allocation_times))
print("pandera scaling")
print(series)
print("linear scaling")
print(linear_scaling)

series.plot(ax=ax, logx=True, logy=True, label="pandera validation runtime")
linear_scaling.plot(ax=ax, logx=True, logy=True, label="linear scaling")
ax.set_xlabel("n datapoints")
ax.set_ylabel("seconds")
plt.legend()
plt.savefig("foo_runtimes_2.png")
plt.close(fig)

What I'm getting is:

df allocation time
10            0.001907
100           0.003104
1000          0.017631
10000         0.152901
100000        1.441880
1000000      15.270333
10000000    185.323959
dtype: float64
pandera scaling
10           0.094389
100          0.019288
1000         0.021592
10000        0.038878
100000       0.227554
1000000      3.134760
10000000    36.452018
dtype: float64
linear scaling
10              0.094389
100             0.943890
1000            9.438896
10000          94.388962
100000        943.889618
1000000      9438.896179
10000000    94388.961792
dtype: float64

@Lavi2015
Author

Lavi2015 commented Oct 22, 2021

Hi @cosmicBboy, thank you so much for your response, and sorry for the delay.

Please find the schema and test data below.

import pandas as pd
import pandera as pa
import os
from pandera import Check, Column, DataFrameSchema

df = pd.read_csv("test_data.csv", index_col=None, low_memory=False)
schema = pa.DataFrameSchema(
    columns={
		"col1" : Column(str, nullable=True),
		"col2" : Column(str),
		"col3" : Column(str),
		"col4" :  Column(str),
		"col5" :  Column(str),
		"col6" : Column(str, nullable=True),
		"col7" : Column(int, nullable=True),
		"col8" : Column(float, nullable=True),
		"col9" :  Column(float, nullable=True),
		"col10" : Column(str, nullable=True),
		"col11" :  Column(int, nullable=True),
		"col12" :  Column(int, nullable=True),
		"col13" : Column(str, nullable=True),
		"col14" :  Column(str, nullable=True),
		"col15" :  Column(str, nullable=True),
		"col16" :  Column(str, nullable=True),
		"col17" : Column(str, nullable=True),
		"col18" : Column(float, nullable=True),
		"col19" : Column(float, nullable=True),
		"col20" : Column(str, nullable=True),
		"col21" :  Column(float, nullable=True),
		"col22" :  Column(float, nullable=True),
		"col23" :  Column(str, nullable=True),
		"col24" : Column(float, nullable=True),
		"col25" : Column(float, nullable=True),
		"col26" : Column(float, nullable=True),
		"col27" :  Column(float, nullable=True),
		"col28" :  Column(float, nullable=True),
		"col29" : Column(str, nullable=True),
		"col30" : Column(str, nullable=True),
		"col31" : Column(float, nullable=True),
		"col32" : Column(float, nullable=True),
		"col33" : Column(float, nullable=True),
		"col34" : Column(float, nullable=True),
		"col35" :  Column(float, nullable=True),
		"col36" :  Column(float, nullable=True),
		"col37" :  Column(float, nullable=True),
		"col38" :  Column(float, nullable=True),
		"col39" :  Column(str, nullable=True),
		"col40" : Column(str, nullable=True),
		"col41" : Column(str, nullable=True),
		"col42" :  Column(float, nullable=True)				    
    strict=True,
)

try:
    schema.validate(df, lazy=True)
    print("Schema validation is completed successfully")
except pa.errors.SchemaErrors as err:
    err.failure_cases.to_csv("rejected.csv", index=False)

I also ran your script on our server, but the program always hangs after a while and only partial results were generated, as shown below. If I manage to complete it, I will send all the results soon.

validating df with 10 x 50 datapoints
(10, 6)
time: 0.025631189346313477

validating df with 100 x 50 datapoints
(100, 6)
time: 0.02646636962890625

validating df with 1000 x 50 datapoints
(1000, 6)
time: 0.05078434944152832

validating df with 10000 x 50 datapoints
(10000, 6)
time: 2.2145442962646484

validating df with 100000 x 50 datapoints
(100000, 6)
time: 287.906635761261

validating df with 1000000 x 50 datapoints
(attached: test_data.csv)

@cosmicBboy
Collaborator

Hi @Lavi2015, thanks for the data and info! I'll see if I can reproduce your runtimes on my end.

From the partially completed output, it appears as though these are numbers from before the dev fix:

validating df with 100000 x 50 datapoints
(100000, 6)
time: 287.906635761261 (seconds)

vs.

100000       0.227554 (seconds)

Are you certain you have the right version installed? Can you try installing with:

pip install git+git://github.com/pandera-dev/pandera@dev
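You can also double-check which pandera build is actually being imported with something like this (a quick sketch):

import pandera as pa

# Print the installed version and the module path to confirm the dev
# checkout (rather than the released package) is the one being imported.
print(pa.__version__)
print(pa.__file__)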

@Lavi2015
Author

@cosmicBboy , Thanks for your response.

Apologies, please ignore my earlier stats; I ran them with the current release.
I am not able to install using pip install git+git://github.com/pandera-dev/pandera@dev due to a timeout error.

I have installed it as described under development installation.

The script hangs after producing partial results, as shown below.
validating df with 10 x 50 datapoints
(10, 6)
time: 0.02638387680053711

validating df with 100 x 50 datapoints
(100, 6)
time: 0.02531576156616211

validating df with 1000 x 50 datapoints
(1000, 6)
time: 0.05164813995361328

validating df with 10000 x 50 datapoints
(10000, 6)
time: 2.192143440246582

validating df with 100000 x 50 datapoints
(100000, 6)
time: 260.25472021102905

validating df with 1000000 x 50 datapoints

There is only a slight change in time, from 287.906635761261 seconds to 260.25472021102905 seconds.

@cosmicBboy
Collaborator

cosmicBboy commented Oct 23, 2021

Hi @Lavi2015, I'll try to run your example data and schema, though I still suspect something's up with your pandera installation.

Just to triple check, did you do:

git clone https://github.com/pandera-dev/pandera.git
cd pandera
git checkout dev  # << this is important
pip install -r requirements-dev.txt
pip install -e .

@Lavi2015
Author

Lavi2015 commented Oct 25, 2021

Hi @cosmicBboy ,
Thank you so much for providing the installation steps.
As I mentioned earlier, I had followed the steps under development installation.

Now I have reinstalled following the steps you provided, and all my test cases complete in under 25 seconds. I tested with up to 1 million null values in the dataset.

git clone https://github.com/pandera-dev/pandera.git
cd pandera
git checkout dev  # << this is important
pip install -r requirements-dev.txt
pip install -e .

It would be great if you could let me know whether this fix will be available in the next release, and if so, a tentative timeline for it.

Thanks again for your timely help!

@cosmicBboy
Collaborator

Hi @Lavi2015, great! I'm planning on cutting a new minor release, 0.8.0, in the next week or two, so this fix should be available with that release.

cosmicBboy added a commit that referenced this issue Nov 11, 2021
* improve lazy validation performance for nullable cases (fixes #652)
cosmicBboy added a commit that referenced this issue Nov 11, 2021
* improve lazy validation performance for nullable cases (fixes #652)