Time taken to complete schema validation increases when there are more NULL values in the dataset #652
Comments
Hi @Lavi2015, thanks for the question! This looks like a pretty nasty performance issue. I'm trying to reproduce it, and it would help if you could provide me with:
I was able to reproduce your issue with the following script:

```python
import time
import pandas as pd
import pandera as pa
from matplotlib import pyplot as plt

schema = pa.DataFrameSchema({"foo": pa.Column(float, nullable=False)})

times = {}
n_datapoints = [10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]
for n in n_datapoints:
    df = pd.DataFrame({"foo": [None] * n + [1.0] * 10}).astype(float)
    start = time.time()
    try:
        print(f"validating df with {n} datapoints")
        print(df.head())
        print(df.dtypes)
        print(schema.validate(df, lazy=True))
    except Exception as e:
        print(e.failure_cases.shape)
    finally:
        runtime = time.time() - start
        print(f"time: {runtime}\n")
        times[n] = runtime

fig, ax = plt.subplots()
series = pd.Series(times)
linear_scaling = []
for n in n_datapoints:
    scaling_factor = n / n_datapoints[0]
    linear_scaling.append(series[n_datapoints[0]] * scaling_factor)
linear_scaling = pd.Series(linear_scaling, index=n_datapoints)
print(series)
print(linear_scaling)
series.plot(ax=ax, logx=True, logy=True, label="pandera validation runtime")
linear_scaling.plot(ax=ax, logx=True, logy=True, label="linear scaling")
ax.set_xlabel("n datapoints")
ax.set_ylabel("seconds")
plt.legend()
plt.savefig("foo_runtimes_2.png")
plt.close(fig)
```

The problem occurred right here: https://github.com/pandera-dev/pandera/blob/master/pandera/errors.py#L107-L108 Since …

I'll push up a fix for this, but for posterity, I did some basic profiling:

[figure: before fix, awful scaling for cases with a lot of null values]
[figure: after fix, sublinear scaling]

As you can see, both runtime complexities follow an exponential curve, but the scaling factor … To give a better sense of runtimes at really large numbers of datapoints (which would take too long for the …
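For anyone who wants to trace hotspots like this themselves, here is a minimal sketch (not the exact profiling workflow used above) that runs the same kind of validation under Python's built-in cProfile; the schema, column name, and row count are just illustrative.

```python
import cProfile
import pstats

import pandas as pd
import pandera as pa
from pandera.errors import SchemaErrors

# Illustrative schema and data, mirroring the reproduction script above.
schema = pa.DataFrameSchema({"foo": pa.Column(float, nullable=False)})
df = pd.DataFrame({"foo": [None] * 100_000 + [1.0] * 10}).astype(float)

profiler = cProfile.Profile()
profiler.enable()
try:
    schema.validate(df, lazy=True)
except SchemaErrors:
    # Validation is expected to fail here; we only care where the time goes.
    pass
profiler.disable()

# Show the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```

The resulting call tree should point at whichever aggregation step dominates, e.g. the failure-case handling in errors.py linked above.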
fixes #652

This PR fixes an issue where setting `lazy=True` with a schema where `nullable=False` and there are a lot of null values causes severe performance issues in the ~500,000-row dataframe case. The fix is to drop duplicates when aggregating failure cases and remove unnecessary data processing of lazily collected failure cases.

* improve lazy validation performance for nullable cases
* reintroduce sorting/dropping of duplicates
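As a rough illustration of why dropping duplicates helps (this is only a toy sketch of the idea, not pandera's actual internals): deduplicating before any further aggregation means the downstream sorting and formatting work scales with the number of distinct failures rather than the number of failing rows.

```python
import pandas as pd


def summarize_failure_cases(failure_cases: pd.DataFrame) -> pd.DataFrame:
    """Toy stand-in for aggregating lazily collected failure cases."""
    # Keep one row per distinct (column, check, failure_case) combination so
    # that later processing does not scale with the number of failing rows.
    deduped = failure_cases.drop_duplicates(subset=["column", "check", "failure_case"])
    return deduped.sort_values(["column", "check"]).reset_index(drop=True)


# Hypothetical failure cases: 500,000 rows all failing the same nullable check.
failure_cases = pd.DataFrame(
    {
        "column": ["foo"] * 500_000,
        "check": ["not_nullable"] * 500_000,
        "failure_case": [None] * 500_000,
    }
)
print(summarize_failure_cases(failure_cases))  # collapses to a single row
```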
Hi @cosmicBboy, thanks for your response; I appreciate the details.
Hi @Lavi2015, the fix #655 should be available now on the …
Hi @cosmicBboy, thank you so much for the release.
With the latest dev version, it's taking around 15 minutes to complete the schema check. Please let me know if you need any further details. Thanks again!
To see if there's any way to speed things up, can you please provide:
On your side, just to make sure we have similar runtimes, could you run the code below and copy-paste the output here?

```python
import time
import pandas as pd
import pandera as pa
from matplotlib import pyplot as plt

schema = pa.DataFrameSchema({
    r"\d+": pa.Column(float, nullable=False, regex=True)
})

times = {}
df_allocation_times = {}
n_datapoints = [10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]
for n in n_datapoints:
    start = time.time()
    df = pd.DataFrame(
        {
            i: [None] * n + [1.0] * 10
            # 50 columns, n + 10 rows
            for i in range(50)
        }
    ).astype(float)
    df_allocation_times[n] = time.time() - start
    start = time.time()
    try:
        print(f"validating df with {n} x 50 datapoints")
        print(schema.validate(df, lazy=True))
    except Exception as exc:
        print(exc.failure_cases.shape)
    finally:
        runtime = time.time() - start
        print(f"time: {runtime}\n")
        times[n] = runtime

fig, ax = plt.subplots()
series = pd.Series(times)
linear_scaling = []
for n in n_datapoints:
    scaling_factor = n / n_datapoints[0]
    linear_scaling.append(series[n_datapoints[0]] * scaling_factor)
linear_scaling = pd.Series(linear_scaling, index=n_datapoints)
print("df allocation time")
print(pd.Series(df_allocation_times))
print("pandera scaling")
print(series)
print("linear scaling")
print(linear_scaling)
series.plot(ax=ax, logx=True, logy=True, label="pandera validation runtime")
linear_scaling.plot(ax=ax, logx=True, logy=True, label="linear scaling")
ax.set_xlabel("n datapoints")
ax.set_ylabel("seconds")
plt.legend()
plt.savefig("foo_runtimes_2.png")
plt.close(fig)
```

What I'm getting is:
Hi @cosmicBboy, thank you so much for your response, and sorry for the delay. Please find the schema and test data below.
I also ran your script on our server, but the program always hangs after a while, and only partial results were generated, as below. If I manage to complete the run, I will send all the results soon.

```
validating df with 10 x 50 datapoints
validating df with 100 x 50 datapoints
validating df with 1000 x 50 datapoints
validating df with 10000 x 50 datapoints
validating df with 100000 x 50 datapoints
validating df with 1000000 x 50 datapoints
```
Hi @Lavi2015, thanks for the data and info! I'll see if I can reproduce your runtimes on my end. From the partially completed output, it appears as though these are numbers from before the dev fix:
vs.
Are you certain you have the right version installed? Can you try installing with:
@cosmicBboy, thanks for your response. Apologies, please ignore my earlier stats, as I ran them with the current release. I have now installed pandera as mentioned under the development installation instructions, but the script still hangs after producing partial results, as below.

```
validating df with 100 x 50 datapoints
validating df with 1000 x 50 datapoints
validating df with 10000 x 50 datapoints
validating df with 100000 x 50 datapoints
validating df with 1000000 x 50 datapoints
```

There is only a slight change in the time from …
hi @Lavi2015, I'll try to run your example data and schema, tho I still suspect something's up with your pandera installation. Just to triple check, did you do:
Hi @cosmicBboy, I have now reinstalled as you mentioned and followed the steps. All my test cases complete in under 25 seconds. I tested with up to 1 million null values in the dataset.
It would be great if you could let me know whether this fix will be available in the next release, and a tentative timeline for it. Thanks again for your timely help!
hi @Lavi2015 great! I'm planning on cutting a new minor release …
Hi,
I am trying to use pandera for schema validation and it works fine and completes within seconds most of the time.
My raw dataset is around 250 MB, the dataframe shape is (594384, 42), and I am testing with `nullable=False` on a few columns to check the performance.
When I run schema validation on a column where the data points are null for 594338 out of 594384 records (essentially only 46 rows have data), and hence try to write those 594338 rows into rejected.csv, my script is never able to complete. I have observed similar behaviour on a few other columns during testing: the more NULL rows there are, the longer the schema check takes. I don't have a big data platform to test and develop on, and hence may not be able to try Fugue.
In another example from the same dataset, when the number of NULL values in one particular column was around 169337 (out of 594384 rows), it took almost 17 minutes for pandera to complete the schema validation. I am basically doing lazy validation and segregating all failed rows into a dataframe for further evaluation.
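For reference, a minimal sketch of that kind of workflow, assuming the `failure_cases` frame raised by lazy validation carries the failing row labels in its `index` column; the column name and file name here are hypothetical.

```python
import pandas as pd
import pandera as pa
from pandera.errors import SchemaErrors

# Hypothetical schema and data standing in for the real 594384 x 42 dataset.
schema = pa.DataFrameSchema({"amount": pa.Column(float, nullable=False)})
df = pd.DataFrame({"amount": [1.0, None, 3.0, None]})

try:
    validated = schema.validate(df, lazy=True)
    rejected = df.iloc[0:0]  # no failures: empty frame with the same columns
except SchemaErrors as err:
    # failure_cases has one row per failing value; its "index" column points
    # back at the offending row labels in the original dataframe.
    failed_index = err.failure_cases["index"].dropna().unique()
    rejected = df.loc[failed_index]
    validated = df.drop(index=failed_index)

rejected.to_csv("rejected.csv", index=False)
```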
I am trying to automate the whole process in an AWS Lambda function: download the data from S3, run the Lambda function for schema validation, and write rejected.csv (if any) back to S3. Hence the time taken to complete the pandera schema validation is crucial.
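A hedged sketch of what such a pipeline could look like; the bucket names, keys, and schema are placeholders, and `/tmp` is used because it is the only writable path inside Lambda.

```python
import boto3
import pandas as pd
import pandera as pa
from pandera.errors import SchemaErrors

s3 = boto3.client("s3")

# Placeholder schema; swap in the real DataFrameSchema for the 42 columns.
schema = pa.DataFrameSchema({"amount": pa.Column(float, nullable=False)})


def handler(event, context):
    # Hypothetical bucket/key names.
    s3.download_file("my-input-bucket", "raw/data.csv", "/tmp/data.csv")
    df = pd.read_csv("/tmp/data.csv")

    try:
        schema.validate(df, lazy=True)
    except SchemaErrors as err:
        # Reuse the failed-row split from the previous sketch.
        failed_index = err.failure_cases["index"].dropna().unique()
        df.loc[failed_index].to_csv("/tmp/rejected.csv", index=False)
        s3.upload_file("/tmp/rejected.csv", "my-output-bucket", "rejected/rejected.csv")
        return {"status": "rejected_rows_written", "n_rejected": int(len(failed_index))}

    return {"status": "ok", "n_rejected": 0}
```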
Do you have any suggestions to optimize and speed up schema validation? In my testing, it only takes a long time to complete when the number of null values in the dataset is large, so this could be a big issue to resolve.
Thanks and Regards