-
Notifications
You must be signed in to change notification settings - Fork 70
Tuning Spark Test Performance #989
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
Outdated
Show resolved
Hide resolved
spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
Outdated
Show resolved
Hide resolved
spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
Outdated
Show resolved
Hide resolved
@hzding621 how do you like the approach to split the join tests? It looks good to me. |
Co-authored-by: Pengyu Hou <[email protected]> Signed-off-by: Abby Whittier <[email protected]>
Co-authored-by: Pengyu Hou <[email protected]> Signed-off-by: Abby Whittier <[email protected]>
Co-authored-by: Pengyu Hou <[email protected]> Signed-off-by: Abby Whittier <[email protected]>
.config("spark.default.parallelism", "2") | ||
.config("spark.testing", "true") | ||
.config("spark.ui.enabled", false) | ||
.config("spark.sql.adaptive.enabled", true) | ||
.config("spark.local.dir", s"/tmp/$userName/$name") | ||
.config("spark.sql.warehouse.dir", s"$warehouseDir/data") | ||
.config("spark.hadoop.javax.jdo.option.ConnectionURL", metastoreDb) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the metastoreDb here need to match the in-memory one above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or should this be removed, I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 Will change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tswitzer-netflix Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very cool!!
i like the idea 👍 the current |
There's probably a more clean way to break it up than what I chose but I figured it would be a fine start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last minor comment to use warn
when diff count is larger than 0. Otherwise, LGMT. Thanks for the contribution.
logger.debug(s"Diff count: ${diffCount}") | ||
logger.debug(s"diff result rows") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let us use warn
actually if the diff requires attention. it will be easier for debugging
logger.debug(s"Diff count: ${diffCount}") | |
logger.debug(s"diff result rows") | |
logger.warn(s"Diff count: ${diffCount}") | |
logger.warn(s"diff result rows") |
@@ -916,22 +651,24 @@ class JoinTest { | |||
| JOIN queries | |||
| ON queries.item <=> part.item AND queries.ts <=> part.ts AND queries.ds <=> part.ds | |||
|""".stripMargin) | |||
expected.show() | |||
|
|||
if (logger.isDebugEnabled){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where do we set the isDebugEnabled
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://stackoverflow.com/questions/963492/in-log4j-does-checking-isdebugenabled-before-logging-improve-performance just to block the computation if debug is not set, we can change this to warn as well
spark/src/test/scala/ai/chronon/spark/test/JoinBloomFilterTest.scala
Outdated
Show resolved
Hide resolved
logger.debug(s"Actual count: ${computed2.count()}") | ||
logger.debug(s"Expected count: ${expected2.count()}") | ||
logger.debug(s"Diff count: ${diff2.count()}") | ||
logger.debug(s"Queries count: ${queries.count()}") | ||
logger.debug(s"diff result rows") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.debug(s"Actual count: ${computed2.count()}") | |
logger.debug(s"Expected count: ${expected2.count()}") | |
logger.debug(s"Diff count: ${diff2.count()}") | |
logger.debug(s"Queries count: ${queries.count()}") | |
logger.debug(s"diff result rows") | |
logger.warn(s"Actual count: ${computed2.count()}") | |
logger.warn(s"Expected count: ${expected2.count()}") | |
logger.warn(s"Diff count: ${diff2.count()}") | |
logger.warn(s"Queries count: ${queries.count()}") | |
logger.warn(s"diff result rows") |
logger.debug(s"Diff count: ${diff.count()}") | ||
logger.debug(s"diff result rows") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.debug(s"Diff count: ${diff.count()}") | |
logger.debug(s"diff result rows") | |
logger.warn(s"Diff count: ${diff.count()}") | |
logger.warn(s"diff result rows") |
logger.debug(s"Diff count: ${diff.count()}") | ||
logger.debug(s"diff result rows") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.debug(s"Diff count: ${diff.count()}") | |
logger.debug(s"diff result rows") | |
logger.warn(s"Diff count: ${diff.count()}") | |
logger.warn(s"diff result rows") |
logger.debug(s"Diff count: ${diff.count()}") | ||
logger.debug(s"diff result rows") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.debug(s"Diff count: ${diff.count()}") | |
logger.debug(s"diff result rows") | |
logger.warn(s"Diff count: ${diff.count()}") | |
logger.warn(s"diff result rows") |
@abbywh let me know what you think about the above comments and I can help merge it after it has been addressed. |
Co-authored-by: Pengyu Hou <[email protected]> Signed-off-by: Abby Whittier <[email protected]>
@pengyu-hou I'll update isDebugEnabled to isWarnedEnabled shortly and it should be ready after that |
@abbywh Thanks so much. Maybe we can just keep |
@pengyu-hou Yeah, I don't have the benchmarks (should have persisted them somewhere) but iirc it decently quick (a few seconds) but takes up a good bit of CPU power. I'm not opposed to cutting that part if you want, the spark tuning is the more significant part, but I'm not sure I understand the advantage. |
Ah I see what you mean. The string construction takes a good bit of CPU power. It is mostly used for debugging purpose if anything goes wrong in local or CI. But I guess it will improve a lot in CI because our CI is indeed consuming lots of CPU power. For local debugging process, we can iterate and see if we need to adjust it. This PR is good enough. Let's merge it! Thanks a lot! |
Summary
By lowering default parallelism and keeping the metastore in memory, a few key bottlenecks around df saving and metastore bootstrapping are avoided, for example
repartitionAndWrite
now happens in <100ms.Also separated out the join for better parallelism (this will only matter on a large build server due to cpu bounds) but also more cohesive testing groups/readable test files.
Lowered some partition counts on JoinTests that were much larger than other tests.
Why / Goal
Test the spark suite a bit faster/more efficiently
My local test performance shaved ~10 minutes (30->20 minutes on a large build server) and a minute or two per single test class.
Test Plan
Checklist
Reviewers