Skip to content

Tuning Spark Test Performance #989

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
May 16, 2025
Merged

Conversation

abbywh
Copy link
Contributor

@abbywh abbywh commented May 10, 2025

Summary

By lowering default parallelism and keeping the metastore in memory, a few key bottlenecks around df saving and metastore bootstrapping are avoided, for example repartitionAndWrite now happens in <100ms.

Also separated out the join for better parallelism (this will only matter on a large build server due to cpu bounds) but also more cohesive testing groups/readable test files.

Lowered some partition counts on JoinTests that were much larger than other tests.

Why / Goal

Test the spark suite a bit faster/more efficiently

My local test performance shaved ~10 minutes (30->20 minutes on a large build server) and a minute or two per single test class.

Test Plan

  • [ N/A ] Added Unit Tests
  • [ N/A ] Covered by existing CI
  • [ N/A ] Integration tested

Checklist

  • [ N/A] Documentation update

Reviewers

@abbywh abbywh marked this pull request as ready for review May 12, 2025 12:47
Copy link
Collaborator

@pengyu-hou pengyu-hou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

@pengyu-hou
Copy link
Collaborator

@hzding621 how do you like the approach to split the join tests? It looks good to me.

.config("spark.default.parallelism", "2")
.config("spark.testing", "true")
.config("spark.ui.enabled", false)
.config("spark.sql.adaptive.enabled", true)
.config("spark.local.dir", s"/tmp/$userName/$name")
.config("spark.sql.warehouse.dir", s"$warehouseDir/data")
.config("spark.hadoop.javax.jdo.option.ConnectionURL", metastoreDb)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the metastoreDb here need to match the in-memory one above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or should this be removed, I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Will change

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@nikhil-zlai nikhil-zlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool!!

@hzding621
Copy link
Collaborator

@hzding621 how do you like the approach to split the join tests? It looks good to me.

i like the idea 👍 the current JoinTest is too bloated to navigate

@abbywh
Copy link
Contributor Author

abbywh commented May 15, 2025

@hzding621 how do you like the approach to split the join tests? It looks good to me.

i like the idea 👍 the current JoinTest is too bloated to navigate

There's probably a more clean way to break it up than what I chose but I figured it would be a fine start

Copy link
Collaborator

@pengyu-hou pengyu-hou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last minor comment to use warn when diff count is larger than 0. Otherwise, LGMT. Thanks for the contribution.

Comment on lines +663 to +664
logger.debug(s"Diff count: ${diffCount}")
logger.debug(s"diff result rows")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let us use warn actually if the diff requires attention. it will be easier for debugging

Suggested change
logger.debug(s"Diff count: ${diffCount}")
logger.debug(s"diff result rows")
logger.warn(s"Diff count: ${diffCount}")
logger.warn(s"diff result rows")

@@ -916,22 +651,24 @@ class JoinTest {
| JOIN queries
| ON queries.item <=> part.item AND queries.ts <=> part.ts AND queries.ds <=> part.ds
|""".stripMargin)
expected.show()

if (logger.isDebugEnabled){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where do we set the isDebugEnabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://stackoverflow.com/questions/963492/in-log4j-does-checking-isdebugenabled-before-logging-improve-performance just to block the computation if debug is not set, we can change this to warn as well

Comment on lines +253 to +257
logger.debug(s"Actual count: ${computed2.count()}")
logger.debug(s"Expected count: ${expected2.count()}")
logger.debug(s"Diff count: ${diff2.count()}")
logger.debug(s"Queries count: ${queries.count()}")
logger.debug(s"diff result rows")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.debug(s"Actual count: ${computed2.count()}")
logger.debug(s"Expected count: ${expected2.count()}")
logger.debug(s"Diff count: ${diff2.count()}")
logger.debug(s"Queries count: ${queries.count()}")
logger.debug(s"diff result rows")
logger.warn(s"Actual count: ${computed2.count()}")
logger.warn(s"Expected count: ${expected2.count()}")
logger.warn(s"Diff count: ${diff2.count()}")
logger.warn(s"Queries count: ${queries.count()}")
logger.warn(s"diff result rows")

Comment on lines +335 to +336
logger.debug(s"Diff count: ${diff.count()}")
logger.debug(s"diff result rows")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.debug(s"Diff count: ${diff.count()}")
logger.debug(s"diff result rows")
logger.warn(s"Diff count: ${diff.count()}")
logger.warn(s"diff result rows")

Comment on lines +413 to +414
logger.debug(s"Diff count: ${diff.count()}")
logger.debug(s"diff result rows")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.debug(s"Diff count: ${diff.count()}")
logger.debug(s"diff result rows")
logger.warn(s"Diff count: ${diff.count()}")
logger.warn(s"diff result rows")

Comment on lines +481 to +482
logger.debug(s"Diff count: ${diff.count()}")
logger.debug(s"diff result rows")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.debug(s"Diff count: ${diff.count()}")
logger.debug(s"diff result rows")
logger.warn(s"Diff count: ${diff.count()}")
logger.warn(s"diff result rows")

@pengyu-hou
Copy link
Collaborator

@abbywh let me know what you think about the above comments and I can help merge it after it has been addressed.

Co-authored-by: Pengyu Hou <[email protected]>
Signed-off-by: Abby Whittier <[email protected]>
@abbywh
Copy link
Contributor Author

abbywh commented May 15, 2025

@pengyu-hou I'll update isDebugEnabled to isWarnedEnabled shortly and it should be ready after that

@pengyu-hou
Copy link
Collaborator

isDebugEnabled

@abbywh Thanks so much. Maybe we can just keep logger.warn without isWarnEnabled? Or do you think if we can skip certain statements to speed up?

@abbywh
Copy link
Contributor Author

abbywh commented May 15, 2025

@pengyu-hou Yeah, I don't have the benchmarks (should have persisted them somewhere) but iirc it decently quick (a few seconds) but takes up a good bit of CPU power. I'm not opposed to cutting that part if you want, the spark tuning is the more significant part, but I'm not sure I understand the advantage.

@pengyu-hou
Copy link
Collaborator

@pengyu-hou Yeah, I don't have the benchmarks (should have persisted them somewhere) but iirc it decently quick (a few seconds) but takes up a good bit of CPU power. I'm not opposed to cutting that part if you want, the spark tuning is the more significant part, but I'm not sure I understand the advantage.

Ah I see what you mean. The string construction takes a good bit of CPU power. It is mostly used for debugging purpose if anything goes wrong in local or CI. But I guess it will improve a lot in CI because our CI is indeed consuming lots of CPU power. For local debugging process, we can iterate and see if we need to adjust it.

This PR is good enough. Let's merge it! Thanks a lot!

@pengyu-hou pengyu-hou merged commit 3803f85 into airbnb:main May 16, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants