Tuning Spark Test Performance #989

abbywh · 2025-05-10T15:40:59Z

Summary

Lowering the default parallelism and keeping the metastore in memory avoids a few key bottlenecks around DataFrame saving and metastore bootstrapping. For example, repartitionAndWrite now completes in <100ms.

Also split the join tests into separate classes for better parallelism (this will only matter on a large build server, since parallelism is CPU-bound), which also gives more cohesive test groups and more readable test files.

Lowered some partition counts on JoinTests that were much larger than other tests.

Why / Goal

Run the Spark test suite a bit faster and more efficiently.

In my testing, this shaved ~10 minutes off the full run (30 -> 20 minutes on a large build server) and a minute or two off each individual test class.

Test Plan

[ N/A ] Added Unit Tests
[ N/A ] Covered by existing CI
[ N/A ] Integration tested

Checklist

[ N/A ] Documentation update

Reviewers

pengyu-hou

Thanks for the contribution!

spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala

pengyu-hou · 2025-05-12T22:48:20Z

@hzding621 how do you like the approach to split the join tests? It looks good to me.

tswitzer-netflix · 2025-05-13T13:19:40Z

spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala

+        .config("spark.default.parallelism", "2")
+        .config("spark.testing", "true")
+        .config("spark.ui.enabled", false)
+        .config("spark.sql.adaptive.enabled", true)
        .config("spark.local.dir", s"/tmp/$userName/$name")
        .config("spark.sql.warehouse.dir", s"$warehouseDir/data")
        .config("spark.hadoop.javax.jdo.option.ConnectionURL", metastoreDb)


Does the metastoreDb here need to match the in-memory one above?

Or should this be removed, I guess.

+1 Will change

@tswitzer-netflix Thanks!
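
On the question above about the duplicated ConnectionURL setting: as far as I understand, repeated `.config(...)` calls on the session builder with the same key keep only the last value, so a later on-disk URL would replace an earlier in-memory one. A minimal sketch that models this with a plain Map, so it runs without Spark on the classpath. The warehouse path and both JDBC URLs are assumptions for illustration; the exact values used in SparkSessionBuilder are not shown in this thread.

```scala
// Sketch only: models how repeated settings with the same key resolve.
// It does not call Spark; the Map stands in for the builder's option store.
object LocalTestConf {
  // Hypothetical warehouse path, for illustration only.
  val warehouseDir = "/tmp/chronon-test/warehouse"

  // Assumed URLs: an in-memory Derby database vs. one stored under the warehouse dir.
  val inMemoryMetastore = "jdbc:derby:memory:metastore_db;create=true"
  val onDiskMetastore = s"jdbc:derby:;databaseName=$warehouseDir/metastore_db;create=true"

  // Settings in the order they would be passed to .config(...).
  val settings: Seq[(String, String)] = Seq(
    "spark.hadoop.javax.jdo.option.ConnectionURL" -> inMemoryMetastore,
    "spark.default.parallelism" -> "2",
    "spark.testing" -> "true",
    "spark.ui.enabled" -> "false",
    "spark.sql.adaptive.enabled" -> "true",
    "spark.sql.warehouse.dir" -> s"$warehouseDir/data",
    // Same key again: this later entry replaces the in-memory URL above.
    "spark.hadoop.javax.jdo.option.ConnectionURL" -> onDiskMetastore
  )

  // Later entries overwrite earlier ones that share a key.
  val resolved: Map[String, String] =
    settings.foldLeft(Map.empty[String, String])((acc, kv) => acc + kv)

  def main(args: Array[String]): Unit =
    println(resolved("spark.hadoop.javax.jdo.option.ConnectionURL"))
}
```

Under these assumptions the resolved URL is the on-disk one, which is why dropping the later, redundant setting keeps the metastore in memory.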

nikhil-zlai

very cool!!

hzding621 · 2025-05-15T07:38:58Z

@hzding621 how do you like the approach to split the join tests? It looks good to me.

i like the idea 👍 the current JoinTest is too bloated to navigate

abbywh · 2025-05-15T14:41:50Z

@hzding621 how do you like the approach to split the join tests? It looks good to me.

i like the idea 👍 the current JoinTest is too bloated to navigate

There's probably a cleaner way to break it up than what I chose, but I figured it would be a fine start.

pengyu-hou

One last minor comment: use warn when the diff count is larger than 0. Otherwise, LGTM. Thanks for the contribution.

pengyu-hou · 2025-05-15T14:43:48Z

spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala

+      logger.debug(s"Diff count: ${diffCount}")
+      logger.debug(s"diff result rows")


let us use warn actually if the diff requires attention. it will be easier for debugging

Suggested change

-      logger.debug(s"Diff count: ${diffCount}")
-      logger.debug(s"diff result rows")
+      logger.warn(s"Diff count: ${diffCount}")
+      logger.warn(s"diff result rows")

pengyu-hou · 2025-05-15T14:44:05Z

spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala

@@ -916,22 +651,24 @@ class JoinTest {
                                     | JOIN queries
                                     | ON queries.item <=> part.item AND queries.ts <=> part.ts AND queries.ds <=> part.ds
                                     |""".stripMargin)
-    expected.show()
-
+    if (logger.isDebugEnabled){


where do we set the isDebugEnabled?

https://stackoverflow.com/questions/963492/in-log4j-does-checking-isdebugenabled-before-logging-improve-performance just to block the computation if debug is not set, we can change this to warn as well
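
To illustrate the point of the guard: in Scala, an interpolated string passed to a logging call is built before the call happens, so any expensive expression inside it (for example a Spark action like `diff.count()`) runs even when that log level is off. A minimal sketch of the difference, assuming a hypothetical `MiniLogger` that stands in for an slf4j-style logger and a hypothetical `expensiveCount()` that stands in for the Spark action:

```scala
// Sketch only: MiniLogger and expensiveCount are hypothetical stand-ins.
object GuardedLogging {
  // Mimics the two members used in the PR: isDebugEnabled and debug(msg).
  final class MiniLogger(val isDebugEnabled: Boolean) {
    def debug(msg: String): Unit = if (isDebugEnabled) println(s"DEBUG $msg")
  }

  // Counts how often the "expensive" work runs.
  var expensiveCalls = 0
  def expensiveCount(): Long = { expensiveCalls += 1; 42L }

  // The message (and the count inside it) is evaluated before debug() is entered.
  def unguarded(logger: MiniLogger): Unit =
    logger.debug(s"Diff count: ${expensiveCount()}")

  // The count only runs when debug output is actually enabled.
  def guarded(logger: MiniLogger): Unit =
    if (logger.isDebugEnabled) {
      logger.debug(s"Diff count: ${expensiveCount()}")
    }

  def main(args: Array[String]): Unit = {
    val quiet = new MiniLogger(isDebugEnabled = false)
    unguarded(quiet) // still pays for expensiveCount()
    guarded(quiet)   // skips it
    println(s"expensive calls: $expensiveCalls")
  }
}
```

With debug disabled, only the unguarded call pays for the computation. The same reasoning applies to a warn-level guard if the statements are switched to `logger.warn`.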

spark/src/test/scala/ai/chronon/spark/test/JoinBasicTest.scala

spark/src/test/scala/ai/chronon/spark/test/JoinBloomFilterTest.scala

pengyu-hou · 2025-05-15T14:48:48Z

spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala

+      logger.debug(s"Actual count: ${computed2.count()}")
+      logger.debug(s"Expected count: ${expected2.count()}")
+      logger.debug(s"Diff count: ${diff2.count()}")
+      logger.debug(s"Queries count: ${queries.count()}")
+      logger.debug(s"diff result rows")


Suggested change

-      logger.debug(s"Actual count: ${computed2.count()}")
-      logger.debug(s"Expected count: ${expected2.count()}")
-      logger.debug(s"Diff count: ${diff2.count()}")
-      logger.debug(s"Queries count: ${queries.count()}")
-      logger.debug(s"diff result rows")
+      logger.warn(s"Actual count: ${computed2.count()}")
+      logger.warn(s"Expected count: ${expected2.count()}")
+      logger.warn(s"Diff count: ${diff2.count()}")
+      logger.warn(s"Queries count: ${queries.count()}")
+      logger.warn(s"diff result rows")

pengyu-hou · 2025-05-15T14:49:07Z

spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala

+      logger.debug(s"Diff count: ${diff.count()}")
+      logger.debug(s"diff result rows")


Suggested change

-      logger.debug(s"Diff count: ${diff.count()}")
-      logger.debug(s"diff result rows")
+      logger.warn(s"Diff count: ${diff.count()}")
+      logger.warn(s"diff result rows")

pengyu-hou · 2025-05-15T14:49:18Z

spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala

+      logger.debug(s"Diff count: ${diff.count()}")
+      logger.debug(s"diff result rows")


Suggested change

-      logger.debug(s"Diff count: ${diff.count()}")
-      logger.debug(s"diff result rows")
+      logger.warn(s"Diff count: ${diff.count()}")
+      logger.warn(s"diff result rows")

pengyu-hou · 2025-05-15T14:49:28Z

spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala

+      logger.debug(s"Diff count: ${diff.count()}")
+      logger.debug(s"diff result rows")


Suggested change

-      logger.debug(s"Diff count: ${diff.count()}")
-      logger.debug(s"diff result rows")
+      logger.warn(s"Diff count: ${diff.count()}")
+      logger.warn(s"diff result rows")

pengyu-hou · 2025-05-15T14:51:38Z

@abbywh let me know what you think about the above comments and I can help merge it after it has been addressed.

abbywh · 2025-05-15T20:47:02Z

@pengyu-hou I'll update isDebugEnabled to isWarnEnabled shortly and it should be ready after that.

pengyu-hou · 2025-05-15T22:07:28Z

isDebugEnabled

@abbywh Thanks so much. Maybe we can just keep logger.warn without isWarnEnabled? Or do you think we can skip certain statements to speed things up?

abbywh · 2025-05-15T22:20:06Z

@pengyu-hou Yeah, I don't have the benchmarks (should have persisted them somewhere), but iirc it's decently quick (a few seconds) yet takes up a good bit of CPU power. I'm not opposed to cutting that part if you want; the spark tuning is the more significant part, but I'm not sure I understand the advantage.

pengyu-hou · 2025-05-16T17:07:59Z

@pengyu-hou Yeah, I don't have the benchmarks (should have persisted them somewhere), but iirc it's decently quick (a few seconds) yet takes up a good bit of CPU power. I'm not opposed to cutting that part if you want; the spark tuning is the more significant part, but I'm not sure I understand the advantage.

Ah, I see what you mean. The string construction takes a good bit of CPU power. It is mostly used for debugging purposes if anything goes wrong locally or in CI. But I guess it will help a lot in CI, because our CI is indeed consuming lots of CPU power. For the local debugging process, we can iterate and see if we need to adjust it.

This PR is good enough. Let's merge it! Thanks a lot!

Abby Whittier added 6 commits May 10, 2025 15:19

tuning local spark params

0bd331e

styling

92e0e7f

scalafmt

f938f44

logging/memory changes to speed up join test

487567a

seperating tests for parallelism

42ca188

splitting jointest for parallelism

bbd5b5f

abbywh marked this pull request as ready for review May 12, 2025 12:47

pengyu-hou reviewed May 12, 2025

abbywh and others added 3 commits May 12, 2025 23:07

Update spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala

c4c1ccf

Co-authored-by: Pengyu Hou <[email protected]> Signed-off-by: Abby Whittier <[email protected]>

Update spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala

1012ee3

Co-authored-by: Pengyu Hou <[email protected]> Signed-off-by: Abby Whittier <[email protected]>

Update spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala

5c29378

Co-authored-by: Pengyu Hou <[email protected]> Signed-off-by: Abby Whittier <[email protected]>

tswitzer-netflix suggested changes May 13, 2025

removing redundant config

102ef7f

nikhil-zlai approved these changes May 14, 2025

formatting

94e0be8

pengyu-hou approved these changes May 15, 2025

Apply suggestions from code review

0a48faf

Co-authored-by: Pengyu Hou <[email protected]> Signed-off-by: Abby Whittier <[email protected]>

changing debug to warn

9f908de

abbywh mentioned this pull request May 15, 2025

Iceberg unit tests, support Iceberg + nonhive catalogs, Iceberg Kryo Serializer #993

Merged

Merge branch 'main' into improve_test_performance

e4a44a3

pengyu-hou merged commit 3803f85 into airbnb:main May 16, 2025
7 checks passed
