-
Notifications
You must be signed in to change notification settings - Fork 0
feat: basic hudi support #437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes consolidate dependency declarations in the Changes
Suggested reviewers
Poem
Warning Review ran into problems🔥 ProblemsGitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository. Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings. 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🧰 Additional context used🧠 Learnings (1)spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
⏰ Context from checks skipped due to timeout of 90000ms (17)
🔇 Additional comments (2)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need any updates to the table format providers and friends?
Yep absolutely - will do that in a followup. This is just to get the jars here so we can run unit tests. |
6adc9ee
to
de5cf0e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
cloud_aws/src/test/scala/ai/chronon/integrations/aws/GlueCatalogTest.scala (1)
10-36
: Test structure looks good but consider adding cleanup.The test correctly verifies basic Hudi functionality with local metastore, but doesn't clean up the test table after execution.
Consider adding cleanup code to remove the test table after verification:
"basic round trip hudi table" should "work with local metastore" in { import spark.implicits._ val input = Set(1,2,3,4) val sourceDF = spark.sparkContext.parallelize(input.toSeq).toDF("id") sourceDF.write.format("hudi") .mode(SaveMode.Overwrite) .saveAsTable("test_hudi_table") val back = spark.table("test_hudi_table").select("id").as[Int].collect() assertEquals(input, back.toSet) + // Clean up + spark.sql("DROP TABLE IF EXISTS test_hudi_table") }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (6)
cloud_aws/BUILD.bazel
(2 hunks)cloud_aws/src/test/scala/ai/chronon/integrations/aws/GlueCatalogTest.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala
(1 hunks)tools/build_rules/dependencies/maven_repository.bzl
(3 hunks)tools/build_rules/dependencies/scala_2_12_repository.bzl
(1 hunks)tools/build_rules/dependencies/scala_2_13_repository.bzl
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#51
File: spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala:192-200
Timestamp: 2024-11-26T19:47:53.900Z
Learning: Only suggest registering Delta Lake action classes for serialization if they are actually used in the codebase.
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: streaming_tests
- GitHub Check: spark_tests
- GitHub Check: non_spark_tests
- GitHub Check: groupby_tests
- GitHub Check: analyzer_tests
- GitHub Check: join_tests
- GitHub Check: fetcher_tests
🔇 Additional comments (11)
tools/build_rules/dependencies/scala_2_12_repository.bzl (1)
41-41
: Adding Hudi dependency looks good.The added dependency for Hudi Spark bundle is properly placed and will enable Hudi support for Scala 2.12.
tools/build_rules/dependencies/scala_2_13_repository.bzl (1)
41-41
: Adding Hudi dependency looks good.The added dependency for Hudi Spark bundle is properly placed and will enable Hudi support for Scala 2.13.
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
173-188
: Kryo registrations for Hudi classes look good.All necessary Hudi model and utility classes are registered for proper serialization. Well organized.
cloud_aws/BUILD.bazel (5)
1-20
: Good refactoring of dependencies.Centralizing shared dependencies improves maintainability.
30-30
: Clean dependency reference.Using shared_libs variable simplifies maintenance.
49-49
: Simplified test dependencies structure.Combining shared_libs with test-specific dependencies provides clear organization.
13-13
:❓ Verification inconclusive
Confirm Hudi Spark bundle compatibility.
Verify Spark version compatibility with this Hudi bundle.
🏁 Script executed:
#!/bin/bash # Look for Spark version declarations rg -A 1 "spark.*version" --type bazelLength of output: 39
Action: Manually verify Spark version compatibility.
The search didn't reveal any Spark version declarations. Please confirm that the Hudi Spark3 bundle incloud_aws/BUILD.bazel
is compatible with your project's Spark version by checking other config files or documentation.
9-9
:❓ Verification inconclusive
Verify Hudi AWS bundle compatibility.
Ensure this version is compatible with your AWS environment.
🏁 Script executed:
#!/bin/bash # Check for Hudi compatibility issues rg -A 2 -B 2 "hudi" --type scalaLength of output: 4518
Action: Validate the Hudi AWS bundle version
- In
cloud_aws/BUILD.bazel
(line 9), the Maven artifact fororg.apache.hudi:hudi-aws-bundle
is declared.- The tests in
GlueCatalogTest.scala
and the registrations inChrononKryoRegistrator.scala
confirm Hudi usage.- Please verify that the specified bundle version is fully compatible with our AWS integrations.
tools/build_rules/dependencies/maven_repository.bzl (3)
27-27
: Performance enhancement with jackson-module-afterburner.This module improves Jackson serialization performance.
47-47
: Core dependency for Hudi AWS support.This bundle is essential for the PR objective.
131-132
:❓ Verification inconclusive
Dependency exclusions may need documentation.
Clarify why parquet-avro and zookeeper are excluded.
🏁 Script executed:
#!/bin/bash # Check for potential transitive dependencies that might include these rg -A 5 "parquet|zookeeper" --type scalaLength of output: 25547
Clarify exclusion rationale in maven_repository.bzl
Please add a brief comment near lines 131–132 intools/build_rules/dependencies/maven_repository.bzl
explaining why these two exclusions—org.apache.parquet:parquet-avro
andorg.apache.zookeeper:zookeeper
—are necessary (e.g. to avoid transitive dependency conflicts or reduce unnecessary overhead). The grep results indicate that "parquet" is widely used in our codebase, so noting the specific conflict or legacy issue would be very helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
197-214
: Rename variable to match content typeVariable name refers to Delta but contains Hudi classes.
- val additionalDeltaNames = Seq( + val additionalHudiNames = Seq(
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (2)
cloud_aws/src/test/scala/ai/chronon/integrations/aws/GlueCatalogTest.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cloud_aws/src/test/scala/ai/chronon/integrations/aws/GlueCatalogTest.scala
🧰 Additional context used
🧠 Learnings (1)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#51
File: spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala:192-200
Timestamp: 2024-11-26T19:47:53.900Z
Learning: Only suggest registering Delta Lake action classes for serialization if they are actually used in the codebase.
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: streaming_tests
- GitHub Check: spark_tests
- GitHub Check: fetcher_tests
- GitHub Check: analyzer_tests
- GitHub Check: join_tests
- GitHub Check: groupby_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: non_spark_tests
🔇 Additional comments (2)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (2)
172-173
: LGTM!Added LongManifest for serialization.
194-217
: LGTM!Clean implementation of Hudi Kryo registrator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
194-215
: Rename variable for clarity.Variable name should reflect Hudi context.
- val additionalDeltaNames = Seq( + val additionalHudiNames = Seq(
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (2)
cloud_aws/src/test/scala/ai/chronon/integrations/aws/GlueCatalogTest.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cloud_aws/src/test/scala/ai/chronon/integrations/aws/GlueCatalogTest.scala
🧰 Additional context used
🧠 Learnings (1)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#51
File: spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala:192-200
Timestamp: 2024-11-26T19:47:53.900Z
Learning: Only suggest registering Delta Lake action classes for serialization if they are actually used in the codebase.
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: streaming_tests
- GitHub Check: groupby_tests
- GitHub Check: analyzer_tests
- GitHub Check: spark_tests
- GitHub Check: fetcher_tests
- GitHub Check: join_tests
- GitHub Check: non_spark_tests
- GitHub Check: bazel_config_tests
- GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (2)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (2)
172-173
: LGTM: Addition of LongManifest class.Adding this class ensures complete Kryo serialization coverage.
194-217
: LGTM: Well-structured Hudi serialization support.Follows established pattern from DeltaLake implementation.
6d7a096
to
5fe3159
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
172-173
: Duplicate registration detected.
scala.reflect.ManifestFactory$LongManifest
is already registered at line 159."org.apache.datasketches.kll.KllHeapFloatsSketch", "org.apache.datasketches.kll.KllSketch$SketchStructure", "org.apache.datasketches.kll.KllSketch$SketchType", -"scala.reflect.ManifestFactory$LongManifest"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (6)
cloud_aws/BUILD.bazel
(2 hunks)cloud_aws/src/test/scala/ai/chronon/integrations/aws/GlueCatalogTest.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala
(2 hunks)tools/build_rules/dependencies/maven_repository.bzl
(3 hunks)tools/build_rules/dependencies/scala_2_12_repository.bzl
(1 hunks)tools/build_rules/dependencies/scala_2_13_repository.bzl
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- cloud_aws/src/test/scala/ai/chronon/integrations/aws/GlueCatalogTest.scala
- tools/build_rules/dependencies/maven_repository.bzl
- tools/build_rules/dependencies/scala_2_12_repository.bzl
- tools/build_rules/dependencies/scala_2_13_repository.bzl
🧰 Additional context used
🧠 Learnings (1)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#51
File: spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala:192-200
Timestamp: 2024-11-26T19:47:53.900Z
Learning: Only suggest registering Delta Lake action classes for serialization if they are actually used in the codebase.
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: non_spark_tests
- GitHub Check: streaming_tests
- GitHub Check: streaming_tests
- GitHub Check: spark_tests
- GitHub Check: join_tests
- GitHub Check: join_tests
- GitHub Check: groupby_tests
- GitHub Check: groupby_tests
- GitHub Check: analyzer_tests
- GitHub Check: bazel_config_tests
- GitHub Check: analyzer_tests
- GitHub Check: fetcher_tests
- GitHub Check: spark_tests
- GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (6)
cloud_aws/BUILD.bazel (5)
1-20
: New shared_libs variable improves dependency management.Consolidates common dependencies for reuse across targets.
9-9
: Hudi dependencies added to support basic Hudi functionality.Consistent with PR objective to add Hudi support.
Also applies to: 13-13
30-30
: Simplified cloud_aws_lib deps using shared_libs.Good refactoring to improve maintenance.
33-41
: Test dependencies updated for Hudi testing.Added mockito and DynamoDBLocal for testing.
49-49
: Test suite now uses shared_libs directly.Eliminates the need for separate test_lib.
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
194-217
: Implementation looks good.New Hudi Kryo registrator properly extends base class and registers essential Hudi classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
cloud_aws/BUILD.bazel (1)
1-20
: Move test dependencies from shared_libs to test_deps.JUnit dependencies at lines 10-11 should be in test_deps rather than shared_libs.
shared_libs = [ maven_artifact("software.amazon.awssdk:dynamodb"), maven_artifact("software.amazon.awssdk:regions"), maven_artifact("software.amazon.awssdk:aws-core"), maven_artifact("software.amazon.awssdk:sdk-core"), maven_artifact("software.amazon.awssdk:utils"), maven_artifact("com.google.guava:guava"), maven_artifact("org.slf4j:slf4j-api"), maven_artifact("org.apache.hudi:hudi-aws-bundle"), - maven_artifact("junit:junit"), - maven_artifact("com.novocode:junit-interface"), maven_artifact("com.fasterxml.jackson.module:jackson-module-afterburner"), scala_artifact_with_suffix("org.apache.hudi:hudi-spark3.5-bundle"), scala_artifact_with_suffix("org.scala-lang.modules:scala-collection-compat"), "//api:lib", "//api:thrift_java", "//online:lib", "//spark:lib", "//tools/build_rules/spark:spark-exec", ]Then add to test_deps:
test_deps = [ maven_artifact("com.amazonaws:DynamoDBLocal"), maven_artifact("org.mockito:mockito-core"), maven_artifact("software.amazon.awssdk:auth"), maven_artifact("software.amazon.awssdk:identity-spi"), scala_artifact_with_suffix("com.chuusai:shapeless"), scala_artifact_with_suffix("org.typelevel:cats-core"), scala_artifact_with_suffix("org.mockito:mockito-scala"), + maven_artifact("junit:junit"), + maven_artifact("com.novocode:junit-interface"), ] + _CIRCE_DEPS + _SCALA_TEST_DEPS
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (1)
cloud_aws/BUILD.bazel
(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (16)
- GitHub Check: non_spark_tests
- GitHub Check: streaming_tests
- GitHub Check: non_spark_tests
- GitHub Check: join_tests
- GitHub Check: analyzer_tests
- GitHub Check: groupby_tests
- GitHub Check: spark_tests
- GitHub Check: streaming_tests
- GitHub Check: fetcher_tests
- GitHub Check: spark_tests
- GitHub Check: join_tests
- GitHub Check: fetcher_tests
- GitHub Check: groupby_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: bazel_config_tests
- GitHub Check: analyzer_tests
🔇 Additional comments (3)
cloud_aws/BUILD.bazel (3)
30-30
: Good refactoring!Using the shared_libs variable improves maintainability.
33-41
: Appropriate test dependencies added.Mockito dependencies are suitable for the test environment.
49-49
: Proper dependency management.The updated dependency structure correctly uses both shared_libs and test_deps.
Co-authored-by: Thomas Chow <[email protected]>
af45447
to
f809912
Compare
Co-authored-by: Thomas Chow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Consolidated and streamlined build dependencies for improved integration with AWS services and data processing libraries. - Expanded the set of supported third-party libraries, including new artifacts for enhanced performance and compatibility. - Added new dependencies for Hudi, Jackson, and Zookeeper to enhance functionality. - Introduced additional Hudi artifacts for Scala 2.12 and 2.13 to broaden available functionalities. - **Tests** - Added a new test class to verify reliable write/read operations on Hudi tables using a Spark session. - **Refactor** - Enhanced serialization registration to support a broader range of data types, improving overall processing stability. - Introduced a new variable for shared library dependencies to simplify dependency management. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Consolidated and streamlined build dependencies for improved integration with AWS services and data processing libraries. - Expanded the set of supported third-party libraries, including new artifacts for enhanced performance and compatibility. - Added new dependencies for Hudi, Jackson, and Zookeeper to enhance functionality. - Introduced additional Hudi artifacts for Scala 2.12 and 2.13 to broaden available functionalities. - **Tests** - Added a new test class to verify reliable write/read operations on Hudi tables using a Spark session. - **Refactor** - Enhanced serialization registration to support a broader range of data types, improving overall processing stability. - Introduced a new variable for shared library dependencies to simplify dependency management. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Consolidated and streamlined build dependencies for improved integration with AWS services and data processing libraries. - Expanded the set of supported third-party libraries, including new artifacts for enhanced performance and compatibility. - Added new dependencies for Hudi, Jackson, and Zookeeper to enhance functionality. - Introduced additional Hudi artifacts for Scala 2.12 and 2.13 to broaden available functionalities. - **Tests** - Added a new test class to verify reliable write/read operations on Hudi tables using a Spark session. - **Refactor** - Enhanced serialization registration to support a broader range of data types, improving overall processing stability. - Introduced a new variable for shared library dependencies to simplify dependency management. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Consolidated and streamlined build dependencies for improved integration with AWS services and data processing libraries. - Expanded the set of supported third-party libraries, including new artifacts for enhanced performance and compatibility. - Added new dependencies for Hudi, Jackson, and Zookeeper to enhance functionality. - Introduced additional Hudi artifacts for Scala 2.12 and 2.13 to broaden available functionalities. - **Tests** - Added a new test class to verify reliable write/read operations on Hudi tables using a Spark session. - **Refactor** - Enhanced serialization registration to support a broader range of data types, improving overall processing stability. - Introduced a new variable for shared library dependencies to simplify dependency management. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary ## Cheour clientslist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Consolidated and streamlined build dependencies for improved integration with AWS services and data processing libraries. - Expanded the set of supported third-party libraries, including new artifacts for enhanced performance and compatibility. - Added new dependencies for Hudi, Jaour clientsson, and Zookeeper to enhance functionality. - Introduced additional Hudi artifacts for Scala 2.12 and 2.13 to broaden available functionalities. - **Tests** - Added a new test class to verify reliable write/read operations on Hudi tables using a Spark session. - **Refactor** - Enhanced serialization registration to support a broader range of data types, improving overall processing stability. - Introduced a new variable for shared library dependencies to simplify dependency management. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to traour clients the status of staour clientss when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
Summary
Checklist
Summary by CodeRabbit
Chores
Tests
Refactor