Add support for Delta lake tables in TableUtils #869


Merged (35 commits) on Nov 19, 2024

Conversation


@piyush-zlai piyush-zlai commented Oct 21, 2024

Summary

Add Delta Lake support in Chronon. This PR refactors TableUtils a bit to allow us to support Delta Lake (and other table formats in the future, like Hudi). The crux of the PR is the refactor of TableUtils: I've pulled out some of the format-specific logic into per-format objects (e.g. Hive / Iceberg / Delta) to cover aspects like creating tables and listing partitions.
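The per-format split described above can be sketched roughly as follows; the trait and member names here are illustrative assumptions for this PR description, not Chronon's actual interfaces:

```scala
// Illustrative sketch of per-format objects (Hive / Iceberg / Delta).
// Names and members are assumptions, not Chronon's real API.
sealed trait Format {
  def name: String
  // Clause appended to CREATE TABLE statements for this format
  def createTableTypeString: String
}

case object Hive extends Format {
  val name = "hive"
  val createTableTypeString = "" // plain Hive tables need no USING clause
}

case object Iceberg extends Format {
  val name = "iceberg"
  val createTableTypeString = "USING iceberg"
}

case object Delta extends Format {
  val name = "delta"
  val createTableTypeString = "USING DELTA"
}
```

Format-specific behavior (table creation, partition listing, etc.) then dispatches on the resolved Format instead of being inlined in TableUtils.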

As the versions of the various libraries we build against in Chronon are on the older side (e.g. Delta Lake 2.0.x for Spark 3.2), we've structured the code to allow users to override their format providers. This allows folks on, say, Delta 3.x (Spark 3.5) to write their own Delta3xFormatProvider built against the newer version and configure it via the spark.chronon.table.format_provider Spark config setting. Users on the same versions / rails as OSS can skip configuring this and we'll default to the DefaultFormatProvider (which maintains the existing Iceberg / Hive behavior along with support for Delta).
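The override mechanism might look roughly like this; the config key matches the one named above, but the surrounding classes and the reflection-based instantiation are simplified assumptions:

```scala
// Sketch of the format-provider override. `conf` stands in for the Spark
// conf; in Chronon the provider class would be read from the SparkSession.
trait FormatProvider {
  def readFormat(tableName: String): String
}

// Default behavior: existing Hive / Iceberg detection plus Delta support
class DefaultFormatProvider extends FormatProvider {
  def readFormat(tableName: String): String = "hive"
}

object FormatProvider {
  def from(conf: Map[String, String]): FormatProvider =
    conf.get("spark.chronon.table.format_provider") match {
      // Users on newer stacks (e.g. Delta 3.x / Spark 3.5) point this at
      // their own provider class, instantiated reflectively here
      case Some(className) =>
        Class.forName(className)
          .getDeclaredConstructor()
          .newInstance()
          .asInstanceOf[FormatProvider]
      case None => new DefaultFormatProvider
    }
}
```

With no config set, the default provider is used, preserving today's behavior.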

Notes on testing:

  • TableUtilsTest is unchanged. It covers a bunch of Hive-related tests (like creating tables and inserting partitions) and some format-agnostic checks (like schema evolution).
  • Added a TableUtilsFormatTest - for now this covers Delta. Can add Iceberg in a follow-up (as Iceberg format support is currently untested in the project).
  • Unfortunately, the way we create Spark sessions in the Chronon project is not ideal. We kick off all our tests in parallel at the same time in CI (same JVM) and end up triggering code like SchemaEvolutionTest, where the SparkSession is created at class object construction time. This results in many tests vying to create sessions at the same time - most with the same Spark settings, but some, like the new delta format test, with different settings (using the delta catalog and the delta Spark session extension). Thanks to this, we're not able to run the delta tests in the same test JVM as the others, AND we need to ensure that the other tests' Spark sessions don't end up getting created with the wrong settings, as that would cause the delta test to fail.
  • The nicer way to handle this would be to refactor all our existing Spark tests to postpone session creation until the test body and handle session cleanup etc., but as this is a fairly involved / risky change, I've chosen to instead use an env var to have the Spark session builder choose the right delta / non-delta options. It's not the cleanest, but it does minimize the blast radius of these changes considerably.
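The env-var approach in the last bullet can be sketched as below. The env var name and the idea of returning a config map are assumptions for illustration; the extension and catalog class names are standard Delta Lake session settings:

```scala
// Sketch of the env-var toggle: the session builder picks delta-specific
// settings only when the format test env var is set. The env var name is
// an illustrative assumption; the extension / catalog classes are the
// standard Delta Lake session settings.
val FormatTestEnvVar = "CHRONON_FORMAT_TEST"

def formatConfigs(env: Map[String, String]): Map[String, String] =
  env.get(FormatTestEnvVar) match {
    case Some("deltalake") =>
      Map(
        "spark.sql.extensions" ->
          "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog" ->
          "org.apache.spark.sql.delta.catalog.DeltaCatalog"
      )
    case _ => Map.empty // default: existing Hive behavior, no extra settings
  }
```

The session builder would fold these entries into the SparkSession config before creating the session, so only the delta format test picks up the delta catalog and extension.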

Why / Goal

Allow folks keen on experimenting with Chronon that use delta lake to do so. Set the stage for supporting additional table formats in the future.

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested

@mickjermsurawong-openai was also able to pull these changes and test things out using delta lake tables. We were able to sort out issues with Delta 3.x (as they're on Spark 3.5) using the provider interface.

Checklist

  • Documentation update

Reviewers

sparkSession.sqlContext
.sql(s"SHOW PARTITIONS $tableName")
.collect()
.map(row => parseHivePartition(row.getString(0)))
Contributor
Curious whether the partition keys here are enforced by Chronon - i.e., I'm wondering if users can specify a partition key whose cardinality keeps increasing, such that we end up returning a very large number of entries here.

Collaborator (Author)
The partition keys chosen / used here are based on the partitionColumn that you configure in the table utils setup for your job(s). You are right that you could end up with a very large number of partitions loaded up in memory.
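For context, a helper like `parseHivePartition` in the snippet above plausibly splits a `SHOW PARTITIONS` row such as `ds=2024-11-19/hr=00` into key/value pairs. This is a hypothetical sketch, not Chronon's actual implementation:

```scala
// Hypothetical sketch of parseHivePartition: SHOW PARTITIONS returns rows
// like "ds=2024-11-19/hr=00"; each segment is split into a key/value pair.
def parseHivePartition(partitionSpec: String): Map[String, String] =
  partitionSpec
    .split("/")
    .map { segment =>
      segment.split("=", 2) match {
        case Array(key, value) => key -> value
        case _ =>
          throw new IllegalArgumentException(
            s"Unparseable partition spec: $partitionSpec")
      }
    }
    .toMap
```

Every parsed row is held in the collected array, which is why high-cardinality partition columns translate directly into driver memory pressure.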

@piyush-zlai piyush-zlai force-pushed the piyush--deltalake branch 2 times, most recently from 620a789 to df4f9cc Compare October 27, 2024 21:14
@piyush-zlai piyush-zlai changed the title [wip] Add support for Delta lake tables in TableUtils Add support for Delta lake tables in TableUtils Oct 28, 2024
@piyush-zlai
Collaborator (Author)

@nikhilsimha / @mears-stripe / @pengyu-hou - would you folks have cycles to take a look?
@mickjermsurawong-openai - feel free to give this a spin as well.

// allow us to override the format by specifying env vars. This allows us to not have to worry about interference
// between Spark sessions created in existing chronon tests that need the hive format and some specific tests
// that require a format override like delta lake.
val (formatConfigs, kryoRegistrator) = sys.env.get(FormatTestEnvVar) match {
Collaborator

Looks like this is only used for local testing. Could we move it within the if (local) block?

Collaborator (Author)

We could - I actually thought the code was cleaner with the match up front, as it lets us skip the use of vars and a lot of if/else-type checks in a couple of places.
Let me know if you feel strongly and I can move it there.

Collaborator

@mears-stripe mears-stripe left a comment

Excited for this change!


import scala.util.Try

class TableUtilsFormatTest {
Collaborator

@piyush-zlai Is this test specific to delta lake?

Is it reasonable to add other formats like Hive or Iceberg to the CI?

Collaborator (Author)

This is largely agnostic to the format. In the regular "sbt test" run we don't specify the format, so this suite runs against the default (Hive).
I'll circle back with Iceberg tests in a follow-up - for that I might need to wire up the Iceberg Spark extensions etc.

Collaborator

@pengyu-hou pengyu-hou left a comment

Thanks for the changes.
Looking forward to more changes from zipline-ai!

Thank you!

@caiocamatta-stripe caiocamatta-stripe merged commit b14f886 into airbnb:main Nov 19, 2024
8 checks passed
piyush-zlai added a commit to zipline-ai/chronon that referenced this pull request Dec 2, 2024
## Summary
Port of our OSS delta lake PR - airbnb/chronon#869. Largely the same aside from delta lake versions. We don't need this immediately, but we'll need it if other users come along that need delta lake (or if we need to add support for formats like Hudi).

## Checklist
- [X] Added Unit Tests
- [X] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added support for Delta Lake operations with new dependencies and
configurations.
- Introduced new traits and case objects for handling different table
formats, enhancing data management capabilities.
- Added a new job in the CI workflow for testing Delta Lake format
functionality.

- **Bug Fixes**
	- Improved error handling in class registration processes.

- **Tests**
- Implemented a suite of unit tests for the `TableUtils` class to
validate partitioned data insertions with schema modifications.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kumar-zlai pushed a commit to zipline-ai/chronon that referenced this pull request Apr 25, 2025
kumar-zlai pushed a commit to zipline-ai/chronon that referenced this pull request Apr 29, 2025
chewy-zlai pushed a commit to zipline-ai/chronon that referenced this pull request May 15, 2025
chewy-zlai pushed a commit to zipline-ai/chronon that referenced this pull request May 16, 2025
5 participants