Add support for Delta lake tables in TableUtils #869
Conversation
```scala
sparkSession.sqlContext
  .sql(s"SHOW PARTITIONS $tableName")
  .collect()
  .map(row => parseHivePartition(row.getString(0)))
```
Curious if the partition keys here are enforced by Chronon? i.e. I'm wondering if users can specify a partition key with increasing cardinality, and we end up returning a large number of entries here.
The partition keys chosen / used here are based on the partitionColumn that you configure in the table util setup for your job(s). You are right that you could have a very large number of partitions here loaded up in memory.
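The `parseHivePartition` helper referenced in the snippet above isn't shown in this diff. As a rough sketch (the name matches the snippet, but the body here is an assumption about what a Hive-style partition-spec parser typically does), it splits strings like `ds=2023-01-01/hr=00` into key/value pairs:

```scala
import java.net.URLDecoder

object PartitionParsing {
  // Parse a Hive-style partition spec string, e.g. "ds=2023-01-01/hr=00",
  // into an ordered list of (key, value) pairs. Values are URL-decoded
  // because Hive escapes special characters in partition directory names.
  def parseHivePartition(spec: String): Seq[(String, String)] =
    spec.split("/").toSeq.map { part =>
      part.split("=", 2) match {
        case Array(k, v) => k -> URLDecoder.decode(v, "UTF-8")
        case _ =>
          throw new IllegalArgumentException(s"Unparseable partition spec: $part")
      }
    }
}
```

Each `SHOW PARTITIONS` row yields one such spec string, so the map over collected rows above produces one parsed entry per partition - which is why cardinality matters for memory.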
@nikhilsimha / @mears-stripe / @pengyu-hou - would you folks have cycles to take a look?
```scala
// Allow us to override the format by specifying env vars. This allows us to not have to worry about interference
// between Spark sessions created in existing chronon tests that need the hive format and some specific tests
// that require a format override like delta lake.
val (formatConfigs, kryoRegistrator) = sys.env.get(FormatTestEnvVar) match {
```
Looks like this is only used for local testing. Could we move it within the `if (local)` block?
We could - I actually think the code is cleaner with the match up front - this allows us to skip the use of vars and a lot of if/else checks in a couple of places.
Let me know if you feel strongly and I can move it there.
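For context, here is a self-contained sketch of the match-up-front pattern being discussed. The env var name, registrator class names, and the exact branch values are illustrative assumptions, not necessarily the values in the PR; the two `spark.sql.*` keys are the standard Delta Lake session configs.

```scala
object FormatTestConfig {
  // Hypothetical env var name - the PR refers to it only as FormatTestEnvVar.
  val FormatTestEnvVar: String = "CHRONON_FORMAT_TEST"

  // Resolve format-specific Spark configs and the Kryo registrator class up
  // front, so session construction below needs no vars or if/else branching.
  def resolve(): (Map[String, String], String) =
    sys.env.get(FormatTestEnvVar) match {
      case Some("deltalake") =>
        (Map(
           "spark.sql.extensions" -> "io.delta.sql.DeltaSparkSessionExtension",
           "spark.sql.catalog.spark_catalog" -> "org.apache.spark.sql.delta.catalog.DeltaCatalog"
         ),
         "ai.chronon.spark.ChrononDeltaLakeKryoRegistrator") // assumed class name
      case _ =>
        // Default: no extra configs, plain Hive-oriented registrator.
        (Map.empty, "ai.chronon.spark.ChrononKryoRegistrator")
    }
}
```

Because the tuple is resolved once at the top, existing Hive-based test sessions and the Delta-specific ones share the same construction path.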
Excited for this change!
```scala
import scala.util.Try

class TableUtilsFormatTest {
```
@piyush-zlai Is this test specific to delta lake?
Is it reasonable to add other formats like Hive or Iceberg to the CI?
This is largely agnostic to the format. As part of the regular "sbt test" run, since we don't specify the format, this suite gets run against the default (Hive).
I'll circle back with Iceberg tests in a follow up - for that I might need to wire up the Iceberg Spark extensions etc.
## Summary
Port of our OSS delta lake PR - airbnb/chronon#869. Largely the same aside from delta lake versions. We don't need this immediately at the moment, but we'll need it if other users come along that need delta lake (or if we need to add support for formats like Hudi).

## Checklist
- [X] Added Unit Tests
- [X] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit
- **New Features**
  - Added support for Delta Lake operations with new dependencies and configurations.
  - Introduced new traits and case objects for handling different table formats, enhancing data management capabilities.
  - Added a new job in the CI workflow for testing Delta Lake format functionality.
- **Bug Fixes**
  - Improved error handling in class registration processes.
- **Tests**
  - Implemented a suite of unit tests for the `TableUtils` class to validate partitioned data insertions with schema modifications.
Summary
Add delta lake support in Chronon. This PR refactors TableUtils a bit to allow us to support Delta Lake (and other table formats in the future, like Hudi). The crux of the PR is the refactor of TableUtils - I've pulled out some of the format-specific logic into per-format objects (e.g. Hive / Iceberg / Delta) to cover aspects like creating tables and listing partitions.

As the versions of the various libraries we build against in Chronon are on the older side (e.g. delta lake 2.0.x for Spark 3.2), we've structured the code to allow users to override their format providers. This allows folks on, say, Delta 3.x (Spark 3.5) to write their own Delta3xFormatProvider built against the newer version and configure it via the spark.chronon.table.format_provider Spark config setting. Users on the same versions / rails as OSS can skip configuring this, and we'll default to the DefaultFormatProvider (which maintains the existing Iceberg / Hive behavior along with support for delta).

Notes on testing:
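To illustrate the provider indirection described above, here is a minimal sketch. The trait and object names echo those in the PR description, but the member names and bodies are illustrative assumptions, not the exact Chronon API:

```scala
// A Format captures per-table-format behavior (illustrative subset).
trait Format {
  def name: String
  // Clause appended to CREATE TABLE statements, e.g. "USING DELTA".
  def createTableTypeString: String
}

case object Hive extends Format {
  val name = "hive"
  val createTableTypeString = ""
}

case object Delta extends Format {
  val name = "delta"
  val createTableTypeString = "USING DELTA"
}

// A FormatProvider resolves the format to use for a given table.
trait FormatProvider {
  def readFormat(tableName: String): Format
}

// Default behavior when spark.chronon.table.format_provider is unset:
// fall back to the built-in resolution (simplified here to Hive).
object DefaultFormatProvider extends FormatProvider {
  def readFormat(tableName: String): Format = Hive
}

// A user on a newer Delta version could supply their own provider, built
// against their Delta release, and point the config at its class name.
object Delta3xFormatProvider extends FormatProvider {
  def readFormat(tableName: String): Format = Delta
}
```

With this shape, TableUtils only ever talks to the `Format` returned by the configured provider, so version-specific Delta code stays behind the user's own class.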
Why / Goal
Allow folks who use delta lake and are keen on experimenting with Chronon to do so. Set the stage for supporting additional table formats in the future.
Test Plan
@mickjermsurawong-openai was also able to pull these changes and test things out using delta lake tables. We were able to sort out issues with Delta 3.x (as they're on Spark 3.5) using the provider interface.
Checklist
Reviewers