Retrieve source uri prefix from hive partitioning options when building the GCS format #204
Conversation
Force-pushed from `10a3af3` to `24e613a` (compare).
```scala
val formatStr = table.getDefinition
  .asInstanceOf[ExternalTableDefinition]
  .getFormatOptions
  .asInstanceOf[FormatOptions]
  .getType

GCS(table.getTableId.getProject, uris.head, formatStr)
val uri = table.getDefinition
```
btw were you able to verify this works?
will know soon
hmm think maybe we should do a fallback here? If the table isn't partitioned could we throw or go to the uri list?
ah, good idea
LGTM thanks for fixing!
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (1)
- `cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala` (1 hunks)
🔇 Additional comments (1)
`cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala` (1)

`68-73`: Verify external table configurations in the codebase. Let's check for external table usage patterns.
✅ Verification successful
External table URI access is properly implemented
The code has appropriate type checking and is the only instance of external table URI access in the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```bash
#!/bin/bash
# Search for external table configurations and usages
rg -A 5 "ExternalTableDefinition|setHivePartitioningOptions|setSourceUris"
```

Length of output: 2592
Script:

```bash
#!/bin/bash
# Search for external table URI access patterns
rg -A 3 "getSourceUris|getSourceUriPrefix|setSourceUriPrefix|setSourceUris|ExternalTableDefinition.*getUri"
```

Length of output: 601
```scala
val uri = table.getDefinition
  .asInstanceOf[ExternalTableDefinition]
  .getHivePartitioningOptions
  .getSourceUriPrefix
```
**Add fallback for non-partitioned tables.**

Direct use of `getHivePartitioningOptions` will fail for non-partitioned tables. Add a fallback to the URI list:
```diff
- val uri = table.getDefinition
-   .asInstanceOf[ExternalTableDefinition]
-   .getHivePartitioningOptions
-   .getSourceUriPrefix
+ val externalTable = table.getDefinition.asInstanceOf[ExternalTableDefinition]
+ val uri = Option(externalTable.getHivePartitioningOptions)
+   .map(_.getSourceUriPrefix)
+   .getOrElse {
+     val uris = externalTable.getSourceUris
+     require(uris.size == 1, s"Expected exactly one source URI, got ${uris.size}")
+     uris.get(0).replaceAll("/\\*\\.parquet$", "")
+   }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
val externalTable = table.getDefinition.asInstanceOf[ExternalTableDefinition]
val uri = Option(externalTable.getHivePartitioningOptions)
  .map(_.getSourceUriPrefix)
  .getOrElse {
    val uris = externalTable.getSourceUris
    require(uris.size == 1, s"Expected exactly one source URI, got ${uris.size}")
    uris.get(0).replaceAll("/\\*\\.parquet$", "")
  }
```
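As a sanity check on the fallback branch, the glob-stripping regex behaves like this (paths invented for illustration):

```scala
// Hypothetical URI; the trailing "/*.parquet" glob is removed so the
// value points at the table's root folder rather than the leaf files.
"gs://bucket/warehouse/my_table/*.parquet".replaceAll("/\\*\\.parquet$", "")
// => "gs://bucket/warehouse/my_table"

// URIs without the glob suffix pass through unchanged:
"gs://bucket/warehouse/my_table".replaceAll("/\\*\\.parquet$", "")
// => "gs://bucket/warehouse/my_table"
```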
good bot
@david-zlai Thank you! Let me know if you need any further assistance! 😊
get me a coffee
@david-zlai I would if I could! But I'm just a code-reviewing rabbit. Perhaps I can help review more code while you grab that coffee? 🐰
```
(_/)
(>☕<)
/ >
```
```
ZIPLINE_GCP_DATAPROC_CLUSTER_NAME=canary-2
```
why this change?
```scala
.getOrElse {
  val uris = externalTable.getSourceUris
  require(uris.size == 1, s"External table ${tableName} can be backed by only one URI.")
  uris.get(0).replaceAll("/\\*\\.parquet$", "")
```
sounds good, thanks. Would be great to unit test this but not a big deal.
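For what it's worth, a unit test for the fallback could look roughly like this (a hedged sketch only: the spec name, the extracted `resolveUri` helper, and the ScalaTest + Mockito setup are assumptions, not code from this repo):

```scala
import java.util.Collections

import com.google.cloud.bigquery.{ExternalTableDefinition, HivePartitioningOptions}
import org.mockito.Mockito.{mock, when}
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class GcsUriResolutionSpec extends AnyFlatSpec with Matchers {

  // Mirrors the logic merged in this PR; in the real codebase the
  // equivalent lives inline in BigQueryFormat.scala.
  private def resolveUri(definition: ExternalTableDefinition): String =
    Option(definition.getHivePartitioningOptions)
      .map(_.getSourceUriPrefix)
      .getOrElse {
        val uris = definition.getSourceUris
        require(uris.size == 1, s"Expected exactly one source URI, got ${uris.size}")
        uris.get(0).replaceAll("/\\*\\.parquet$", "")
      }

  "resolveUri" should "prefer the hive partitioning source uri prefix" in {
    val definition = mock(classOf[ExternalTableDefinition])
    val opts = HivePartitioningOptions.newBuilder()
      .setSourceUriPrefix("gs://bucket/table")
      .build()
    when(definition.getHivePartitioningOptions).thenReturn(opts)

    resolveUri(definition) shouldBe "gs://bucket/table"
  }

  it should "fall back to the single source uri, stripping the parquet glob" in {
    // getHivePartitioningOptions is left unstubbed, so the mock returns null.
    val definition = mock(classOf[ExternalTableDefinition])
    when(definition.getSourceUris)
      .thenReturn(Collections.singletonList("gs://bucket/table/*.parquet"))

    resolveUri(definition) shouldBe "gs://bucket/table"
  }
}
```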
LGTM
Retrieve source uri prefix from hive partitioning options when building the GCS format (#204)

## Summary

Context: Previously, when using the source uri (not via hive partitioning options), we would run into Spark load problems when given a GCS uri that included a suffix of `/*.parquet`. See image below.

[image]

Instead, we're going to use the `Hive Partitioning Source URI Prefix`. We don't have to strip the suffix for this one, since this uri value should be the GCS root folder for the hive table.

## Checklist

- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **Improvements**
  - Enhanced BigQuery external table handling by simplifying the source URI retrieval process.
- **Bug Fixes**
  - Corrected string formatting in the print statement for clarity.
  - Adjusted indentation for consistency in code formatting.
- **Chores**
  - Updated environment variable `ZIPLINE_GCP_DATAPROC_CLUSTER_NAME` from `zipline-canary-cluster` to `canary-2`.
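To make the motivation concrete, here is a hedged sketch (bucket and table paths invented, and the Spark behavior stated as generally understood, not verified against this repo):

```scala
import org.apache.spark.sql.SparkSession

// Assumed setup, for illustration only.
val spark = SparkSession.builder().appName("gcs-prefix-demo").getOrCreate()

// Reading from the table's root folder lets Spark discover hive-style
// partition directories (e.g. .../ds=2025-01-01/) as partition columns:
val df = spark.read.parquet("gs://my-bucket/warehouse/my_table")

// The old glob form pointed Spark directly at the leaf files, which is
// where the load problems described in the summary showed up:
// spark.read.parquet("gs://my-bucket/warehouse/my_table/*.parquet")
```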