
Retrieve source uri prefix from hive partitioning options when building the GCS format #204


Merged — 8 commits merged into main from davidhan/fix_uri_suffix on Jan 14, 2025

Conversation

@david-zlai (Contributor) commented Jan 11, 2025

Summary

(See the PR title.)

Context: previously, when building the GCS format from the table's source URI (rather than from the Hive partitioning options), we would run into Spark load problems whenever the GCS URI handed to our code included a `/*.parquet` suffix. See the screenshot below.

[Screenshot: Spark load failure on a GCS URI ending in /*.parquet]

Instead, we're going to use the Hive Partitioning Source URI Prefix. We don't have to strip the suffix in that case, since this URI value should already be the GCS root folder for the Hive table.
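Conceptually, the change looks like the sketch below — a minimal sketch using the BigQuery Java client, with a hypothetical `rootUri` helper (not the PR's exact code); the fallback branch mirrors the suggestion adopted later in this review:

```scala
import com.google.cloud.bigquery.{BigQueryOptions, ExternalTableDefinition, TableId}

object SourceUriResolver {
  // Resolve the root GCS URI for an external table, preferring the
  // Hive partitioning source URI prefix when the table is partitioned.
  def rootUri(project: String, dataset: String, table: String): String = {
    val bq = BigQueryOptions.getDefaultInstance.getService
    val externalTable = bq
      .getTable(TableId.of(project, dataset, table))
      .getDefinition
      .asInstanceOf[ExternalTableDefinition]

    Option(externalTable.getHivePartitioningOptions)
      .map(_.getSourceUriPrefix) // already the GCS root folder; no suffix stripping needed
      .getOrElse {
        // Non-partitioned fallback: expect a single source URI and strip
        // the problematic `/*.parquet` suffix.
        val uris = externalTable.getSourceUris
        require(uris.size == 1, s"Expected exactly one source URI, got ${uris.size}")
        uris.get(0).replaceAll("/\\*\\.parquet$", "")
      }
  }
}
```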

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • Improvements
    • Enhanced BigQuery external table handling by simplifying source URI retrieval process.
  • Bug Fixes
    • Corrected string formatting in the print statement for clarity.
    • Adjusted indentation for consistency in code formatting.
  • Chores
    • Updated environment variable ZIPLINE_GCP_DATAPROC_CLUSTER_NAME from zipline-canary-cluster to canary-2.

coderabbitai (bot) commented Jan 11, 2025

Walkthrough

The pull request modifies the BigQueryFormat.scala file, specifically updating the logic in the GcpFormatProvider class for handling external tables. The changes prioritize retrieving a single source URI prefix from the getHivePartitioningOptions.getSourceUriPrefix method, allowing for more flexible URI handling. If no prefix is available, the original logic for asserting a single URI from the getSourceUris method is retained. Additionally, minor corrections were made in run.py and an environment variable was updated in chronon_dataproc_submitter.env.

Changes

| File | Change Summary |
| --- | --- |
| `cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala` | Updated source URI retrieval logic in the `format` method; replaced the assertion with `require` for the URI length check |
| `api/py/ai/chronon/repo/run.py` | Corrected string formatting in `download_only_once`; adjusted indentation in `generate_dataproc_submitter_args` |
| `chronon_dataproc_submitter.env` | Updated `ZIPLINE_GCP_DATAPROC_CLUSTER_NAME` from `zipline-canary-cluster` to `canary-2` |


Suggested Reviewers

  • nikhil-zlai
  • piyush-zlai
  • tchow-zlai

Poem

In clouds of data, swift and bright,
BigQuery's format takes its flight 🚀
One URI, no list in sight,
Simplicity brings pure delight!
Code refactored, clean and tight 📊




@david-zlai force-pushed the davidhan/fix_uri_suffix branch from 10a3af3 to 24e613a on January 13, 2025 19:20
Diff context in `BigQueryFormat.scala`:

```scala
val formatStr = table.getDefinition
  .asInstanceOf[ExternalTableDefinition]
  .getFormatOptions
  .asInstanceOf[FormatOptions]
  .getType

GCS(table.getTableId.getProject, uris.head, formatStr)
val uri = table.getDefinition
```
Collaborator:

btw were you able to verify this works?

david-zlai (Author):

will know soon

tchow-zlai (Collaborator) commented Jan 13, 2025:

hmm think maybe we should do a fallback here? If the table isn't partitioned could we throw or go to the uri list?

david-zlai (Author):

ah, good idea

tchow-zlai (Collaborator) left a comment:

LGTM thanks for fixing!

@david-zlai changed the title from "Fix uri suffix removal for when external table has a uri like /*.parquet" to "Retrieve source uri prefix from hive partitioning options when building the Format" on Jan 13, 2025
@david-zlai changed the title from "Retrieve source uri prefix from hive partitioning options when building the Format" to "Retrieve source uri prefix from hive partitioning options when building the GCS format" on Jan 13, 2025
@david-zlai marked this pull request as ready for review on January 13, 2025 19:40
coderabbitai (bot) left a comment:

Actionable comments posted: 1

🔇 Additional comments (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala (1)

68-73: Verify external table configurations in the codebase.

Let's check for external table usage patterns.

✅ Verification successful

External table URI access is properly implemented

The code has appropriate type checking and is the only instance of external table URI access in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Search for external table configurations and usages
rg -A 5 "ExternalTableDefinition|setHivePartitioningOptions|setSourceUris"
```

Length of output: 2592


Script:

```bash
#!/bin/bash
# Search for external table URI access patterns
rg -A 3 "getSourceUris|getSourceUriPrefix|setSourceUriPrefix|setSourceUris|ExternalTableDefinition.*getUri"
```

Length of output: 601

Comment on lines 68 to 72:

```scala
val uri = table.getDefinition
  .asInstanceOf[ExternalTableDefinition]
  .getHivePartitioningOptions
  .getSourceUriPrefix
```

coderabbitai (bot) commented Jan 13, 2025:

⚠️ Potential issue

Add fallback for non-partitioned tables.

Direct use of getHivePartitioningOptions will fail for non-partitioned tables.

Add fallback to URI list:

```diff
-          val uri = table.getDefinition
-            .asInstanceOf[ExternalTableDefinition]
-            .getHivePartitioningOptions
-            .getSourceUriPrefix
+          val externalTable = table.getDefinition.asInstanceOf[ExternalTableDefinition]
+          val uri = Option(externalTable.getHivePartitioningOptions)
+            .map(_.getSourceUriPrefix)
+            .getOrElse {
+              val uris = externalTable.getSourceUris
+              require(uris.size == 1, s"Expected exactly one source URI, got ${uris.size}")
+              uris.get(0).replaceAll("/\\*\\.parquet$", "")
+            }
```

david-zlai (Author):

good bot

coderabbitai (bot):

@david-zlai Thank you! Let me know if you need any further assistance! 😊

david-zlai (Author):

get me a coffee

coderabbitai (bot):

@david-zlai I would if I could! But I'm just a code-reviewing rabbit. Perhaps I can help review more code while you grab that coffee? 🐰

(_/)
(>☕<)
/ >

Diff context in `chronon_dataproc_submitter.env`:

```
ZIPLINE_GCP_DATAPROC_CLUSTER_NAME=canary-2
```
Collaborator:

why this change?

Diff context in `BigQueryFormat.scala`:

```scala
.getOrElse {
  val uris = externalTable.getSourceUris
  require(uris.size == 1, s"External table ${tableName} can be backed by only one URI.")
  uris.get(0).replaceAll("/\\*\\.parquet$", "")
}
```
tchow-zlai (Collaborator) commented Jan 13, 2025:

sounds good, thanks. Would be great to unit test this but not a big deal.
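
For reference, a minimal ScalaTest sketch of such a test — the `stripParquetSuffix` helper is hypothetical, factoring out the suffix-stripping step from the fallback above; this is not code from this PR:

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class SourceUriSpec extends AnyFlatSpec with Matchers {
  // Hypothetical helper mirroring the merged fallback's suffix stripping.
  private def stripParquetSuffix(uri: String): String =
    uri.replaceAll("/\\*\\.parquet$", "")

  "stripParquetSuffix" should "strip a trailing /*.parquet glob" in {
    stripParquetSuffix("gs://bucket/table/*.parquet") shouldBe "gs://bucket/table"
  }

  it should "leave a plain root folder URI untouched" in {
    stripParquetSuffix("gs://bucket/table") shouldBe "gs://bucket/table"
  }
}
```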

tchow-zlai (Collaborator) left a comment:

LGTM

@david-zlai david-zlai merged commit b94731a into main Jan 14, 2025
6 checks passed
@david-zlai david-zlai deleted the davidhan/fix_uri_suffix branch January 14, 2025 19:12
kumar-zlai pushed a commit that referenced this pull request on Apr 25, 2025 — "…ng the GCS format (#204)". The commit message duplicates the PR description above.
kumar-zlai pushed a commit that referenced this pull request on Apr 29, 2025 — "…ng the GCS format (#204)". The commit message duplicates the PR description above.
chewy-zlai pushed a commit that referenced this pull request on May 15, 2025 — "…ng the GCS format (#204)". The commit message duplicates the PR description above.
chewy-zlai pushed a commit that referenced this pull request on May 16, 2025 — "…ng the GCS format (#204)". The commit message duplicates the PR description above.