WIP -- Allow setting partition column in Query and StagingQuery #366
Walkthrough

This update introduces new optional fields to the API's data models and refactors the codebase to use a unified default partition column. References to the old partition column have been replaced with the new default variant across multiple modules, including Spark core, utilities, and tests. Method signatures and parameter types have been updated to standardize partition handling and improve table metadata management via TableInfo.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant C as Client
    participant QB as Query Builder
    participant T as Thrift API
    C->>QB: Create Query (incl. partitionColumn)
    QB->>T: Call Query.apply(..., partitionColumn)
    T->>T: Build Query struct with partitionColumn field
    T-->>C: Return Query object
```

```mermaid
sequenceDiagram
    participant JP as Join Process
    participant TU as TableUtils
    participant DF as DataFrame
    JP->>TU: Request defaultPartitionColumn
    TU-->>JP: Return partition value
    JP->>DF: Apply partition filter in join & validation
```
Actionable comments posted: 0
🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)
842-845: Add input validation for table name.

Consider validating the table name parameter to prevent empty or malformed values.

```diff
 case class TableInfo(table: String, partitionColumn: Option[String] = None)(implicit tableUtils: TableUtils) {
+  require(table != null && table.trim.nonEmpty, "Table name cannot be null or empty")
   def getPartitionColumn: String = partitionColumn.getOrElse(tableUtils.defaultPartitionColumn)
 }
```
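As a runnable sketch of how the suggested guard and the fallback behave together — the `TableUtils` class below is a minimal hypothetical stand-in carrying only `defaultPartitionColumn`, not the real Spark-backed class:

```scala
// Hypothetical minimal stand-in for the real TableUtils; it models only the
// defaultPartitionColumn fallback that TableInfo relies on.
class TableUtils(val defaultPartitionColumn: String)

case class TableInfo(table: String, partitionColumn: Option[String] = None)(implicit tableUtils: TableUtils) {
  // Suggested guard: reject null/blank table names at construction time.
  require(table != null && table.trim.nonEmpty, "Table name cannot be null or empty")

  // An explicit column wins; otherwise fall back to the session-wide default.
  def getPartitionColumn: String = partitionColumn.getOrElse(tableUtils.defaultPartitionColumn)
}

object TableInfoSketch extends App {
  implicit val tableUtils: TableUtils = new TableUtils("ds")

  println(TableInfo("db.events").getPartitionColumn)                   // prints "ds"
  println(TableInfo("db.events", Some("date_key")).getPartitionColumn) // prints "date_key"

  // A blank table name now fails fast instead of surfacing later inside Spark.
  try TableInfo("  ")
  catch { case e: IllegalArgumentException => println(e.getMessage) }
}
```

With the guard in place, a malformed table name raises an `IllegalArgumentException` at construction rather than producing a confusing metastore error downstream.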
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (3)
- spark/src/main/scala/ai/chronon/spark/GroupBy.scala (13 hunks)
- spark/src/main/scala/ai/chronon/spark/LogFlattenerJob.scala (3 hunks)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala (15 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- spark/src/main/scala/ai/chronon/spark/GroupBy.scala
- spark/src/main/scala/ai/chronon/spark/LogFlattenerJob.scala
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: spark_tests
- GitHub Check: non_spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (2)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2)
69-70: LGTM: Descriptive variable name.

The rename from `partitionColumn` to `defaultPartitionColumn` better reflects its role as a fallback value.

135-138: LGTM: Consistent refactoring to use TableInfo.

Methods have been systematically updated to use TableInfo, improving encapsulation of table-related data.
Also applies to: 140-144, 250-253, 255-284, 286-298
Actionable comments posted: 0
🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)
842-845: Add validation for table name.

Consider validating the table name parameter to ensure it's not empty or malformed.

```diff
 case class TableInfo(table: String, partitionColumn: Option[String] = None)(implicit tableUtils: TableUtils) {
+  require(table != null && table.trim.nonEmpty, "Table name cannot be null or empty")
   def getPartitionColumn: String = partitionColumn.getOrElse(tableUtils.defaultPartitionColumn)
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (13)
- api/src/main/scala/ai/chronon/api/Builders.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/Analyzer.scala (9 hunks)
- spark/src/main/scala/ai/chronon/spark/BootstrapInfo.scala (5 hunks)
- spark/src/main/scala/ai/chronon/spark/Extensions.scala (6 hunks)
- spark/src/main/scala/ai/chronon/spark/FastHashing.scala (1 hunk)
- spark/src/main/scala/ai/chronon/spark/GroupBy.scala (15 hunks)
- spark/src/main/scala/ai/chronon/spark/Join.scala (5 hunks)
- spark/src/main/scala/ai/chronon/spark/JoinBase.scala (9 hunks)
- spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (4 hunks)
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (6 hunks)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala (15 hunks)
- spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (4 hunks)
- spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala (5 hunks)
✅ Files skipped from review due to trivial changes (1)
- spark/src/main/scala/ai/chronon/spark/FastHashing.scala
🚧 Files skipped from review as they are similar to previous changes (8)
- spark/src/main/scala/ai/chronon/spark/BootstrapInfo.scala
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala
- spark/src/main/scala/ai/chronon/spark/GroupBy.scala
- spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala
- spark/src/main/scala/ai/chronon/spark/Join.scala
- spark/src/main/scala/ai/chronon/spark/Analyzer.scala
- spark/src/main/scala/ai/chronon/spark/JoinBase.scala
- spark/src/main/scala/ai/chronon/spark/Extensions.scala
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: non_spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: spark_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (12)
api/src/main/scala/ai/chronon/api/Builders.scala (2)
49-50: LGTM! Method signature maintains backward compatibility.

The addition of `partitionColumn` with a default value of `null` ensures existing code continues to work.

63-63: LGTM! Consistent implementation pattern.

The `setPartitionColumn` call follows the established pattern for optional parameters.

spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala (3)

58-58: LGTM! Simplified class inheritance.

The removal of the `TaggedFilterSuite` trait simplifies the test class structure.

177-177: LGTM! Added partition column support.

The changes correctly implement partition column support in the test case:
- Added a `partitionColumn` parameter to `DataFrameGen.events`
- Added a `partitionColumn` field in the query configuration

Also applies to: 189-190

434-555: LGTM! Added comprehensive test for partition column functionality.

The new test case thoroughly validates the partition column feature by:
- Testing partition column specification in data generation
- Verifying partition column usage in queries
- Validating join operations with custom partition columns
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (3)
69-70: LGTM: Clear naming for default partition column.

The rename from `partitionColumn` to `defaultPartitionColumn` better reflects its role as a fallback value.

135-138: LGTM: Clean overloading pattern for isPartitioned.

The refactoring maintains backward compatibility while introducing the new TableInfo parameter.

Also applies to: 140-144

188-191: LGTM: Consistent refactoring of partitions method.

The method properly handles both string-based and TableInfo-based calls, with a clear delegation pattern.

Also applies to: 193-209
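The delegation pattern praised above — a legacy string overload forwarding to a `TableInfo` variant — can be sketched as follows. This is illustrative only: the names mirror `TableUtils`, but the partition listing is canned data rather than a real metastore call.

```scala
// Illustrative-only sketch of the string/TableInfo overloading pattern.
class TableUtilsSketch(val defaultPartitionColumn: String) {

  case class TableInfo(table: String, partitionColumn: Option[String] = None) {
    def getPartitionColumn: String = partitionColumn.getOrElse(defaultPartitionColumn)
  }

  // The legacy string-based entry point delegates to the TableInfo overload,
  // so both call styles share a single implementation.
  def partitions(tableName: String): Seq[String] = partitions(TableInfo(tableName))

  def partitions(info: TableInfo): Seq[String] =
    Seq(s"${info.getPartitionColumn}=2024-01-01") // placeholder partition values
}

object DelegationSketch extends App {
  val tu = new TableUtilsSketch("ds")
  println(tu.partitions("db.events"))                                   // uses default "ds"
  println(tu.partitions(tu.TableInfo("db.events", Some("date_key"))))   // uses override
}
```

Keeping the string overload as a thin wrapper means existing call sites compile unchanged while all partition-column logic lives in one place.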
spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (4)
84-84: LGTM! Using query's effective partition column.

The change improves flexibility by using the query's partition column configuration.

118-118: LGTM! Using tableInfo for metadata access.

A more consistent approach to accessing table metadata.

241-241: LGTM! Using default partition column.

Standardizes partition column usage across the codebase.
282-283: ✅ Verification successful

LGTM! Updated partition column assertions.

Maintains consistency in partition column validation.

🏁 Script executed:

```shell
#!/bin/bash
# Verify that all partition column references use defaultPartitionColumn
rg -A 2 "partitionColumn" --type scala
```

Length of output: 40522

Partition column assertions verified and consistent.

The changes in `JoinUtils.scala` (lines 282-283) now correctly enforce that the partition keys include both `tableUtils.defaultPartitionColumn` and `Constants.LabelPartitionColumn`, aligning well with tests and similar patterns across the repository.
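The "effective partition column" precedence these comments rely on — a query-level override winning over the table-wide default — can be sketched as below. The `Query` case class is a hypothetical stand-in for the Thrift struct, modeling only the new optional field.

```scala
// Hypothetical stand-in for the Thrift Query struct; only the new optional
// partitionColumn field is modeled here.
case class Query(partitionColumn: Option[String] = None)

object EffectivePartition {
  // A query-level override takes precedence over the table-wide default.
  def effectivePartitionColumn(query: Query, defaultPartitionColumn: String): String =
    query.partitionColumn.getOrElse(defaultPartitionColumn)

  def main(args: Array[String]): Unit = {
    println(effectivePartitionColumn(Query(), "ds"))                 // prints "ds"
    println(effectivePartitionColumn(Query(Some("date_key")), "ds")) // prints "date_key"
  }
}
```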
Actionable comments posted: 1
🧹 Nitpick comments (1)
spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala (1)
129-130: Consider documenting the rationale for reducing entity count.

Reducing entities from 1000 to 300/50 might affect test coverage. If this is for performance, consider adding a comment explaining the decision.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (3)
- .bazelignore (1 hunk)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala (13 hunks)
- spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala (27 hunks)
✅ Files skipped from review due to trivial changes (1)
- .bazelignore
🚧 Files skipped from review as they are similar to previous changes (1)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala
🔇 Additional comments (2)
spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala (2)
177-178: LGTM! Explicit partition column improves clarity.

The addition of an explicit `partitionColumn` parameter makes the test setup clearer and aligns with the PR objectives.

434-555: LGTM! Well-structured test for partition column feature.

The new test case thoroughly verifies the partition column functionality with appropriate assertions and error checks.
```diff
@@ -55,7 +55,7 @@ object TestRow {
 }

 // Run as follows: sbt "spark/testOnly -- -n jointest"
-class JoinTest extends AnyFlatSpec with TaggedFilterSuite {
+class JoinTest extends AnyFlatSpec {
```
💡 Verification agent
🧩 Analysis chain
Consider keeping TaggedFilterSuite.
Removing TaggedFilterSuite might affect the ability to filter and run specific tests using tags.
🏁 Script executed:
```shell
#!/bin/bash
# Check if TaggedFilterSuite is used in other test files
rg "extends.*TaggedFilterSuite" --type scala
```
Length of output: 485
TaggedFilterSuite Usage Inconsistency
- JoinTest removed TaggedFilterSuite, yet other tests (CatalystUtilHiveUDFTest, FetcherTest, MutationsTest) still use it.
- Reintroducing TaggedFilterSuite in JoinTest is recommended to keep test filtering consistent.
Summary
Allows setting an explicit partition column in Query and in StagingQuery.
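A hedged sketch of what the new call site might look like. The mini `Builders` object below mirrors the null-default pattern quoted in the review excerpts; it is a self-contained stand-in, not the real Chronon API.

```scala
// Illustrative mini-builder: partitionColumn defaults to null so existing
// call sites compile unchanged, matching the pattern noted in the review.
case class Query(selects: Map[String, String] = null, partitionColumn: String = null)

object Builders {
  def Query(selects: Map[String, String] = null, partitionColumn: String = null): Query =
    new Query(selects, partitionColumn)
}

object QuerySketch extends App {
  // Existing call site: no partition column, behavior unchanged.
  val legacy = Builders.Query(selects = Map("amount" -> "amount"))
  println(legacy.partitionColumn) // prints "null"

  // New call site: an explicit partition column overrides the table default.
  val custom = Builders.Query(Map("amount" -> "amount"), partitionColumn = "date_key")
  println(custom.partitionColumn) // prints "date_key"
}
```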