[WIP] Drift metrics #59
Conversation
Walkthrough
The changes in this pull request introduce new methods and classes aimed at enhancing data processing and analysis, particularly focusing on JSON handling and drift analysis. Key modifications include the addition of pretty-printing JSON functionality in ThriftJsonCodec.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (7)
api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (1)
52-57: Consider optimizing the pretty-printing implementation.
While the implementation is functional, there are a few suggestions for improvement:
- The current implementation performs double conversion (string → JsonElement → pretty string). Consider creating a dedicated pretty-printing serializer to avoid this overhead.
- Add error handling for potential JSON parsing failures during the pretty-printing process.
Here's a suggested improvement:
```diff
 @transient private lazy val prettyGson = new GsonBuilder().setPrettyPrinting().create()
+@transient private lazy val prettySerializer: ThreadLocal[TSerializer] = new ThreadLocal[TSerializer] {
+  override def initialValue(): TSerializer = new TSerializer(new TSimpleJSONProtocol.Factory())
+}

 def toPrettyJsonStr[T <: TBase[_, _]: Manifest](obj: T): String = {
-  val raw = toJsonStr(obj)
-  val je = prettyGson.fromJson(raw, classOf[com.google.gson.JsonElement])
-  prettyGson.toJson(je)
+  try {
+    val raw = new String(prettySerializer.get().serialize(obj), Constants.UTF8)
+    val je = prettyGson.fromJson(raw, classOf[com.google.gson.JsonElement])
+    prettyGson.toJson(je)
+  } catch {
+    case e: Exception =>
+      logger.error(s"Failed to pretty-print object: ${e.getMessage}")
+      toJsonStr(obj) // Fallback to regular JSON string
+  }
 }
```

spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala (3)
Line range hint 39-41: Remove hardcoded normalization value in KS distance calculation.
The hardcoded value of 200.0 for normalization makes assumptions about the data range. This could lead to incorrect normalization if the actual data range differs.
Consider deriving the normalization factor from the actual data range:

```diff
-def kolmogorovSmirnovDistance(p: Seq[Double], q: Seq[Double]): Double = {
-  p.zip(q).map { case (pi, qi) => abs(pi - qi) }.max / 200.0 // Divide by range (200) to normalize
+def kolmogorovSmirnovDistance(p: Seq[Double], q: Seq[Double]): Double = {
+  val range = (p ++ q).max - (p ++ q).min
+  p.zip(q).map { case (pi, qi) => abs(pi - qi) }.max / range
 }
```
124-139: Improve documentation and variable naming in pattern matching.
While the pattern matching improves readability, consider these enhancements:

```diff
-      equalityHits match {
-        case eh if eh == 0 => interval / (percentiles(i) - percentiles(i - 1))
+      // Handle different cases of matching percentile values at break points
+      equalityHits match {
+        case 0 => // No exact matches at break point
+          interval / (percentiles(i) - percentiles(i - 1))
-        case eh if eh == 1 =>
+        case 1 => // Single exact match at break point
           if (i == percentiles.length) {
             interval / (percentiles(i - 1) - percentiles(i - 2))
           } else {
             interval / (percentiles(i) - percentiles(i - 1))
           }
-        case eh if eh > 1 =>
+        case hits if hits > 1 => // Multiple exact matches (point-mass)
           // point-mass case: when two or more continuous percentile points, pct(i), pct(i+1), pct(i+2)
           // have the same value as the break
-          (equalityHits - 1) * interval
+          (hits - 1) * interval
       }
```
Line range hint 71-82: Move test code from main method to proper unit tests.
The TODO comment indicates this code should be moved to unit tests. This would improve test organization and coverage.
Would you like me to help create proper unit tests for these drift metrics? I can:
- Generate test cases for each metric
- Include edge cases and boundary conditions
- Add proper assertions and test documentation
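As a rough starting point, a sketch along these lines could seed that work. It assumes the metric functions are exposed on the renamed DriftMetrics object with the Seq[Double] signatures used in this file; the class name, sample inputs, and tolerance are illustrative only, not the final test plan.

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class DriftMetricsTest extends AnyFlatSpec with Matchers {

  // Evenly spaced percentiles of a hypothetical feature, used as a baseline distribution.
  private val baseline: Seq[Double] = (0 to 20).map(_ * 10.0)

  "kolmogorovSmirnovDistance" should "be 0 for identical percentile arrays" in {
    DriftMetrics.kolmogorovSmirnovDistance(baseline, baseline) shouldBe 0.0 +- 1e-9
  }

  it should "be positive when one distribution is shifted" in {
    val shifted = baseline.map(_ + 50.0)
    DriftMetrics.kolmogorovSmirnovDistance(baseline, shifted) should be > 0.0
  }
}
```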
spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala (2)
81-81: LGTM! Consider some improvements for robustness.
The addition of warehouseId to prevent directory conflicts is good. However, consider these improvements:
- Sanitize the name parameter to prevent path issues
- Implement directory cleanup on session termination
- Consider using java.nio.file.Paths.get() for robust path construction

Here's a suggested improvement:

```diff
-      .config("spark.local.dir", s"/tmp/$userName/${name}_$warehouseId")
+      .config("spark.local.dir", Paths.get("/tmp", userName, s"${name.replaceAll("[^a-zA-Z0-9-_]", "_")}_$warehouseId").toString)
```

Also, consider adding a cleanup mechanism:

```scala
// Add to the build method before returning the spark session
spark.sparkContext.addSparkListener(new SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    val localDir = new File(s"/tmp/$userName/${name}_$warehouseId")
    if (localDir.exists()) {
      org.apache.commons.io.FileUtils.deleteDirectory(localDir)
    }
  }
})
```
Line range hint 28-28: Consider session-specific warehouse IDs.
Currently, warehouseId is defined at the object level, meaning all sessions created by this builder instance will share the same ID. Consider moving it to the build method for true session isolation.

```diff
-  private val warehouseId = java.util.UUID.randomUUID().toString.takeRight(6)

   def build(...): SparkSession = {
+    val sessionId = java.util.UUID.randomUUID().toString.takeRight(6)
     ...
-      .config("spark.local.dir", s"/tmp/$userName/${name}_$warehouseId")
+      .config("spark.local.dir", s"/tmp/$userName/${name}_$sessionId")
```

spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (1)
100-103: LGTM! Variable rename improves code clarity.
The rename from labelTable to labelDf better reflects that we're working with Spark DataFrames. This change enhances code readability by using more precise terminology.
Consider adding a brief comment above the labelDf declaration to document what this DataFrame represents and its schema structure, which would help future maintainers understand the data flow.
Also applies to: 122-122
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala (2 hunks)
🔇 Additional comments (5)
api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (2)
28-28
: LGTM: Import addition is appropriate
The addition of GsonBuilder import is well-placed and necessary for the new pretty-printing functionality.
52-57
: Verify the necessity of pretty-printing for drift metrics
Given that this PR is titled "Drift metrics", let's verify if this pretty-printing functionality is actually needed for the drift metrics implementation.
spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala (2)
Line range hint 1-7
: LGTM! Good package organization.
The move to a dedicated drift package and renaming to DriftMetrics improves code organization and better reflects the component's purpose.
111-112
: LGTM! Good input validation.
The added requirement check helps prevent invalid input data.
spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala (1)
81-81
: Verify the impact on concurrent Spark sessions.
Let's verify that this change properly handles concurrent Spark sessions and doesn't cause any directory conflicts.
✅ Verification successful
Let me gather more specific information about the temp directory handling.
Concurrent Spark sessions are properly handled with unique directory paths
The implementation safely handles concurrent sessions through:
- Unique directory paths using UUID for each warehouse: warehouseId = java.util.UUID.randomUUID().toString.takeRight(6)
- User-specific directories: /tmp/$userName/
- Session-specific subdirectories with unique names: ${name}_$warehouseId
These mechanisms ensure that:
- Each Spark session gets its own isolated local directory
- Different users have separate spaces
- Multiple sessions from the same user don't conflict
- Temporary files are properly segregated
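For illustration, the pieces above combine roughly as follows (a standalone sketch, not the builder's actual code; the job name is a placeholder):

```scala
// Mirrors the path scheme described above: per-user root plus a per-builder unique suffix.
val userName    = System.getProperty("user.name")
val warehouseId = java.util.UUID.randomUUID().toString.takeRight(6)
val name        = "example_job" // placeholder session name

val sparkLocalDir = s"/tmp/$userName/${name}_$warehouseId"
println(sparkLocalDir) // e.g. /tmp/alice/example_job_3fa8c1
```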
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential directory conflicts or issues with concurrent sessions
# Test 1: Look for any hardcoded spark.local.dir configurations that might conflict
rg -l 'spark\.local\.dir'
# Test 2: Check for any cleanup mechanisms or related SparkListener implementations
ast-grep --pattern 'class $_ extends SparkListener {
$$$
}'
# Test 3: Look for any existing temp directory cleanup logic
rg -l 'deleteDirectory|cleanUp|cleanup|delete.*Directory'
Length of output: 581
Script:
#!/bin/bash
# Check the actual implementation of spark.local.dir configuration and related cleanup
rg -A 10 'spark\.local\.dir' spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
# Check cleanup implementations in LocalTableExporter
rg -A 10 'deleteDirectory|cleanUp|cleanup|delete.*Directory' spark/src/main/scala/ai/chronon/spark/LocalTableExporter.scala
# Check if there are any other session management or cleanup related code
rg -A 5 'session|Session' spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
# Look for any concurrent session handling mechanisms
ast-grep --pattern 'def $name($$$) = {
$$$
SparkSession
$$$
}'
Length of output: 3774
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- .github/workflows/test_scala_and_python.yaml (2 hunks)
- api/src/main/scala/ai/chronon/api/ColorPrinter.scala (0 hunks)
💤 Files with no reviewable changes (1)
- api/src/main/scala/ai/chronon/api/ColorPrinter.scala
🧰 Additional context used
🪛 actionlint
.github/workflows/test_scala_and_python.yaml
52-52: label "large" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
71-71: label "large" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (1)
.github/workflows/test_scala_and_python.yaml (1)
Line range hint 52-90
: Verify resource requirements for Spark tests
Both modified jobs use significant JVM memory (-Xmx8G). Please ensure the selected runner has sufficient resources to handle these memory requirements.
Consider:
- Documenting the resource requirements in the workflow file
- Setting up monitoring to track memory usage during test runs
- Implementing test segmentation if memory usage is consistently high
Also applies to: 71-90
✅ Verification successful
The previous output showed consistent memory settings (-Xmx8G) across multiple test jobs. Let's verify the runner specifications and any available documentation about resource requirements before making a final determination.
Memory requirements aligned with runner specifications
The verification shows that jobs with high memory requirements (-Xmx8G) are appropriately configured to run on the large runner, while other jobs use ubuntu-latest. This indicates proper resource allocation awareness in the workflow design:
- Jobs with intensive Spark tests use the large runner
- Other test jobs use the standard ubuntu-latest runner
- Memory settings are consistently set to 8GB max heap across all jobs
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check memory settings across Spark test jobs
# Look for memory-related configurations in the workflow
rg -A 2 "SBT_OPTS.*Xmx" .github/workflows/test_scala_and_python.yaml
# Look for Spark memory configurations in Scala tests
rg "spark\..*memory" --type scala
Length of output: 983
Script:
#!/bin/bash
# Check the runner specifications in the workflow file
rg "runs-on:" -A 2 .github/workflows/test_scala_and_python.yaml
# Check if there are any resource-related configurations or documentation
rg -i "memory|resource|runner" .github/workflows/README.md || true
# Look for any GitHub Actions workflow documentation in the repository
fd README.md .github/
Length of output: 911
🧰 Tools
🪛 actionlint
71-71: label "large" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- .github/workflows/test_scala_and_python.yaml (2 hunks)
- api/src/main/scala/ai/chronon/api/ColorPrinter.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- api/src/main/scala/ai/chronon/api/ColorPrinter.scala
🧰 Additional context used
🪛 actionlint
.github/workflows/test_scala_and_python.yaml
52-52: label "large" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
71-71: label "large2" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (1)
.github/workflows/test_scala_and_python.yaml (1)
52-52
: Previous review comment is still applicable.
The issue with the invalid runner label 'large' has been previously identified and the suggested fix remains valid.
🧰 Tools
🪛 actionlint
52-52: label "large" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- .github/workflows/test_scala_and_python.yaml (2 hunks)
🧰 Additional context used
🪛 actionlint
.github/workflows/test_scala_and_python.yaml
52-52: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (1)
.github/workflows/test_scala_and_python.yaml (1)
67-67
: Verify memory requirements for Spark tests
The SBT memory settings have been significantly increased from 8G to 24G. While this aligns with the intention to use more resources, please verify:
- Whether these tests actually require such high memory
- If this could cause issues when running locally
- If all CI environments can support this memory allocation
✅ Verification successful
Based on the workflow file content, I can now provide a final response:
Memory settings are appropriately configured for different test suites
The increased memory (24G) is specifically allocated to the other_spark_tests job running on a dedicated high-resource runner (ubuntu-8_cores-32_gb). Other Spark test jobs (join, mutation, fetcher) use a lower 8G configuration on standard runners. This split configuration suggests:
- The main Spark test suite requires more resources and runs on appropriate infrastructure
- Other specialized test suites are optimized with lower memory requirements
- The CI environment explicitly supports these allocations with the high-resource runner
Local development isn't impacted as these settings are CI-specific. The configuration is well-balanced across different test categories based on their resource needs.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check memory usage patterns in Spark tests
# Look for memory-intensive test configurations
rg -A 5 "spark.executor.memory|spark.driver.memory|spark.memory" --type scala
# Find potential memory-intensive operations in test files
ast-grep --pattern 'class $_ extends $_ {
$$$
test($_, $_) {
$$$
collect($$$)
$$$
}
$$$
}'
Length of output: 192
Script:
#!/bin/bash
# Let's try a different approach to analyze memory usage patterns
# Check for large test data files or data generation in tests
rg -g "*/test/*" -i "generate.*data|large.*dataset" --type scala
# Look for test configurations and settings
fd "application.conf|reference.conf" --exec cat {}
# Check commit history for memory-related changes
git log -p --grep="memory\|heap\|OOM" .github/workflows/test_scala_and_python.yaml
# Look for spark session configurations in test files
rg "SparkSession|sparkContext" --type scala -A 5 -g "*/test/*"
Length of output: 1413
Script:
#!/bin/bash
# Let's check the workflow file history and current state
# Show the current workflow file content
cat .github/workflows/test_scala_and_python.yaml
# Check recent changes to the workflow file
git log -p .github/workflows/test_scala_and_python.yaml -n 3
# List all scala test files to understand the test suite size
fd ".scala" --type f --glob "**/test/**"
# Check for any OOM or memory-related issues in recent commits
git log --grep="OOM\|memory\|heap" -i --since="6 months ago"
Length of output: 5150
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (14)
online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala (5)
7-9
: Remove extra empty line.
There's an unnecessary extra empty line after the class declaration.

```diff
 class AssignIntervalsTest extends AnyFlatSpec with Matchers {
-
```
10-10
: Improve test description for clarity.
The current description "assign weights into intervals between breaks" is too vague. Consider making it more specific about what's being tested and the expected behavior.

```diff
-  "assignment" should "assign weights into intervals between breaks" in {
+  "AssignIntervals.on" should "correctly distribute weights between breakpoints including edge cases with repeated values" in {
```
11-13
: Improve test data readability.
The input arrays could be formatted more clearly to show the relationship between values and their expected grouping.

```diff
-    val ptiles = Array( 1, 4, 6,6,6, 8, 9 )
-    val breaks = Array(0, 1, 2, 3, 5, 6, 7, 8, 9, 10)
+    val ptiles = Array(
+      1,       // falls in first interval
+      4,       // falls in middle
+      6, 6, 6, // repeated values
+      8, 9     // upper range
+    )
+    val breaks = Array(
+      0, 1, 2, 3, 5, 6, 7, 8, 9, 10 // breakpoints defining intervals
+    )
```
14-14
: Remove commented-out code.
The unused interval variable should be removed if it's not needed.

```diff
-    //val interval = 0.25
```
10-23
: Add test documentation.
The test lacks documentation explaining the purpose and the mathematical reasoning behind the expected values. This would help maintainers understand the test case better.
Add a comment block explaining:
- The purpose of the test
- How the expected values are calculated
- The significance of edge cases (repeated values)
```diff
+  /**
+    * Tests the interval assignment logic with various edge cases:
+    * - Single value in an interval (1 in [0,1])
+    * - Multiple repeated values (6,6,6)
+    * - Values near interval boundaries
+    *
+    * Expected values represent the weighted distribution of points
+    * between consecutive breakpoints. For example:
+    * - 1/3 represents one point distributed across 3 possible intervals
+    * - 2.5 represents multiple points (the three 6's) distributed
+    *   between two intervals
+    */
   "assignment" should "assign weights into intervals between breaks" in {
```

online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala (4)
16-16
: Consider renaming the methodon
to a more descriptive nameThe method name
on
may not clearly convey its purpose. Renaming it to something more descriptive, such asassignIntervals
orcomputeIntervalAssignments
, would enhance readability and maintainability.
17-21
: Use more descriptive variable names for clarityVariables like
i
,bIndex
,bLen
, andpLen
could be renamed to more descriptive names to improve code readability. For example:
i
→ptileIndex
bIndex
→breakIndex
bLen
→breaksLength
pLen
→ptilesLength
Apply this diff to rename the variables:
-var i = 0 +var ptileIndex = 0 var prev = breaks.head -var bIndex = 1 +var breakIndex = 1 -var bLen = breaks.length +val breaksLength = breaks.length -var pLen = ptiles.length +val ptilesLength = ptiles.length val arr = Array.fill(bLen - 1)(0.0)Ensure to update all occurrences of these variables throughout the method.
16-57
: Refactor to use immutable variables and functional programming styleThe current implementation uses mutable variables (
var
). Refactoring to use immutable variables and functional constructs can improve code clarity and reduce side effects.For example, consider using recursive functions or higher-order functions like
foldLeft
to eliminate mutable state.
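For illustration of that pattern only (not a drop-in rewrite of AssignIntervals.on, whose exact semantics the existing tests define), a foldLeft threads the loop state explicitly instead of mutating vars:

```scala
// Sketch: a var-based running accumulation expressed with foldLeft.
// The accumulator carries (values so far, previous break) as an immutable pair.
val breaks = Array(0.0, 1.0, 2.0, 5.0)

val (widths, _) = breaks.tail.foldLeft((Vector.empty[Double], breaks.head)) {
  case ((acc, prev), b) => (acc :+ (b - prev), b)
}
// widths == Vector(1.0, 1.0, 3.0): interval widths between consecutive breaks
```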
16-57
: Add comprehensive unit tests to cover edge casesTo ensure robustness, add unit tests that cover various edge cases, such as:
- Empty
ptiles
orbreaks
arraysptiles
with consecutive equal valuesbreaks
outside the range ofptiles
- Single-element arrays
- Extremely large or small values
Would you like assistance in writing these unit tests?
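A rough sketch of what such edge-case tests could look like follows; it assumes the empty-input validation proposed elsewhere in this review (an IllegalArgumentException on empty arrays) is adopted, and the second expectation only pins down output length rather than exact values:

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import ai.chronon.online.stats.AssignIntervals

class AssignIntervalsEdgeCasesTest extends AnyFlatSpec with Matchers {

  "AssignIntervals.on" should "reject empty inputs" in {
    an[IllegalArgumentException] should be thrownBy
      AssignIntervals.on(ptiles = Array.empty[Double], breaks = Array(0.0, 1.0))
  }

  it should "produce one weight per interval between breaks" in {
    val breaks = Array(0.0, 1.0, 2.0, 3.0)
    val result = AssignIntervals.on(ptiles = Array(0.5, 1.5, 2.5), breaks = breaks)
    result should have length (breaks.length - 1)
  }
}
```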
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (5)
119-127
: Refactor to avoid mutable variables for better functional style.The use of mutable variables
i
andequalityHits
in thecomputePDF
method can be refactored to a more functional style, enhancing readability and aligning with Scala best practices.Consider refactoring using immutable data structures and functional constructs:
def computePDF(percentiles: Array[Double], breaks: Array[Double]): Array[Mass] = { val n = percentiles.length require(n > 2, "Need at least 3 percentiles to plot a distribution") val interval: Double = 1.0 / (n - 1) breaks.map { break => val indices = percentiles.indices.filter(i => percentiles(i) <= break) val equalityHits = indices.count(i => percentiles(i) == break) val i = indices.lastOption.getOrElse(0) val massValue = (i, equalityHits) match { case (0, _) => 0.0 // before range case (x, 0) if x >= n => 0.0 // after range case (_, e) if e > 1 => (e - 1) * interval // point mass case (x, 1) if x == n => interval / (percentiles(n - 1) - percentiles(n - 2)) // exactly at end case (x, _) => interval / (percentiles(x) - percentiles(x - 1)) // in between } Mass(massValue, equalityHits > 1) } }
42-47
: Enhance documentation forfunctionBuilder
method.The
functionBuilder
method is a higher-order function combining binning and distance functions. Adding scaladoc comments with usage examples would improve understandability.Consider adding documentation like:
/** * Constructs a distance computation function from a binning function and a distance function. * * @param binningFunc Function to bin the comparison data into distributions. * @param distanceFunc Function to compute the distance between distributions. * @tparam T Type of the data being compared. * @return A function that computes the distance given a Comparison[T]. */ def functionBuilder[T](binningFunc: Comparison[T] => Distributions, distanceFunc: Distributions => Double): Comparison[T] => Double = { c => val dists = binningFunc(c) distanceFunc(dists) }
40-41
: Clarify generic type constraints forComparison[T]
.The
Comparison[T]
case class uses a generic typeT
without constraints. Specifying bounds or adding type documentation would help prevent misuse.Consider specifying type requirements or documenting expected types.
case class Comparison[T](previous: T, current: T, timeDelta: Window)
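For example, a scaladoc block along these lines could make the expectations explicit (the field descriptions are inferred from the surrounding code and should be checked against the intended usage):

```scala
/** A pair of snapshots of the same summary, used to compute drift across a window.
  *
  * @param previous  the summary captured at the start of the window
  * @param current   the summary captured at the end of the window
  * @param timeDelta the window separating the two snapshots
  * @tparam T the summary representation being compared, e.g. an Array[Double] of
  *           percentiles or a categorical histogram (illustrative examples)
  */
case class Comparison[T](previous: T, current: T, timeDelta: Window)
```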
49-140
: Add input validation for empty or unequal length arrays.The distance functions assume that input arrays are non-empty and of equal length. Adding validation checks will prevent unexpected runtime errors.
Apply this diff to add validation:
def hellingerDistance(p: Array[Double], q: Array[Double]): Double = { + require(p.nonEmpty && q.nonEmpty, "Input arrays must be non-empty.") + require(p.length == q.length, s"Input arrays must have the same length: p.length=${p.length}, q.length=${q.length}") val (pProbs, qProbs) = computePDFs(p, q) // ... }Repeat similar checks for
populationStabilityIndex
andjensenShannonDivergence
.
6-141
: Implement unit tests for distance metric functions.Currently, there are no unit tests for the distance metric functions, which are critical for ensuring correctness.
Do you want assistance in creating unit tests for these functions? Proper unit tests will help in catching edge cases and validating the implementation.
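If useful, a first sketch could start from basic invariants — assuming the functions are exposed on the DistanceMetrics object in this package with Array[Double] inputs; the tolerance and sample arrays are illustrative:

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import ai.chronon.online.stats.DistanceMetrics._

class DistanceMetricsTest extends AnyFlatSpec with Matchers {

  // Strictly increasing percentile arrays describing two distributions.
  private val p = Array(0.0, 10.0, 20.0, 30.0, 40.0, 50.0)
  private val q = Array(5.0, 15.0, 25.0, 35.0, 45.0, 55.0)

  "jensenShannonDivergence" should "be ~0 when both inputs are the same distribution" in {
    jensenShannonDivergence(p, p) shouldBe 0.0 +- 1e-9
  }

  "hellingerDistance" should "be ~0 for identical inputs and non-negative otherwise" in {
    hellingerDistance(p, p) shouldBe 0.0 +- 1e-9
    hellingerDistance(p, q) should be >= 0.0
  }

  "populationStabilityIndex" should "be non-negative" in {
    populationStabilityIndex(p, q) should be >= 0.0
  }
}
```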
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala
(1 hunks)online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala
(1 hunks)online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala
(0 hunks)
💤 Files with no reviewable changes (1)
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala
🔇 Additional comments (1)
online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala (1)
1-6
: LGTM! Clean and minimal imports.
The package structure and imports are well-organized and contain only the necessary dependencies.
```scala
    expected.zip(result).foreach{
      case (e, r) => println(s"exp: $e res: $r")
        r shouldEqual e
    }
```
🛠️ Refactor suggestion
Remove debug print statements and improve assertion readability.
Production tests should not contain debug print statements. Also, the assertion could be more readable using ScalaTest's built-in collection matchers.
- expected.zip(result).foreach{
- case (e, r) => println(s"exp: $e res: $r")
- r shouldEqual e
- }
+ result should have length expected.length
+ result shouldBe expected
```scala
  def on(ptiles: Array[Double], breaks: Array[Double]): Array[Double] = {
    var i = 0
    var prev = breaks.head
    var bIndex = 1
    val bLen = breaks.length
    val pLen = ptiles.length
    val arr = Array.fill(bLen - 1)(0.0)
```
Add input validation for empty arrays
Currently, the method assumes that ptiles and breaks arrays are non-empty. Accessing breaks.head and breaks(0) without checks can lead to exceptions if breaks is empty. Similarly, operations on ptiles may fail if it's empty. Add input validation to handle empty arrays gracefully.
Apply this diff to add input validation:
def on(ptiles: Array[Double], breaks: Array[Double]): Array[Double] = {
+ if (breaks.isEmpty || ptiles.isEmpty) {
+ throw new IllegalArgumentException("Input arrays 'breaks' and 'ptiles' must not be empty")
+ }
var i = 0
var prev = breaks.head
var bIndex = 1
val bLen = breaks.length
val pLen = ptiles.length
val arr = Array.fill(bLen - 1)(0.0)
```scala
      // fraction of the equally spaced interval between p(i-1) to p(i) to assign to this break
      val fraction =
        if (i == 0) 0.0
        else if (ptiles(i) == ptiles(i - 1)) 1.0
        else (ptiles(i) - prev) / (ptiles(i) - ptiles(i - 1))

      result += fraction
      prev = ptiles(i)
      i += 1
    }

    // residual fraction to assign
    val fraction =
      if (i <= 0 || i >= pLen) 0.0
      else (b - prev) / (ptiles(i) - ptiles(i - 1))
```
Protect against potential division by zero in fractional calculations
In the calculations at lines 39 and 49, there's a risk of division by zero if ptiles(i) - ptiles(i - 1) equals zero due to floating-point precision errors. Even though you check for exact equality at line 38, using a small epsilon value for comparison accounts for floating-point inaccuracies.
Modify the comparisons to include an epsilon:
val fraction =
if (i == 0) 0.0
- else if (ptiles(i) == ptiles(i - 1)) 1.0
+ else if (math.abs(ptiles(i) - ptiles(i - 1)) < 1e-10) 1.0
else (ptiles(i) - prev) / (ptiles(i) - ptiles(i - 1))
Similarly, adjust the calculation at line 49:
val fraction =
if (i <= 0 || i >= pLen) 0.0
+ else if (math.abs(ptiles(i) - ptiles(i - 1)) < 1e-10) 1.0
else (b - prev) / (ptiles(i) - ptiles(i - 1))
```scala
  def klDivergence(p: Array[Double], q: Array[Double]): Double = {
    require(p.length == q.length, s"Inputs are of different length ${p.length}, ${q.length}")
    var i = 0
    var result = 0.0
    while (i < p.length) {
      val inc = if (p(i) > 0 && q(i) > 0) p(i) * math.log(p(i) / q(i)) else 0
      result += inc
      i += 1
    }
    result
  }
```
Correct the handling of zero probabilities in KL divergence calculation.
In the klDivergence
function, when q(i)
is zero and p(i)
is not zero, the divergence should approach infinity, as logarithm of zero is undefined. The current implementation sets the increment to zero in such cases, which is mathematically inaccurate.
Apply this diff to properly handle zero probabilities:
def klDivergence(p: Array[Double], q: Array[Double]): Double = {
require(p.length == q.length, s"Inputs are of different length ${p.length}, ${q.length}")
var i = 0
var result = 0.0
while (i < p.length) {
- val inc = if (p(i) > 0 && q(i) > 0) p(i) * math.log(p(i) / q(i)) else 0
+ val inc = (p(i), q(i)) match {
+ case (pi, qi) if pi > 0 && qi > 0 =>
+ pi * math.log(pi / qi)
+ case (pi, 0) if pi > 0 =>
+ Double.PositiveInfinity
+ case _ => 0.0
+ }
result += inc
i += 1
}
result
}
Alternatively, consider adding a small epsilon to q(i)
to prevent division by zero:
val epsilon = 1e-10
val adjustedQi = q(i) + epsilon
```scala
  // TODO move this to unit test
  def main(args: Array[String]): Unit = {
    val A = Array(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200).map(
      _.toDouble)
    val B = Array(5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 115, 115, 115, 135, 145, 155, 165, 1175, 1205, 1205, 1205).map(
      _.toDouble)

    val jsd = jensenShannonDivergence(A, B)
    println(f"The Jensen-Shannon Divergence between distributions A and B is: $jsd%.5f")
    val psi = populationStabilityIndex(A, B)
    println(f"The Population Stability Index between distributions A and B is: $psi%.5f")
    val hd = hellingerDistance(A, B)
    println(f"The Hellinger Distance between distributions A and B is: $hd%.5f")

    // format: off
    // aligned vertically for easier reasoning
    val ptiles = Array( 1, 4, 6,6,6, 8, 9 )
    val breaks = Array(0, 1, 2, 3, 5, 6, 7, 8, 9, 10)
    // format: on

    //val interval = 0.25
    val expected = Array(0.0, 1.0 / 3.0, 1.0 / 3.0, (1.0) / (3.0) + (1.0) / (2.0), (1.0) / (2.0), 2.5, 0.5, 1, 0)

    val result = AssignIntervals.on(ptiles = ptiles.map(_.toDouble), breaks = breaks.map(_.toDouble))

    expected.zip(result).foreach { case (e, r) => println(s"exp: $e res: $r") }

  }
```
🛠️ Refactor suggestion
Move test code from main
method to dedicated unit tests.
The main
method is being used for testing purposes, as indicated by the TODO comment on line 8. It's best practice to move test code into unit tests to enhance maintainability and allow automated testing frameworks to pick them up.
Apply this diff to remove the main
method and consider implementing unit tests instead:
- // TODO move this to unit test
- def main(args: Array[String]): Unit = {
- val A = Array(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200).map(
- _.toDouble)
- val B = Array(5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 115, 115, 115, 135, 145, 155, 165, 1175, 1205, 1205, 1205).map(
- _.toDouble)
-
- val jsd = jensenShannonDivergence(A, B)
- println(f"The Jensen-Shannon Divergence between distributions A and B is: $jsd%.5f")
- val psi = populationStabilityIndex(A, B)
- println(f"The Population Stability Index between distributions A and B is: $psi%.5f")
- val hd = hellingerDistance(A, B)
- println(f"The Hellinger Distance between distributions A and B is: $hd%.5f")
-
- // Additional testing code...
- }
Committable suggestion skipped: line range outside the PR's diff.
```scala
      def indexMass(i: Int): Double = interval / (percentiles(i) - percentiles(i - 1))
      val isPointMass = eh > 1
      val m = (i, eh) match {
        case (0, _)           => 0.0 // before range
        case (x, 0) if x >= n => 0.0 // after range
        case (_, e) if e > 1  => (e - 1) * interval // point mass
        case (x, 1) if x == n => indexMass(n - 1) // exactly at end of range
        case (x, _)           => indexMass(x) // somewhere in between
      }
```
Prevent division by zero in indexMass
function.
In indexMass
, there's a potential for division by zero when percentiles(i) == percentiles(i - 1)
. This would cause a ArithmeticException
.
Apply this diff to handle the edge case:
def indexMass(i: Int): Double = {
val denominator = percentiles(i) - percentiles(i - 1)
- interval / denominator
+ if (denominator != 0) interval / denominator else 0.0
}
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (1)
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (1)
74-101
: Optimize categorical distance calculation.
The current implementation iterates over both histograms separately. Consider using a more efficient approach with a single pass over the combined keyset.
Apply this diff:
```diff
 def categoricalDistance(a: Histogram, b: Histogram, metric: DriftMetric): Double = {
-  val aIt = a.entrySet().iterator()
   var result = 0.0
   val func = termFunc(metric)
-  while (aIt.hasNext) {
-    val entry = aIt.next()
-    val key = entry.getKey
-    val aVal = entry.getValue.toDouble
-    val bValOpt = b.get(key)
-    val bVal = if (bValOpt == null) 0.0 else bValOpt.toDouble
-    val term = func(aVal, bVal)
-    result += term
-  }
-
-  val bIt = b.entrySet().iterator()
-  while (bIt.hasNext) {
-    val entry = bIt.next()
-    val key = entry.getKey
-    val bVal = entry.getValue.toDouble
-    val aValOpt = a.get(key)
-    if (aValOpt == null) {
-      result += func(0.0, bVal)
-    }
-  }
+  // note: converting each java Set separately; requires import scala.jdk.CollectionConverters._
+  val allKeys = a.keySet().asScala ++ b.keySet().asScala
+  allKeys.foreach { key =>
+    val aVal = Option(a.get(key)).map(_.toDouble).getOrElse(0.0)
+    val bVal = Option(b.get(key)).map(_.toDouble).getOrElse(0.0)
+    result += func(aVal, bVal)
+  }
   result
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala
(1 hunks)
🔇 Additional comments (2)
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (2)
10-23
: Move test code to dedicated unit tests.
The existing TODO comment and test code in the main method should be moved to proper unit tests.
104-106
: Correct the handling of zero probabilities in KL divergence calculation.
The current implementation of klDivergenceTerm doesn't properly handle the case where q is zero and p is non-zero.
```scala
      val key = entry.getKey
      val aVal = entry.getValue.toDouble
      val bValOpt = b.get(key)
      val bVal = if (bValOpt == null) bValOpt.toDouble else 0.0
```
Fix incorrect bVal calculation logic.
The bVal calculation is incorrect. It's using the null value instead of the actual value.
Apply this diff:
- val bVal = if (bValOpt == null) bValOpt.toDouble else 0.0
+ val bVal = if (bValOpt == null) 0.0 else bValOpt.toDouble
```scala
  private def normalizeInplace(arr: Array[Double]): Array[Double] = {
    val sum = arr.sum
    var i = 0
    while (i < arr.length) {
      arr.update(i, arr(i) / sum)
      i += 1
    }
    arr
  }
```
Handle edge cases in normalizeInplace function.
The function has two potential issues:
- No handling of zero sum case which could lead to division by zero
- Modifies array in-place but this isn't clear from the function name
Apply this diff to improve the implementation:
- private def normalizeInplace(arr: Array[Double]): Array[Double] = {
+ private def normalizeArrayInplace(arr: Array[Double]): Array[Double] = {
val sum = arr.sum
+ if (sum == 0.0) return arr
var i = 0
while (i < arr.length) {
arr.update(i, arr(i) / sum)
i += 1
}
arr
}
```scala
  def percentileDistance(a: Array[Double], b: Array[Double], metric: DriftMetric): Double = {
    val breaks = (a ++ b).sorted.distinct
    val aProjected = AssignIntervals.on(a, breaks)
    val bProjected = AssignIntervals.on(b, breaks)

    val aNormalized = normalizeInplace(aProjected)
    val bNormalized = normalizeInplace(bProjected)

    val func = termFunc(metric)

    var i = 0
    var result = 0.0
    while (i < aNormalized.length) {
      result += func(aNormalized(i), bNormalized(i))
      i += 1
    }
    result
  }
```
Add input validation in percentileDistance.
The method should validate input arrays before processing.
Apply this diff:
def percentileDistance(a: Array[Double], b: Array[Double], metric: DriftMetric): Double = {
+ require(a != null && b != null, "Input arrays cannot be null")
+ require(a.nonEmpty && b.nonEmpty, "Input arrays cannot be empty")
val breaks = (a ++ b).sorted.distinct
val aProjected = AssignIntervals.on(a, breaks)
val bProjected = AssignIntervals.on(b, breaks)
```scala
  def distance(a: TileSummaries, b: TileSummaries, metric: DriftMetric): java.lang.Double = {
    require(a.isSetPercentiles == b.isSetPercentiles, "Percentiles should be either set or unset together")
    require(a.isSetHistogram == b.isSetHistogram, "Histograms should be either set or unset together")

    val isContinuous = a.isSetPercentiles && b.isSetPercentiles
    val isCategorical = a.isSetHistogram && b.isSetHistogram
    if (isContinuous)
      percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric)
    else if (isCategorical)
      categoricalDistance(a.getHistogram, b.getHistogram, metric)
    else
      null
  }
```
Replace null with Option in distance method.
Returning null is unsafe in Scala. Consider using Option instead.
Apply this diff:
- def distance(a: TileSummaries, b: TileSummaries, metric: DriftMetric): java.lang.Double = {
+ def distance(a: TileSummaries, b: TileSummaries, metric: DriftMetric): Option[Double] = {
require(a.isSetPercentiles == b.isSetPercentiles, "Percentiles should be either set or unset together")
require(a.isSetHistogram == b.isSetHistogram, "Histograms should be either set or unset together")
val isContinuous = a.isSetPercentiles && b.isSetPercentiles
val isCategorical = a.isSetHistogram && b.isSetHistogram
if (isContinuous)
- percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric)
+ Some(percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric))
else if (isCategorical)
- categoricalDistance(a.getHistogram, b.getHistogram, metric)
+ Some(categoricalDistance(a.getHistogram, b.getHistogram, metric))
else
- null
+ None
}
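If the Option-returning signature is adopted, call sites would then handle the missing case explicitly, roughly like this (the enclosing object of distance and the metric choice are placeholders):

```scala
// Hypothetical call site after the change.
def reportDrift(previous: TileSummaries, current: TileSummaries): Unit =
  distance(previous, current, DriftMetric.JENSEN_SHANNON) match {
    case Some(d) => println(f"drift = $d%.4f")
    case None    => println("summaries not comparable: neither percentiles nor histograms set on both")
  }
```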
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (4)
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (4)

235-240: Ensure the match in termFunc is exhaustive. The termFunc function does not handle unexpected DriftMetric values, which could lead to a MatchError if a new metric is introduced. Consider adding a default case to handle unsupported metrics.

Apply this diff to make the match exhaustive:
def termFunc(d: DriftMetric): (Double, Double) => Double = d match {
  case DriftMetric.PSI => psiTerm
  case DriftMetric.HELLINGER => hellingerTerm
  case DriftMetric.JENSEN_SHANNON => jsdTerm
+ case _ => throw new IllegalArgumentException(s"Unsupported DriftMetric: $d")
}

138-138: Manage debug code appropriately before merging. There are debug statements and comments indicating they should be removed before merging. Instead of removing them outright, consider using a proper logging framework with configurable log levels to manage debug output effectively.

Replace the println statements with logger calls:
+ import org.slf4j.LoggerFactory
object DistanceMetrics {
+ private val logger = LoggerFactory.getLogger(DistanceMetrics.getClass)
  // TODO move this to unit test
  def main(args: Array[String]): Unit = {
    // ...
    if (debug) {
-     println(f"""
+     logger.debug(f"""
        |aProjected : ${printArr(aProjected)}
        |bProjected : ${printArr(bProjected)}
        |aNormalized: ${printArr(aNormalized)}
        |bNormalized: ${printArr(bNormalized)}
        |deltas : ${printArr(deltas)}
        |result : $result%.4f
        |""".stripMargin)
    }
    // ...
  }
}
Ensure you have the appropriate logging library included in your project's dependencies.

Also applies to: 146-146, 153-164

19-22: Parameterize magic numbers for clarity in buildPercentiles. The hardcoded probability points 0.01 and 0.99 could be defined as constants or parameters to improve code readability and maintainability.

Apply this diff to define constants:
def buildPercentiles(mean: Double, variance: Double, breaks: Int = 20): Array[Double] = {
+ val MIN_PROB = 0.01
+ val MAX_PROB = 0.99
  val stdDev = math.sqrt(variance)
  // Create probability points from 0.01 to 0.99 instead of 0 to 1
  val probPoints = (0 to breaks).map { i =>
-   if (i == 0) 0.01 // p1 instead of p0
-   else if (i == breaks) 0.99 // p99 instead of p100
+   if (i == 0) MIN_PROB // p1 instead of p0
+   else if (i == breaks) MAX_PROB // p99 instead of p100
    else i.toDouble / breaks
  }.toArray
  // ...
}

11-12: Consider moving testing code to dedicated unit tests. The main method and compareDistributions function are being used for testing purposes. It's best practice to move such code into unit tests to enhance maintainability and allow automated testing frameworks to execute them. Would you like assistance in creating unit tests for these functions? A minimal skeleton is sketched right after this list of nitpicks.

Also applies to: 45-90
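To make the last nitpick concrete, here is a minimal ScalaTest sketch, not a drop-in test: it assumes compareDistributions keeps the signature shown later in this review (returning Map[DriftMetric, (Double, Double)]), that it lives on a DistanceMetrics object in ai.chronon.online.stats, and that DriftMetric comes from the generated ai.chronon.api package. Adjust names to the final API.

import ai.chronon.api.DriftMetric
import ai.chronon.online.stats.DistanceMetrics
import org.scalatest.funsuite.AnyFunSuite
import org.scalatest.matchers.should.Matchers

class DistanceMetricsSpec extends AnyFunSuite with Matchers {

  test("identical distributions produce near-zero drift") {
    // same mean and variance on both sides => percentile and histogram drift should vanish
    val drifts = DistanceMetrics.compareDistributions(meanA = 50.0, varianceA = 100.0,
                                                      meanB = 50.0, varianceB = 100.0)
    val (jsdPercentile, jsdHisto) = drifts(DriftMetric.JENSEN_SHANNON)
    jsdPercentile shouldBe 0.0 +- 1e-6
    jsdHisto shouldBe 0.0 +- 1e-6
  }

  test("a shifted mean produces more drift than identical distributions") {
    val same = DistanceMetrics.compareDistributions(100.0, 225.0, 100.0, 225.0)(DriftMetric.JENSEN_SHANNON)._1
    val shifted = DistanceMetrics.compareDistributions(100.0, 225.0, 110.0, 225.0)(DriftMetric.JENSEN_SHANNON)._1
    shifted should be > same
  }
}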
val aSum = sumValues(a)
val bSum = sumValues(b)
Prevent division by zero in the histogramDistance function

In the histogramDistance function, if aSum or bSum is zero, a division by zero will occur when normalizing histogram values. Add checks to ensure that aSum and bSum are not zero before performing divisions, to prevent potential runtime exceptions.

Apply this diff to handle zero sums:
val aSum = sumValues(a)
val bSum = sumValues(b)
+ if (aSum == 0.0 || bSum == 0.0) {
+ // Handle zero sum, e.g., return maximum distance or raise an exception
+ return Double.PositiveInfinity
+ }
val aIt = a.entrySet().iterator()
var result = 0.0
val func = termFunc(metric)
while (aIt.hasNext) {
// ...
}
val bIt = b.entrySet().iterator()
while (bIt.hasNext) {
// ...
}
Also applies to: 192-205
private def normalize(arr: Array[Double]): Array[Double] = {
  // TODO-OPTIMIZATION: normalize in place instead if this is a hotspot
  val result = Array.ofDim[Double](arr.length)
  val sum = arr.sum
  var i = 0
  while (i < arr.length) {
    result.update(i, arr(i) / sum)
    i += 1
  }
  result
}
Handle division by zero in the normalize function

In the normalize function, if the sum of the array elements is zero, a division by zero will occur when dividing each element by sum. Consider adding a check to handle the case when sum is zero, to avoid potential runtime exceptions.

Apply this diff to handle the zero-sum case:
def normalize(arr: Array[Double]): Array[Double] = {
// TODO-OPTIMIZATION: normalize in place instead if this is a hotspot
val result = Array.ofDim[Double](arr.length)
val sum = arr.sum
+ if (sum == 0.0) {
+ // Handle zero sum, e.g., return zeroed array or raise an exception
+ return result
+ }
var i = 0
while (i < arr.length) {
result.update(i, arr(i) / sum)
i += 1
}
result
}
Actionable comments posted: 7
🧹 Outside diff range and nitpick comments (8)
online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (3)
13-21
: Consider adding null checks for Java List conversions.The conversion methods could throw NullPointerException if the input List or any of its elements are null. Consider adding null checks or using Option for safer handling.
@inline private def toArray(l: java.util.List[java.lang.Double]): Array[Double] = { + if (l == null) return Array.empty[Double] l.iterator().toScala.map(_.toDouble).toArray } @inline private def toArray(l: java.util.List[java.lang.Integer]): Array[Int] = { + if (l == null) return Array.empty[Int] l.iterator().toScala.map(_.toInt).toArray }
23-24
: Consider enforcing temporal order through parameter naming.The comment indicates 'a is after, b is before', but this convention isn't enforced through parameter names. Consider renaming parameters to make the temporal relationship explicit.
-def between(a: TileSummaries, b: TileSummaries, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = { +def between(after: TileSummaries, before: TileSummaries, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = {
23-80
: Add documentation for the between method.The method would benefit from comprehensive ScalaDoc documentation explaining:
- Parameter descriptions and requirements
- Return value contents
- Calculation methodology
- Error conditions
/** Calculates drift metrics between two TileSummaries instances. * * @param a The "after" TileSummaries instance * @param b The "before" TileSummaries instance * @param metric The drift metric to use for calculations (defaults to JENSEN_SHANNON) * @return A TileDrift instance containing all calculated drift metrics * @throws IllegalArgumentException if input parameters are null or invalid */online/src/test/scala/ai/chronon/online/test/stats/DistanceMetricsTest.scala (2)
13-26
: Add documentation for the statistical approach.The
buildPercentiles
method uses a statistical approach to generate percentile points. Consider adding documentation to explain:
- Why 0.01 and 0.99 are used as boundary probabilities
- The statistical theory behind the normal distribution transformation
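For reference, the transformation those two points are asking to document can be written down compactly. The sketch below is illustrative rather than a proposed change, and assumes the inverseErf helper already defined in this test file:

// Quantile of a normal distribution N(mean, variance) at probability p:
//   q(p) = mean + sqrt(variance) * sqrt(2) * erfInv(2p - 1)
// q(0.0) and q(1.0) are -inf/+inf, which is why the endpoints are clamped to 0.01 and 0.99.
def normalQuantile(mean: Double, variance: Double, p: Double): Double = {
  require(p > 0.0 && p < 1.0, "p must lie strictly inside (0, 1)")
  mean + math.sqrt(variance) * math.sqrt(2.0) * inverseErf(2.0 * p - 1.0)
}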
28-37
: Consider improving the inverse error function implementation.The current implementation:
- Uses a magic number (0.147) without explanation
- Could benefit from a more robust implementation or using an established library
Consider:
- Adding documentation explaining the approximation method
- Using Apache Commons Math or other statistical libraries for a more robust implementation
- Adding validity checks for input range (-1 < x < 1)
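If the Apache Commons Math route is taken, the replacement could be as small as the sketch below. It assumes a commons-math3 dependency, which is not confirmed to be in build.sbt by anything shown in this review:

import org.apache.commons.math3.special.Erf

// Drop-in replacement for the hand-rolled approximation; erfInv is accurate to near
// machine precision and removes the unexplained 0.147 constant.
def inverseErf(x: Double): Double = {
  require(x > -1.0 && x < 1.0, "inverseErf is only defined on (-1, 1)")
  Erf.erfInv(x)
}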
api/thrift/api.thrift (2)
268-274
: Enhance struct documentation.The current comment could be more descriptive. Consider adding:
- Purpose and use cases of DriftParams
- Description of each field
- Example usage
Add documentation like this:
-// service takes this, along with column & slice and produces a series of tileDrift objects +/** + * Parameters for calculating drift metrics between two time periods. + * + * @field metric: The statistical method used to measure drift (see DriftMetric enum for thresholds) + * @field lookbacks: List of time windows to analyze drift over + * @field startMillis: Start timestamp in milliseconds for the analysis period + * @field endMillis: End timestamp in milliseconds for the analysis period + * + * Example: + * { + * metric: JENSEN_SHANNON, + * lookbacks: [{length: 7, timeUnit: DAYS}], + * startMillis: 1635724800000, // 2021-11-01 + * endMillis: 1638316800000 // 2021-12-01 + * } + */
276-287
: Add comprehensive documentation.The struct lacks documentation explaining the purpose and interpretation of each drift metric.
Add documentation like this:
+/** + * Represents various drift metrics computed between two TileSummaries. + * Each metric captures a different aspect of distribution change: + * + * Basic Distribution Metrics: + * @field percentileDrift: Drift in the distribution of values based on percentile comparison + * @field histogramDrift: Drift in the frequency distribution of values + * @field countDrift: Relative change in the total count of records + * @field nullRatioDrift: Change in the proportion of null values + * + * Container Type Metrics: + * @field innerCountDrift: Drift in the total number of entries within container types + * @field innerNullCountDrift: Drift in the number of null entries within container types + * @field lengthPercentilesDrift: Drift in the distribution of container lengths + * + * String Type Metrics: + * @field stringLengthPercentilesDrift: Drift in the distribution of string lengths + * + * Note: All drift values are between 0 and 1, where: + * - 0 indicates no drift (distributions are identical) + * - 1 indicates maximum drift (distributions are completely different) + */online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (1)
95-104
: Optimizenormalize
function by normalizing in place.The comment in the code suggests optimizing the
normalize
function by normalizing in place if it's a performance hotspot. Modifying the array in place can reduce memory usage and improve performance.Consider applying this change:
- private def normalize(arr: Array[Double]): Array[Double] = { - // TODO-OPTIMIZATION: normalize in place instead if this is a hotspot - val result = Array.ofDim[Double](arr.length) + private def normalizeInPlace(arr: Array[Double]): Array[Double] = { + val sum = arr.sum + if (sum == 0.0) return arr var i = 0 while (i < arr.length) { - result.update(i, arr(i) / sum) + arr.update(i, arr(i) / sum) i += 1 } - result + arr }Note: Ensure that modifying the input array in place does not have unintended side effects elsewhere in the codebase.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
api/thrift/api.thrift
(1 hunks)online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala
(1 hunks)online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
(1 hunks)online/src/test/scala/ai/chronon/online/test/stats/DistanceMetricsTest.scala
(1 hunks)
🔇 Additional comments (6)
online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1)
1-10
: LGTM! Well-organized imports and package structure.
The imports are properly organized and specific to the requirements. Good use of ScalaJavaConversions for Java-Scala interoperability.
online/src/test/scala/ai/chronon/online/test/stats/DistanceMetricsTest.scala (2)
1-10
: LGTM! Package structure and imports are well organized.
The imports are specific and properly structured for the test requirements.
1-128
: Verify consistency with other drift metric implementations.
As this is part of a WIP PR for drift metrics, let's verify the consistency of implementations across the codebase.
✅ Verification successful
Let me analyze the implementations further by checking the actual code in the related files.
Test implementation aligns with production code
The verification shows that the test implementation in DistanceMetricsTest
correctly validates the behavior of the production code:
- The thresholds used in tests (0.05 for low-moderate, 0.15 for moderate-severe) match the defined thresholds in
DistanceMetrics.thresholds
- The test covers both
percentileDistance
andhistogramDistance
methods which are used in production byTileDriftCalculator
- The test cases validate different drift scenarios (low, moderate, severe) that align with the production thresholds
- The implementation consistently uses the same drift metrics (JENSEN_SHANNON, HELLINGER, PSI) across test and production code
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for other drift metric implementations and thresholds
echo "Searching for drift metric implementations..."
rg -l "DriftMetric" --type scala
echo "\nSearching for threshold definitions..."
rg "val.*threshold.*=.*0\.(05|15)" --type scala
echo "\nSearching for other statistical distance calculations..."
rg "(percentileDistance|histogramDistance)" --type scala
Length of output: 2533
Script:
#!/bin/bash
# Check the implementation of DistanceMetrics
echo "=== DistanceMetrics.scala ==="
cat online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala
echo -e "\n=== TileDriftCalculator.scala ==="
cat online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
Length of output: 7789
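As a side note, the 0.05 / 0.15 thresholds referenced above lend themselves to a small severity helper. The Severity labels below are hypothetical and only illustrate how those thresholds bucket a drift score; they are not part of the PR:

sealed trait Severity
object Severity {
  case object Low extends Severity
  case object Moderate extends Severity
  case object Severe extends Severity

  // lowToModerate / moderateToSevere mirror Thresholds(0.05, 0.15) used for JSD and Hellinger
  def of(drift: Double, lowToModerate: Double = 0.05, moderateToSevere: Double = 0.15): Severity =
    if (drift < lowToModerate) Low
    else if (drift < moderateToSevere) Moderate
    else Severe
}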
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (3)

94-104: Handle potential division by zero in the normalize function.
In the normalize function, if the sum of the array elements is zero, dividing by sum will result in a division by zero error. Please add a check to handle the case when sum is zero, to prevent potential runtime exceptions.

62-63: Prevent division by zero in the histogramDistance function.
In the histogramDistance function, if aSum or bSum is zero, a division by zero will occur when normalizing histogram values. Ensure that aSum and bSum are checked for zero before performing divisions, to avoid potential errors.

107-109: Correct the handling of zero probabilities in the KL divergence calculation.
In the klDivergenceTerm function, when b is zero and a is not zero, the divergence should approach infinity, as the logarithm of zero is undefined. The current implementation sets the term to zero in such cases, which is mathematically inaccurate.
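One way to address this, sketched below, is to smooth zero probabilities with a small epsilon (the same trick the PSI term already uses) rather than silently returning zero; treating the term as infinite is the other mathematically faithful option. This is a suggestion, not what the PR currently implements:

// Sketch only: epsilon smoothing keeps the KL term finite but non-zero when b == 0.
def klDivergenceTerm(a: Double, b: Double, epsilon: Double = 1e-5): Double = {
  if (a == 0.0) 0.0 // lim_{a -> 0} a * log(a / b) = 0, so zero is correct here
  else {
    val bSafe = math.max(b, epsilon) // avoids log of a division by zero
    a * math.log(a / bSafe)
  }
}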
if (a.isSetPercentiles && b.isSetPercentiles) {
  val drift = percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric)
  result.setPercentilDrift(drift)
}
Add validation for percentile array lengths.
When calculating percentile distance, there's no validation that both arrays have matching lengths, which could lead to incorrect calculations or index out of bounds errors.
if (a.isSetPercentiles && b.isSetPercentiles) {
+ val aPercentiles = toArray(a.getPercentiles)
+ val bPercentiles = toArray(b.getPercentiles)
+ require(aPercentiles.length == bPercentiles.length, "Percentile arrays must have matching lengths")
- val drift = percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric)
+ val drift = percentileDistance(aPercentiles, bPercentiles, metric)
result.setPercentilDrift(drift)
}
if (a.isSetCount && b.isSetCount && b.getCount != 0.0) {
  // preserving sign, since direction of change is usually relevant
  val change = (a.getCount - b.getCount).toDouble / b.getCount
  result.setCountChangePercent(change * 100)
}

if (a.isSetNullCount && b.isSetNullCount && b.getNullCount != 0) {
  // preserving sign, since direction of change is usually relevant
  val aNullRatio = a.getNullCount.toDouble / a.getCount.toDouble
  val bNullRatio = b.getNullCount.toDouble / b.getCount.toDouble
  val change = (aNullRatio - bNullRatio) / bNullRatio
  result.setNullRatioChangePercent(change * 100)
}

if (a.isSetInnerCount && b.isSetInnerCount && b.getInnerCount != 0.0) {
  // preserving sign, since direction of change is usually relevant
  val change = (a.getInnerCount - b.getInnerCount).toDouble / b.getInnerCount
  result.setInnerCountChangePercent(change * 100)
}

if (a.isSetInnerNullCount && b.isSetInnerNullCount && b.getInnerNullCount != 0.0) {
  // preserving sign, since direction of change is usually relevant
  val aNullRatio = a.getInnerNullCount.toDouble / a.getInnerCount.toDouble
  val bNullRatio = b.getInnerNullCount.toDouble / b.getInnerCount.toDouble
  val change = (aNullRatio - bNullRatio) / bNullRatio
  result.setInnerNullCountChangePercent(change * 100)
}

if (a.isSetLengthPercentiles && b.isSetLengthPercentiles) {
  val drift = percentileDistance(toArray(a.getLengthPercentiles).map(_.toDouble),
                                 toArray(b.getLengthPercentiles).map(_.toDouble),
                                 metric)
  result.setLengthPercentilesDrift(drift)
}

if (a.isSetStringLengthPercentiles && b.isSetStringLengthPercentiles) {
  val drift = percentileDistance(toArray(a.getStringLengthPercentiles).map(_.toDouble),
                                 toArray(b.getStringLengthPercentiles).map(_.toDouble),
                                 metric)
  result.setStringLengthPercentilesDrift(drift)
}
🛠️ Refactor suggestion
Refactor repeated percentage calculation pattern.
The code repeats similar percentage calculation patterns across multiple metrics. Consider extracting this logic into a reusable helper method and introducing a constant for the percentage multiplier.
+private val PERCENTAGE_MULTIPLIER = 100.0
+
+private def calculatePercentageChange(newValue: Double, oldValue: Double): Double = {
+ require(oldValue != 0.0, "Old value cannot be zero")
+ ((newValue - oldValue) / oldValue) * PERCENTAGE_MULTIPLIER
+}
+
+private def calculateRatioChange(
+ newNumerator: Double,
+ newDenominator: Double,
+ oldNumerator: Double,
+ oldDenominator: Double
+): Double = {
+ val newRatio = newNumerator / newDenominator
+ val oldRatio = oldNumerator / oldDenominator
+ calculatePercentageChange(newRatio, oldRatio)
+}
Then use these helper methods:
if (a.isSetCount && b.isSetCount && b.getCount != 0.0) {
- val change = (a.getCount - b.getCount).toDouble / b.getCount
- result.setCountChangePercent(change * 100)
+ result.setCountChangePercent(calculatePercentageChange(a.getCount, b.getCount))
}
if (a.isSetNullCount && b.isSetNullCount && b.getNullCount != 0) {
- val aNullRatio = a.getNullCount.toDouble / a.getCount.toDouble
- val bNullRatio = b.getNullCount.toDouble / b.getCount.toDouble
- val change = (aNullRatio - bNullRatio) / bNullRatio
- result.setNullRatioChangePercent(change * 100)
+ result.setNullRatioChangePercent(calculateRatioChange(
+ a.getNullCount.toDouble, a.getCount.toDouble,
+ b.getNullCount.toDouble, b.getCount.toDouble
+ ))
}
// a is after, b is before
def between(a: TileSummaries, b: TileSummaries, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = {
  val result = new TileDrift()

  if (a.isSetPercentiles && b.isSetPercentiles) {
    val drift = percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric)
    result.setPercentilDrift(drift)
  }

  if (a.isSetHistogram && b.isSetHistogram) {
    val drift = histogramDistance(a.getHistogram, b.getHistogram, metric)
    result.setHistogramDrift(drift)
  }

  if (a.isSetCount && b.isSetCount && b.getCount != 0.0) {
    // preserving sign, since direction of change is usually relevant
    val change = (a.getCount - b.getCount).toDouble / b.getCount
    result.setCountChangePercent(change * 100)
  }

  if (a.isSetNullCount && b.isSetNullCount && b.getNullCount != 0) {
    // preserving sign, since direction of change is usually relevant
    val aNullRatio = a.getNullCount.toDouble / a.getCount.toDouble
    val bNullRatio = b.getNullCount.toDouble / b.getCount.toDouble
    val change = (aNullRatio - bNullRatio) / bNullRatio
    result.setNullRatioChangePercent(change * 100)
  }

  if (a.isSetInnerCount && b.isSetInnerCount && b.getInnerCount != 0.0) {
    // preserving sign, since direction of change is usually relevant
    val change = (a.getInnerCount - b.getInnerCount).toDouble / b.getInnerCount
    result.setInnerCountChangePercent(change * 100)
  }

  if (a.isSetInnerNullCount && b.isSetInnerNullCount && b.getInnerNullCount != 0.0) {
    // preserving sign, since direction of change is usually relevant
    val aNullRatio = a.getInnerNullCount.toDouble / a.getInnerCount.toDouble
    val bNullRatio = b.getInnerNullCount.toDouble / b.getInnerCount.toDouble
    val change = (aNullRatio - bNullRatio) / bNullRatio
    result.setInnerNullCountChangePercent(change * 100)
  }

  if (a.isSetLengthPercentiles && b.isSetLengthPercentiles) {
    val drift = percentileDistance(toArray(a.getLengthPercentiles).map(_.toDouble),
                                   toArray(b.getLengthPercentiles).map(_.toDouble),
                                   metric)
    result.setLengthPercentilesDrift(drift)
  }

  if (a.isSetStringLengthPercentiles && b.isSetStringLengthPercentiles) {
    val drift = percentileDistance(toArray(a.getStringLengthPercentiles).map(_.toDouble),
                                   toArray(b.getStringLengthPercentiles).map(_.toDouble),
                                   metric)
    result.setStringLengthPercentilesDrift(drift)
  }

  result
}
Add input validation and error handling.
The method lacks comprehensive input validation. Consider adding checks for:
- Null input parameters
- Zero denominators in ratio calculations
- Invalid metric types
def between(a: TileSummaries, b: TileSummaries, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = {
+ require(a != null, "After TileSummaries cannot be null")
+ require(b != null, "Before TileSummaries cannot be null")
+ require(metric != null, "DriftMetric cannot be null")
+
val result = new TileDrift()
type Histogram = java.util.Map[String, java.lang.Long]

def compareDistributions(meanA: Double,
                         varianceA: Double,
                         meanB: Double,
                         varianceB: Double,
                         breaks: Int = 20,
                         debug: Boolean = false): Map[DriftMetric, (Double, Double)] = {

  val aPercentiles = buildPercentiles(meanA, varianceA, breaks)
  val bPercentiles = buildPercentiles(meanB, varianceB, breaks)

  val aHistogram: Histogram = (0 to breaks)
    .map { i =>
      val value = java.lang.Long.valueOf((math.abs(aPercentiles(i)) * 100).toLong)
      i.toString -> value
    }
    .toMap
    .toJava

  val bHistogram: Histogram = (0 to breaks)
    .map { i =>
      val value = java.lang.Long.valueOf((math.abs(bPercentiles(i)) * 100).toLong)
      i.toString -> value
    }
    .toMap
    .toJava

  def calculateDrift(metric: DriftMetric): (Double, Double) = {
    val pDrift = percentileDistance(aPercentiles, bPercentiles, metric, debug = debug)
    val histoDrift = histogramDistance(aHistogram, bHistogram, metric)
    (pDrift, histoDrift)
  }

  Map(
    DriftMetric.JENSEN_SHANNON -> calculateDrift(DriftMetric.JENSEN_SHANNON),
    DriftMetric.PSI -> calculateDrift(DriftMetric.PSI),
    DriftMetric.HELLINGER -> calculateDrift(DriftMetric.HELLINGER)
  )
}
🛠️ Refactor suggestion
Add input validation and optimize Java-Scala conversions.
Several improvements could enhance the robustness and performance:
- Input Validation:
require(breaks > 0, "breaks must be positive")
require(varianceA > 0 && varianceB > 0, "variance must be positive")
- Performance:
- Consider using Scala collections throughout instead of converting to Java
- Cache the histogram calculations if used multiple times
test("Low drift - similar distributions") { | ||
val drifts = compareDistributions(meanA = 100.0, varianceA = 225.0, meanB = 101.0, varianceB = 225.0) | ||
|
||
// JSD assertions | ||
val (jsdPercentile, jsdHisto) = drifts(DriftMetric.JENSEN_SHANNON) | ||
jsdPercentile should be < 0.05 | ||
jsdHisto should be < 0.05 | ||
|
||
// Hellinger assertions | ||
val (hellingerPercentile, hellingerHisto) = drifts(DriftMetric.HELLINGER) | ||
hellingerPercentile should be < 0.05 | ||
hellingerHisto should be < 0.05 | ||
} | ||
|
||
test("Moderate drift - slightly different distributions") { | ||
val drifts = compareDistributions(meanA = 100.0, varianceA = 225.0, meanB = 105.0, varianceB = 256.0) | ||
|
||
// JSD assertions | ||
val (jsdPercentile, _) = drifts(DriftMetric.JENSEN_SHANNON) | ||
jsdPercentile should (be >= 0.05 and be <= 0.15) | ||
|
||
// Hellinger assertions | ||
val (hellingerPercentile, _) = drifts(DriftMetric.HELLINGER) | ||
hellingerPercentile should (be >= 0.05 and be <= 0.15) | ||
} | ||
|
||
test("Severe drift - different means") { | ||
val drifts = compareDistributions(meanA = 100.0, varianceA = 225.0, meanB = 110.0, varianceB = 225.0) | ||
|
||
// JSD assertions | ||
val (jsdPercentile, _) = drifts(DriftMetric.JENSEN_SHANNON) | ||
jsdPercentile should be > 0.15 | ||
|
||
// Hellinger assertions | ||
val (hellingerPercentile, _) = drifts(DriftMetric.HELLINGER) | ||
hellingerPercentile should be > 0.15 | ||
} | ||
|
||
test("Severe drift - different variances") { | ||
val drifts = compareDistributions(meanA = 100.0, varianceA = 225.0, meanB = 105.0, varianceB = 100.0) | ||
|
||
// JSD assertions | ||
val (jsdPercentile, _) = drifts(DriftMetric.JENSEN_SHANNON) | ||
jsdPercentile should be > 0.15 | ||
|
||
// Hellinger assertions | ||
val (hellingerPercentile, _) = drifts(DriftMetric.HELLINGER) | ||
hellingerPercentile should be > 0.15 | ||
} |
🛠️ Refactor suggestion
Enhance test coverage and document thresholds.
Current gaps in test coverage:
- Missing test cases for:
- Empty or single-value distributions
- Extreme value scenarios
- PSI metric assertions
- Some tests ignore histogram results without explanation
- Threshold values (0.05, 0.15) should be documented
Consider adding:
test("Edge cases - empty and single-value distributions") {
// Add tests for edge cases
}
test("PSI metric validation") {
val (psiPercentile, psiHisto) = drifts(DriftMetric.PSI)
// Add assertions for PSI
}
api/thrift/api.thrift
Outdated
struct TileDrift {
  1: optional double percentilDrift
  2: optional double histogramDrift
  3: optional double countChangePercent
  4: optional double nullRatioChangePercent

  5: optional double innerCountChangePercent // total of number of entries within all containers of this column
  6: optional double innerNullCountChangePercent
  7: optional double lengthPercentilesDrift

  8: optional double stringLengthPercentilesDrift
}
Fix typo and standardize field naming.
Issues found:
- Typo in field name: percentilDrift should be percentileDrift
- Inconsistent naming pattern: some fields use 'Change' while others use 'Drift'
Apply this diff to fix the issues:
struct TileDrift {
- 1: optional double percentilDrift
+ 1: optional double percentileDrift
- 3: optional double countChangePercent
- 4: optional double nullRatioChangePercent
+ 3: optional double countDrift
+ 4: optional double nullRatioDrift
- 5: optional double innerCountChangePercent
- 6: optional double innerNullCountChangePercent
+ 5: optional double innerCountDrift
+ 6: optional double innerNullCountDrift
}
// debug only, remove before merging
val deltas = Array.ofDim[Double](aNormalized.length)

while (i < aNormalized.length) {
  val ai = aNormalized(i)
  val bi = bNormalized(i)
  val delta = func(ai, bi)

  // debug only remove before merging
  deltas.update(i, delta)

  result += delta
  i += 1
}

if (debug) {
  def printArr(arr: Array[Double]): String =
    arr.map(v => f"$v%.3f").mkString(", ")
  println(f"""
             |aProjected : ${printArr(aProjected)}
             |bProjected : ${printArr(bProjected)}
             |aNormalized: ${printArr(aNormalized)}
             |bNormalized: ${printArr(bNormalized)}
             |deltas : ${printArr(deltas)}
             |result : $result%.4f
             |""".stripMargin)
}
Remove debug code or implement proper logging before merging.
The percentileDistance function contains debug code with println statements and comments indicating removal before merging. It's important to remove this debug code or replace it with appropriate logging, to maintain code cleanliness and prevent unintended console output in production. A minimal logging sketch follows below.
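A possible shape for the logging option, assuming an slf4j-style logger is available in this module (an assumption; swap in whatever logging facade the project standardizes on):

import org.slf4j.LoggerFactory

object DriftDebug {
  private val logger = LoggerFactory.getLogger(getClass)

  // Replaces ad-hoc println debugging: the message is only built when debug logging is enabled.
  def logDeltas(deltas: Array[Double], result: Double): Unit =
    if (logger.isDebugEnabled)
      logger.debug(f"deltas: ${deltas.mkString(", ")} result: $result%.4f")
}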
Largely looks good. Flagging a couple of qs / clean up items
api/thrift/api.thrift
Outdated
}

struct TileDrift {
  1: optional double percentilDrift
nit: s/percentil/percentile ?
var i = 0
var result = 0.0

// debug only, remove before merging
flagging to remove

signX * math.sqrt(term1 - term2)
}
type Histogram = java.util.Map[String, java.lang.Long]
isn't this type already declared in the DistanceMetrics class?
@@ -97,10 +97,10 @@ class LabelJoin(joinConf: api.Join, tableUtils: TableUtils, labelDS: String) {
   }

   labelJoinConf.setups.foreach(tableUtils.sql)
-  val labelTable = compute(PartitionRange(leftStart, leftEnd), stepDays, Option(labelDS))
+  val labelDf = compute(PartitionRange(leftStart, leftEnd), stepDays, Option(labelDS))
can revert?
@@ -78,7 +78,7 @@ object SparkSessionBuilder {
   // use all threads - or the tests will be slow
   .master("local[*]")
   .config("spark.kryo.registrationRequired", s"${localWarehouseLocation.isEmpty}")
-  .config("spark.local.dir", s"/tmp/$userName/$name")
+  .config("spark.local.dir", s"/tmp/$userName/${name}_$warehouseId")
for test flakiness?
Force-pushed from acee9b2 to 4cc9851 (Compare)
Actionable comments posted: 40
🧹 Outside diff range and nitpick comments (41)
hub/app/model/Model.scala (1)
8-14
: Consider performance implications of the proposed architectureThe proposed changes suggest a complex pipeline for processing time series data and calculating drift metrics. While the design appears sound, consider these architectural aspects:
Data Flow Optimization:
- Consider batching the tile key generation and fetching operations
- Implement streaming processing for large time series datasets
- Design the metadata storage for efficient group-by lookups
Scalability Considerations:
- Ensure the 1:n tile fetching can handle large-scale data efficiently
- Consider partitioning strategies for the tile storage
- Plan for horizontal scaling of the drift calculations
Integration Points:
- Design clear boundaries between metadata, tile processing, and controller layers
- Consider implementing a facade pattern to simplify the TimeSeriesController interface
- Plan for versioning of the TileSummaries and TileDrift formats
hub/conf/routes (2)
9-12
: Consider separating drift metrics into a dedicated endpoint.The current design combines different metric types in a single endpoint, which could lead to:
- Unclear parameter requirements for different metric types
- Potential confusion about when
algorithm
is required- Mixed concerns in the controller implementation
Consider splitting into dedicated endpoints:
+ GET /api/v1/join/:name/metrics # for regular metrics + GET /api/v1/join/:name/drift # specifically for drift analysis
18-26
: Review architectural approach for planned endpoints.The current TODO and commented routes raise several concerns:
- All routes point to the same controller method, which could become a complex catch-all handler
- Some endpoints appear to have overlapping functionality (e.g.,
/values
vs/distribution
)- The granularity of endpoints aligns better with Single Responsibility Principle
Would you like help with:
- Designing separate controller methods for each endpoint?
- Creating an architectural document outlining the responsibilities of each endpoint?
- Implementing proper parameter validation for each endpoint type?
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (2)
50-53
: Consider adding error handling for KVStore initializationWhile the switch to
InMemoryKVStore
is good, consider adding error handling for the table creation operation.val kvStore: () => KVStore = () => { val result = OnlineUtils.buildInMemoryKVStore(namespace) - result.create(Constants.DriftStatsTable, props) + try { + result.create(Constants.DriftStatsTable, props) + } catch { + case e: Exception => + throw new RuntimeException(s"Failed to create DriftStatsTable: ${e.getMessage}", e) + } result }
Line range hint
26-110
: Consider restructuring tests using ScalaTest fixturesThe test class could benefit from using ScalaTest fixtures to manage test resources and cleanup. This would provide a more structured approach to handling KVStore initialization and cleanup.
Consider refactoring to use
BeforeAndAfterEach
:class DriftTest extends AnyFlatSpec with Matchers with BeforeAndAfterEach { private var kvStore: KVStore = _ override def beforeEach(): Unit = { super.beforeEach() kvStore = InMemoryKvStore.build(namespace, () => null) kvStore.create(Constants.MetadataDataset) kvStore.create(Constants.DriftStatsTable) } override def afterEach(): Unit = { if (kvStore != null) { kvStore.drop(Constants.MetadataDataset) kvStore.drop(Constants.DriftStatsTable) } super.afterEach() } // ... existing test cases ... }online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala (2)
28-37
: Add documentation for the inverse error function implementation.The
inverseErf
method uses an approximation but lacks documentation about its accuracy and source.Consider adding a documentation comment:
+ /** + * Approximates the inverse error function using the method described in + * [Add citation or reference to the approximation method] + * + * @param x Input value in range [-1, 1] + * @return Approximated inverse error function value + * @note Approximation accuracy: [Add details about accuracy] + */ def inverseErf(x: Double): Double = { + require(x >= -1 && x <= 1, "Input must be in range [-1, 1]") val a = 0.147
79-127
: Enhance test coverage with additional scenarios.The test suite would benefit from additional test cases:
- Edge cases (e.g., very small/large variances)
- Error scenarios (e.g., negative variance)
- Documentation for threshold choices (0.05, 0.15)
Consider adding these test cases:
test("Edge case - very small variance") { val drifts = compareDistributions(meanA = 100.0, varianceA = 0.001, meanB = 100.0, varianceB = 0.001) // Add assertions } test("Should fail with negative variance") { assertThrows[IllegalArgumentException] { compareDistributions(meanA = 100.0, varianceA = -1.0, meanB = 100.0, varianceB = 1.0) } }cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1)
50-53
: Add test coverage for timestamp edge cases.The timestamp handling involves several edge cases that should be tested:
- Events exactly at
startTsMillis
(should be excluded)- Events at
queryTime
(should be included)- Milliseconds to microseconds conversion accuracy
Would you like me to help create test cases for these scenarios? I can generate a test suite that covers these edge cases.
hub/app/store/DynamoDBMonitoringStore.scala (1)
Line range hint
93-143
: Enhance error handling and simplify configuration loading logic.The current implementation has several areas for improvement:
- Silent failures with empty
LoadedConfs()
- Repetitive configuration construction
- Lack of structured error reporting
Consider refactoring to:
- Create an enumeration for configuration types:
sealed trait ConfigType object ConfigType { case object Join extends ConfigType case object GroupBy extends ConfigType case object StagingQuery extends ConfigType case object Model extends ConfigType }
- Simplify the pattern matching:
def parseConfigType(key: String): Option[ConfigType] = { if (key.contains("joins/")) Some(ConfigType.Join) else if (key.contains("group_bys/")) Some(ConfigType.GroupBy) else if (key.contains("staging_queries/")) Some(ConfigType.StagingQuery) else if (key.contains("models/")) Some(ConfigType.Model) else None } private def makeLoadedConfs(response: ListResponse): LoadedConfs = { response.values .map { seqAB => val seqKVStrings = seqAB.map(kv => (new String(kv.keyBytes, StandardCharsets.UTF_8), new String(kv.valueBytes, StandardCharsets.UTF_8))) val (successes, failures) = seqKVStrings.map { case (key, value) => parseConfigType(key).map { configType => try { configType match { case ConfigType.Join => Right(LoadedConfs(joins = Seq(getConf[api.Join](value)))) case ConfigType.GroupBy => Right(LoadedConfs(groupBys = Seq(getConf[api.GroupBy](value)))) case ConfigType.StagingQuery => Right(LoadedConfs(stagingQueries = Seq(getConf[api.StagingQuery](value)))) case ConfigType.Model => Right(LoadedConfs(models = Seq(getConf[api.Model](value)))) } } catch { case e: Exception => Left(s"Failed to parse $configType: ${e.getMessage}") } }.getOrElse(Left(s"Unknown configuration type for key: $key")) }.partition(_.isRight) failures.foreach { case Left(error) => logger.error(error) // Add metrics here to track parsing failures } successes.collect { case Right(conf) => conf } .foldLeft(LoadedConfs()) { (acc, curr) => LoadedConfs( joins = acc.joins ++ curr.joins, groupBys = acc.groupBys ++ curr.groupBys, stagingQueries = acc.stagingQueries ++ curr.stagingQueries, models = acc.models ++ curr.models ) } } .getOrElse(LoadedConfs()) }This refactoring:
- Provides better error tracking and reporting
- Simplifies the configuration loading logic
- Makes it easier to add new configuration types
- Separates concerns between parsing and accumulation
Consider adding metrics to track parsing failures and configuration loading performance:
Metrics.counter("config_parse_failures_total").increment() Metrics.timer("config_loading_duration").record(duration)spark/src/test/scala/ai/chronon/spark/test/InMemoryKvStore.scala (1)
Line range hint
45-120
: Add test coverage for drift metrics scenarios.Since this PR implements drift metrics functionality, please ensure to add test cases that cover:
- Timestamp-based version filtering with various time ranges
- Edge cases in drift calculations
- Concurrent access patterns with drift metrics
This aligns with the PR objectives of adding unit tests.
Would you like help in generating comprehensive test cases for drift metrics scenarios?
spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (2)
Line range hint
142-156
: Improve robustness and documentation of the new streaming method.Several improvements are recommended:
- Replace the hardcoded sleep duration with a configurable parameter
- Add comprehensive documentation explaining:
- The method's purpose and behavior
- Why async handling is needed
- The differences from
putStreaming
- Consider using a more robust approach than
Thread.sleep
+ /** + * Handles streaming data processing using JoinSourceRunner. + * This method is intended to replace putStreaming (see TODO below). + * + * @param originalGroupByConf The original GroupBy configuration to process + * @param ds The dataset identifier + * @param namespace The namespace for the operation + * @param kvStoreFunc Function to generate KVStore instances + * @param debug Enable debug mode + * @param asyncWaitMs Time to wait for async operations (default: 5000) + */ def putStreamingNew(originalGroupByConf: api.GroupBy, ds: String, namespace: String, kvStoreFunc: () => KVStore, - debug: Boolean)(implicit session: SparkSession): Unit = { + debug: Boolean, + asyncWaitMs: Long = 5000)(implicit session: SparkSession): Unit = { implicit val mockApi = new MockApi(kvStoreFunc, namespace) val groupByConf = originalGroupByConf.deepCopy() val source = groupByConf.streamingSource.get mutateTopicWithDs(source, ds) val groupByStreaming = new JoinSourceRunner(groupByConf, Map.empty, debug = debug, lagMillis = 0) val query = groupByStreaming.chainedStreamingQuery.trigger(Trigger.Once()).start() - // drain work scheduled as futures over the executioncontext - Thread.sleep(5000) - // there is async stuff under the hood of chained streaming query + // Wait for async operations in the chained streaming query to complete + Thread.sleep(asyncWaitMs) query.awaitTermination() }Would you like me to create a GitHub issue to track the deprecation of
putStreaming
?
Line range hint
153-171
: Document the conditional streaming logic and address the TODO.The integration of
putStreamingNew
looks correct, but there are two areas for improvement:
- The conditional logic between
putStreamingNew
andputStreaming
should be documented- The TODO about
dropDsOnWrite
needs to be addressedConsider adding a comment explaining the streaming method selection:
if (groupByConf.inferredAccuracy == Accuracy.TEMPORAL && groupByConf.streamingSource.isDefined) { val streamingSource = groupByConf.streamingSource.get inMemoryKvStore.create(groupByConf.streamingDataset) + // Use putStreamingNew for join sources, fallback to putStreaming for other cases if (streamingSource.isSetJoinSource) {
Would you like me to help investigate and document the
dropDsOnWrite
quirk mentioned in the TODO?online/src/main/scala/ai/chronon/online/MetadataStore.scala (1)
Line range hint
231-244
: Consider enhancing the create method with props supportBased on the learnings from PR #47, the
KVStore
trait supports acreate
method with props, but this implementation doesn't utilize it. Consider whether supporting props would be beneficial for future extensibility.-def create(dataset: String): Unit = { +def create(dataset: String, props: Map[String, Any] = Map.empty): Unit = { try { logger.info(s"Creating dataset: $dataset") - kvStore.create(dataset) + kvStore.create(dataset, props) logger.info(s"Successfully created dataset: $dataset") } catch { case e: Exception =>spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala (2)
Line range hint
186-189
: Enhance drift metrics test coverageThe drift metrics test section needs improvement:
- It lacks assertions to verify the expected behavior
- Only logs the response without validating the content
- Missing test cases for edge cases and error scenarios
Consider adding comprehensive test cases:
// Example structure for enhanced drift metrics testing val driftRequest = StatsRequest(joinConf.metaData.nameToFilePath + "/drift", None, None) val fetchedDriftSeries = fetchStatsSeries(driftRequest, mockApi) // Add assertions assert(fetchedDriftSeries.values.isDefined, "Drift metrics response should contain values") val driftValues = fetchedDriftSeries.values.get // Verify specific metrics assert(driftValues.contains("drift_score"), "Response should contain drift score") assert(driftValues("drift_score").asInstanceOf[Double] >= 0.0, "Drift score should be non-negative") // Test edge cases val emptyDriftRequest = StatsRequest(joinConf.metaData.nameToFilePath + "/drift", Some(0L), Some(0L)) val emptyDriftSeries = fetchStatsSeries(emptyDriftRequest, mockApi) assert(emptyDriftSeries.values.isEmpty, "Empty time range should return no drift metrics")
Line range hint
67-189
: Improve test maintainability and readabilityThe test method
testFetchStats
is quite long (120+ lines) and handles multiple concerns. Consider breaking it down into smaller, focused test methods:
- Setup methods for data generation
- Separate test methods for each type of metric
- Helper methods for common assertions
Example structure:
class FetchStatsTest extends TestCase { // Setup methods private def setupTestData(): (String, String) = { // Extract data setup logic here (itemQueriesTable, viewsTable) } private def setupJoinConfiguration(itemQueriesTable: String, viewsTable: String): Builders.Join = { // Extract join configuration logic here } // Individual test methods def testBasicStats(): Unit = { val (itemQueriesTable, viewsTable) = setupTestData() val joinConf = setupJoinConfiguration(itemQueriesTable, viewsTable) // Test basic stats fetching } def testLogStats(): Unit = { // Test log stats fetching } def testConsistencyMetrics(): Unit = { // Test consistency metrics } def testDriftMetrics(): Unit = { // Comprehensive drift metrics testing } }api/src/main/scala/ai/chronon/api/Builders.scala (1)
270-271
: Consider grouping related parametersThe
historicalBackfill
parameter appears to be related to other historical/backfill parameters. Consider moving it next to related parameters for better organization and readability.Consider reordering parameters to group related ones together, for example moving
historicalBackfill
nearbackfillStartDate
in theGroupBy.apply
method.online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala (3)
13-168
: Consider enhancing test coverage with edge casesWhile the current test coverage is comprehensive, consider adding these scenarios:
- Extreme values in percentiles (e.g., Double.MAX_VALUE, Double.MIN_VALUE)
- Performance test with larger datasets
- Documentation for complex test scenarios (e.g., the histogram merging logic)
170-315
: Consider adding property-based tests for drift metricsThe drift tests are thorough, but consider these enhancements:
- Add property-based tests using ScalaCheck to verify drift calculations
- Test boundary conditions (e.g., 100% drift, negative drifts if applicable)
- Include examples with known drift calculations to serve as documentation
Example property-based test:
import org.scalacheck.Prop.forAll import org.scalacheck.Gen "pivot_drift" should "maintain value integrity under any valid input" in { val driftGen = for { percentilDrift <- Gen.choose(0.0, 1.0) histogramDrift <- Gen.choose(0.0, 1.0) timestamp <- Gen.posNum[Long] } yield { val drift = new TileDrift() drift.setPercentilDrift(percentilDrift) drift.setHistogramDrift(histogramDrift) (drift, timestamp) } forAll(Gen.listOf(driftGen)) { drifts => val result = pivot(drifts.toArray) // Property: Input values should be preserved in output series drifts.sortBy(_._2).map(_._1.getPercentilDrift) == result.getPercentilDriftSeries.asScala.toList } }
1-315
: Consider improving test maintainabilityWhile the tests are well-structured, consider these improvements for better maintainability:
- Extract common test data setup into helper methods or fixtures
- Use more descriptive test names that follow the "given/when/then" pattern
- Add comments explaining the expected behavior for complex scenarios
Example helper method:
private def createDriftWithValues( percentilDrift: Double, histogramDrift: Double, countChange: Double ): TileDrift = { val drift = new TileDrift() drift.setPercentilDrift(percentilDrift) drift.setHistogramDrift(histogramDrift) drift.setCountChangePercent(countChange) drift }spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (2)
85-86
: Consider thread safety for mutable map operationsThe use of mutable state in
columnTileSummaries
could lead to race conditions if this code is executed in a parallel context. Consider:
- Using immutable data structures
- Adding synchronization if parallel execution is possible
114-141
: Consider simplifying Java interop codeWhile the Java interop code is functional, it could be simplified using Scala's built-in converters:
- row.getSeq[Double](index).map(lang.Double.valueOf).toJava + row.getSeq[Double](index).asJavaThe pattern matching is comprehensive and type-safe, handling all metric types appropriately.
api/thrift/api.thrift (4)
Line range hint
253-267
: Add documentation for the TileSummary struct and its fields.The struct would benefit from documentation explaining:
- The purpose and usage of each field
- The format and meaning of percentile values
- The relationship between container types and inner counts
Add documentation like this:
+/** + * Statistical summary of data within a time tile + */ struct TileSummary { + /** Percentile values for numeric data */ + /** Distribution of categorical values */ // ... add docs for remaining fields }
268-284
: Add formal documentation and consider type safety improvements.
- The inline comment about pivot operation should be formalized into proper documentation
- Consider adding constraints or validation for list lengths matching timestamp count
Add documentation like this:
+/** + * Time series of statistical summaries, created by pivoting (TileSummary, timestamp) tuples + * All lists in this struct must have the same length as the timestamps list + */ struct TileSummarySeries {
339-346
: Add documentation for the new DriftSpec fields.The new fields would benefit from more detailed documentation explaining:
- The relationship between tileSize and lookbackWindows
- The impact of different window sizes on drift detection
Add documentation like this:
+ /** Size of the time window for computing tile summaries */ + /** Windows sizes for comparing current tile with historical data */ + /** Default metric for computing drift between tiles */
380-382
: Add documentation for the new driftSpec field.Add documentation like this:
+ /** Configuration for computing and monitoring data drift */
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (3)
30-36
: Enhance schema validation error messages for better clarityThe current
require
statements provide basic error messages which may not give sufficient context when validation fails.Improve the error messages to include the actual data types encountered, aiding in quicker debugging.
Apply this diff:
require(missingColumns.isEmpty, s"Missing required columns: ${missingColumns.mkString(", ")}") -require(summaryDF.schema(keyIndex).dataType == types.BinaryType, "keyBytes must be BinaryType") -require(summaryDF.schema(valueIndex).dataType == types.BinaryType, "valueBytes must be BinaryType") -require(summaryDF.schema(timestampIndex).dataType == types.LongType, "timestamp must be LongType") +require( + summaryDF.schema(keyIndex).dataType == types.BinaryType, + s"Column 'keyBytes' must be BinaryType but found ${summaryDF.schema(keyIndex).dataType}" +) +require( + summaryDF.schema(valueIndex).dataType == types.BinaryType, + s"Column 'valueBytes' must be BinaryType but found ${summaryDF.schema(valueIndex).dataType}" +) +require( + summaryDF.schema(timestampIndex).dataType == types.LongType, + s"Column 'timestamp' must be LongType but found ${summaryDF.schema(timestampIndex).dataType}" +)
16-22
: Consider initializingKVStore
outside of the partition to avoid repeated object creationCurrently,
kvStore
is initialized within each partition, which could lead to multiple unnecessary initializations.If
KVStore
is thread-safe and can be shared across partitions, consider initializing it once outside theforeachPartition
to improve efficiency.Apply this diff:
def run(): Unit = { // Validate schema // ... schema validation code ... + val kvStore: KVStore = api.genKvStore + summaryDF.rdd.foreachPartition((rows: Iterator[Row]) => { - val kvStore: KVStore = api.genKvStore def toPutRequest(row: Row): PutRequest = { // ... toPutRequest implementation ... } // ... rest of the code ... }) }
13-14
: Consolidate imports for better code organizationThe imports for
scala.util.Failure
andscala.util.Success
can be combined into a single statement for brevity.Apply this diff:
-import scala.util.Failure -import scala.util.Success +import scala.util.{Failure, Success}online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala (4)
124-125
: Replace magic numbers with named constants for clarityThe small value
1e-5
used to adjust zero probabilities in thepsiTerm
method can be defined as a named constant to improve code readability and maintainability.Consider introducing a named constant:
private val EPSILON = 1e-5And update the code:
val aFixed = if (a == 0.0) 1e-5 else a val bFixed = if (b == 0.0) 1e-5 else b +val aFixed = if (a == 0.0) EPSILON else a +val bFixed = if (b == 0.0) EPSILON else bThis makes it easier to adjust the epsilon value in the future and documents its purpose.
95-95
: Optimize thenormalize
method by normalizing in placeThe TODO comment suggests optimizing the
normalize
method by performing normalization in place if it's a performance hotspot. Implementing this can reduce memory allocations and improve performance.Here's how you might modify the method to normalize in place:
private def normalize(arr: Array[Double]): Array[Double] = { val sum = arr.sum + if (sum == 0.0) { + // Handle zero-sum case by returning an array of zeros + return Array.fill(arr.length)(0.0) + } - val result = Array.ofDim[Double](arr.length) var i = 0 while (i < arr.length) { - result.update(i, arr(i) / sum) + arr.update(i, arr(i) / sum) i += 1 } - result + arr }This modifies the input array directly, which can improve performance by avoiding the creation of a new array.
130-135
: Ensure exhaustive pattern matching intermFunc
methodThe
termFunc
method matches onDriftMetric
, but does not provide a default case. IfDriftMetric
can have new values in the future, consider adding a default case to handle unexpected values.Add a default case to prevent potential
MatchError
exceptions:d match { case DriftMetric.PSI => psiTerm case DriftMetric.HELLINGER => hellingerTerm case DriftMetric.JENSEN_SHANNON => jsdTerm + case _ => throw new IllegalArgumentException(s"Unsupported DriftMetric: $d") }
Alternatively, if
DriftMetric
is a sealed trait or enum and all cases are covered, you can ignore this suggestion.
146-151
: Ensure exhaustive pattern matching inthresholds
methodSimilar to
termFunc
, thethresholds
method matches onDriftMetric
without a default case. Ensure that all possible values are handled or add a default case.Consider adding a default case:
d match { case DriftMetric.JENSEN_SHANNON => Thresholds(0.05, 0.15) case DriftMetric.HELLINGER => Thresholds(0.05, 0.15) case DriftMetric.PSI => Thresholds(0.1, 0.2) + case _ => throw new IllegalArgumentException(s"Unsupported DriftMetric: $d") }
This prevents potential runtime exceptions if new
DriftMetric
values are introduced.online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1)
121-122
: Optimize data fetching to prevent over-fetching ingetDriftSeries
The TODO comment suggests potential over-fetching when
lookBack
is significantly larger than the range betweenstartMs
andendMs
. This can lead to unnecessary resource consumption and decreased performance.Would you like assistance in implementing an optimization to adjust the fetched data range based on the
lookBack
window and the actual time range (startMs
toendMs
)? This could involve calculating a more precise fetch window or merging summaries efficiently.online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (2)
77-79
: Consider avoiding custom operators for better readabilityThe custom pipe operator
|>
defined inPipeOps
may reduce code readability, especially for developers unfamiliar with this pattern. Using standard method chaining or existing operators can improve maintainability and team comprehension.Remove the custom operator and adjust the chained calls:
- implicit class PipeOps[A](val value: A) extends AnyVal { - def |>[B](f: A => B): B = f(value) - }And update usages, for example:
- .setPercentiles(iterator(_.getPercentiles).toArray |> transpose) + .setPercentiles(transpose(iterator(_.getPercentiles).toArray))
16-18
: Update or remove outdated TODO commentThe TODO comment on line 18 suggests that
TileDrift => TileDriftSeries
implementation is pending. Sincepivot(driftsWithTimestamps: Array[(TileDrift, Long)])
is now implemented, consider updating or removing this comment to reflect the current state of the code.Apply this diff to remove the outdated comment:
- // TODO: TileDrift => TileDriftSeries
online/src/main/scala/ai/chronon/online/stats/Display.scala (3)
32-36
: Use a predefined color palette for consistent chart series colorsThe current implementation uses randomly generated colors for each series, which might result in similar or hard-to-distinguish colors. Using a predefined color palette ensures each series has a distinct and recognizable color.
Apply this diff to use a predefined color palette:
+ val colors = Seq("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b") - val seriesConfig = chart.seriesList.map(s => s"""{ + val seriesConfig = chart.seriesList.zipWithIndex.map { case (s, i) => s"""{ | label: "${s.name}", - | stroke: "rgb(${scala.util.Random.nextInt(255)}, ${scala.util.Random.nextInt(255)}, ${scala.util.Random.nextInt(255)})" + | stroke: "${colors(i % colors.length)}" | - |}""".stripMargin).mkString(",\n") + |}""".stripMargin }.mkString(",\n")
174-175: Remove commented-out code to improve readability. The commented-out line // page.render is unnecessary and can be removed to keep the codebase clean and maintainable.

Apply this diff to remove the commented code:

- // page.render
182-183: Bind the server to localhost for enhanced security. Currently, the server listens on all network interfaces (0.0.0.0). Since this tool is intended for debugging purposes, consider binding the server to localhost (127.0.0.1) to prevent external access.

Apply this diff to bind the server to localhost:

  override def host: String =
-   "0.0.0.0"
+   "127.0.0.1"
  override def port: Int = portVal

api/src/main/scala/ai/chronon/api/Extensions.scala (1)
163-169: Simplify the driftTileSize method using flatMap. The driftTileSize method can be simplified by utilizing flatMap to avoid nested Options and make the code more concise.

Here's the suggested refactor:

-def driftTileSize: Option[Window] = {
-  Option(metaData.getDriftSpec) match {
-    case Some(driftSpec) =>
-      Option(driftSpec.getTileSize).orElse(Some(Constants.DefaultDriftTileSize))
-    case None => None
-  }
+def driftTileSize: Option[Window] =
+  Option(metaData.getDriftSpec)
+    .flatMap(driftSpec => Option(driftSpec.getTileSize))
+    .orElse(Some(Constants.DefaultDriftTileSize))

spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (2)
119-124: Prefer Scala collections over Java collections for consistency. Consider using Scala's collections and converters instead of Java's ArrayList to create the windows list. This promotes idiomatic Scala code and enhances readability.

Apply this diff to use Scala's List and convert it to a Java list:

+import scala.collection.JavaConverters._
 val driftSpec = new DriftSpec()
 driftSpec.setTileSize(new Window(30, TimeUnit.MINUTES))
 driftSpec.setDriftMetric(DriftMetric.JENSEN_SHANNON)
-val windows = new java.util.ArrayList[Window]()
-windows.add(new Window(30, TimeUnit.MINUTES))
+val windows = List(new Window(30, TimeUnit.MINUTES))
 driftSpec.setLookbackWindows(windows.asJava)
294-294: Avoid code duplication by reusing generateNonOverlappingWindows. The anomalyWindows variable is generated in both generateTimeseriesWithAnomalies and generateFraudSampleData. Consider refactoring to avoid duplication and maintain consistency.

You might extract the anomaly window generation into a shared method or pass anomalyWindows as a parameter to reuse the logic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (43)
- .github/workflows/test_scala_and_python.yaml (2 hunks)
- api/src/main/scala/ai/chronon/api/Builders.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/Constants.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/Extensions.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (2 hunks)
- api/thrift/api.thrift (4 hunks)
- build.sbt (1 hunks)
- cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2 hunks)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1 hunks)
- hub/app/controllers/TimeSeriesController.scala (4 hunks)
- hub/app/model/Model.scala (1 hunks)
- hub/app/store/DynamoDBMonitoringStore.scala (1 hunks)
- hub/conf/routes (1 hunks)
- online/src/main/scala/ai/chronon/online/Api.scala (3 hunks)
- online/src/main/scala/ai/chronon/online/Fetcher.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/FetcherBase.scala (2 hunks)
- online/src/main/scala/ai/chronon/online/MetadataStore.scala (3 hunks)
- online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/Display.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala (0 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (4 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/ChainingFetcherTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/FetcherTest.scala (5 hunks)
- spark/src/test/scala/ai/chronon/spark/test/InMemoryKvStore.scala (3 hunks)
- spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/SchemaEvolutionTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/StreamingTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/DerivationTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (3 hunks)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (5 hunks)
💤 Files with no reviewable changes (1)
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala
✅ Files skipped from review due to trivial changes (1)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala
🚧 Files skipped from review as they are similar to previous changes (6)
- api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala
- online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
- online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala
- spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
🧰 Additional context used
📓 Learnings (3)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (6)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Errors and logging are handled at a granular level in `handleDynamoDbOperation` by passing the metrics context and updating relevant metric values based on the invoking operation.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Leveraging DynamoDB batch writes for performance optimization in `multiPut` will be addressed in a future review.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
online/src/main/scala/ai/chronon/online/MetadataStore.scala (1)
Learnt from: chewy-zlai
PR: zipline-ai/chronon#47
File: online/src/main/scala/ai/chronon/online/MetadataStore.scala:232-0
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In the `KVStore` trait located at `online/src/main/scala/ai/chronon/online/KVStore.scala`, there are two `create` methods: `def create(dataset: String): Unit` and `def create(dataset: String, props: Map[String, Any]): Unit`. The version with `props` ignores the `props` parameter, and the simpler version without `props` is appropriate when `props` are not needed.
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (3)
Learnt from: chewy-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:37-40
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Avoid using `Await.result` in production code; prefer handling `Future`s asynchronously when possible to prevent blocking.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Spark applications, when defining serializable classes, passing an implicit `ExecutionContext` parameter can cause serialization issues. In such cases, it's acceptable to use `scala.concurrent.ExecutionContext.Implicits.global`.
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:19-47
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Scala, the `grouped` method on collections returns an iterator, allowing for efficient batch processing without accumulating all records in memory.
🪛 actionlint
.github/workflows/test_scala_and_python.yaml
52-52: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (48)
hub/app/model/Model.scala (2)
12-14
: Review the tile processing pipeline design
The proposed tile processing pipeline introduces several critical components:
- Fetching TileSummaries from tileKey (1:n relationship)
- Comparing TileSummaries to generate TileDrift
- Integration with TimeSeriesController using column-to-tileDrift mapping
Consider the following architectural recommendations:
- Ensure the TileSummaries fetching is optimized for the 1:n relationship
- Consider caching strategies for frequently accessed tiles
- Design the TileDrift comparison to be parallelizable (a sketch follows below)
- Plan for error handling in the TimeSeriesController integration
This aligns with the existing metric types (Drift, Skew) and algorithms (PSI, KL) but may require additional drift algorithms based on the comparison requirements.
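For the parallelization point above, a minimal sketch of a column-parallel comparison step; compare stands in for whatever TileSummary-to-TileDrift function the pipeline ends up using, and all names here are illustrative rather than existing APIs:

import scala.concurrent.{ExecutionContext, Future}

// compares old vs. new summaries column by column, each comparison running independently
def driftByColumn[S, D](oldSummaries: Map[String, S],
                        newSummaries: Map[String, S],
                        compare: (S, S) => D)(implicit ec: ExecutionContext): Future[Map[String, D]] = {
  val perColumn = newSummaries.toSeq.collect {
    case (column, current) if oldSummaries.contains(column) =>
      Future(column -> compare(oldSummaries(column), current))
  }
  Future.sequence(perColumn).map(_.toMap)
}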
8-11
: Verify the metadata upload and column listing functionality
The proposed workflow suggests significant architectural changes:
- Mapping join configurations to group-by columns
- New methods for fetching join configurations
- Generation of tile keys from columns and time parameters
These changes will require careful integration with existing data structures. Please ensure:
- The GroupBy and Join case classes can support the proposed metadata mapping
- The column listing functionality aligns with the existing joinFeatures and features fields
hub/conf/routes (2)
6-8
: Add parameter documentation and validation.
The new model timeseries endpoint lacks documentation about:
- Valid ranges for startTs and endTs
- Supported values for offset and algorithm
- Response format and possible error cases
13-16
: Clarify granularity parameter and ensure consistency with join endpoints.
The feature endpoint introduces a granularity parameter not present in join endpoints:
- Its purpose and valid values should be documented
- Consider whether this parameter should also apply to join endpoints for consistency
api/src/main/scala/ai/chronon/api/Constants.scala (2)
66-68
: LGTM: Well-structured drift-related constants
The new constants follow consistent naming conventions and are logically grouped together. This provides a good foundation for the drift metrics implementation.
40-40
: Verify the impact of metadata key rename
The constant has been renamed from ChrononMetadataKey
to MetadataDataset
with a different value. This is a breaking change that needs careful verification.
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (2)
9-9
: LGTM: Import additions align with new functionality
The new imports for DriftStore
and InMemoryKvStore
are appropriate for the test modifications.
Also applies to: 17-17
77-77
: Verify test coverage with reduced sample size
The sample size has been reduced from 1M to 100K rows. While this improves test execution time, ensure it still provides adequate coverage for edge cases and anomaly detection scenarios.
Also applies to: 80-80
spark/src/test/scala/ai/chronon/spark/test/StreamingTest.scala (2)
113-113
: LGTM! Usage is consistent with import change.
The constant usage in inMemoryKvStore.create()
is correctly updated to match the import change.
23-23
: LGTM! Verify consistent usage across codebase.
The import change from ChrononMetadataKey
to MetadataDataset
improves naming clarity.
Let's verify consistent usage across the codebase:
✅ Verification successful
Verified consistent usage of MetadataDataset across the codebase. All references to ChrononMetadataKey have been successfully replaced with MetadataDataset.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to the old constant name
# and verify consistent usage of the new constant.
# Check for any remaining references to old constant
echo "Checking for remaining ChrononMetadataKey references:"
rg "ChrononMetadataKey"
# Check usage pattern of new constant
echo "Verifying MetadataDataset usage:"
rg "MetadataDataset"
Length of output: 3429
.github/workflows/test_scala_and_python.yaml (1)
67-67
: Verify memory requirements for drift metrics tests
The significant increase in memory allocation (from 8G to 24G) suggests these tests have substantial memory requirements. While this aligns with the drift metrics implementation, we should verify this is the minimum required memory to prevent resource waste.
online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala (1)
1-11
: LGTM! Test class setup follows best practices.
The test class is properly structured with appropriate imports and test framework extensions.
spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala (2)
18-18
: LGTM: Import statement updated for better clarity
The change from ChrononMetadataKey
to MetadataDataset
provides a more descriptive name that better reflects its purpose.
97-97
: LGTM: Constant usage updated consistently
The usage of MetadataDataset
is consistent with the import change. The test's behavior remains unchanged.
Let's verify this refactoring is consistent across the codebase:
✅ Verification successful
Verification Successful: Consistent Usage of MetadataDataset Confirmed
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the consistent usage of MetadataDataset across the codebase
# and ensure no lingering references to the old ChrononMetadataKey
# Check for any remaining references to the old constant
echo "Checking for any remaining references to ChrononMetadataKey..."
rg "ChrononMetadataKey"
# Verify the usage of new constant
echo "Verifying usage of MetadataDataset..."
rg "MetadataDataset"
Length of output: 3457
spark/src/test/scala/ai/chronon/spark/test/InMemoryKvStore.scala (2)
45-45
: LGTM! Explicit type annotation improves clarity.
The addition of explicit type annotation for the encoder enhances type safety and code readability.
120-120
: LGTM! Improved thread safety with computeIfAbsent.
The change from put
to computeIfAbsent
makes the dataset creation atomic and prevents potential race conditions in concurrent scenarios.
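To make the atomicity point concrete, here is a tiny standalone illustration of the pattern (not the project's actual InMemoryKvStore code; the map types are assumptions):

import java.util.concurrent.ConcurrentHashMap

// two threads calling create("x") concurrently still end up with a single inner map,
// because computeIfAbsent runs the factory at most once per key
val database = new ConcurrentHashMap[String, ConcurrentHashMap[String, Array[Byte]]]()

def create(dataset: String): Unit =
  database.computeIfAbsent(dataset, _ => new ConcurrentHashMap[String, Array[Byte]]())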
spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (1)
160-160
: LGTM! Verify constant usage consistency.
The rename from ChrononMetadataKey
to MetadataDataset
aligns with the standardization effort.
✅ Verification successful
Verified! All instances of ChrononMetadataKey
have been successfully updated to MetadataDataset
.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify that all occurrences of ChrononMetadataKey have been updated
rg "ChrononMetadataKey" --type scala
Length of output: 36
online/src/main/scala/ai/chronon/online/MetadataStore.scala (3)
19-19
: LGTM: Import change aligns with the metadata dataset refactoring
The addition of MetadataDataset
import is consistent with the broader changes to standardize metadata handling across the codebase.
204-204
: LGTM: Dataset parameter change is consistent
The default value change from ChrononMetadataKey
to MetadataDataset
in the put
method aligns with the constructor change.
44-44
: Verify the impact of changing the default dataset value
The constructor's default dataset value has changed from ChrononMetadataKey
to MetadataDataset
. While this aligns with the refactoring, we should verify that all existing usages are updated.
✅ Verification successful
Change is Verified: Default Dataset Value Updated Successfully
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find all MetadataStore instantiations to verify they handle the dataset parameter change
# Search for MetadataStore instantiations
echo "Searching for MetadataStore instantiations..."
rg -l "new MetadataStore" | while read -r file; do
echo "=== $file ==="
rg -A 2 "new MetadataStore" "$file"
done
# Search for potential references to the old ChrononMetadataKey
echo -e "\nSearching for potential references to ChrononMetadataKey..."
rg "ChrononMetadataKey"
Length of output: 4382
spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala (1)
23-23
: LGTM: Constant renaming improves clarity
The renaming from ChrononMetadataKey
to MetadataDataset
is a good change that makes the constant's purpose clearer and more descriptive.
Also applies to: 142-142
build.sbt (2)
126-127
: Verify architectural alignment of new web dependencies
The addition of cask
and scalatags
suggests implementing web-based visualization for drift metrics. However, this project already uses Play Framework in the hub
module. Consider:
- Whether this creates unnecessary framework duplication
- If these visualizations should be integrated into the existing Play-based frontend instead
Let's verify the usage of these dependencies:
#!/bin/bash
# Search for cask/scalatags usage to understand the implementation approach
rg -l "import cask\."
rg -l "import scalatags\."
# Check if there are any existing web endpoints in the online module
rg -l "extends cask.Routes"
125-127
: Verify dependency versions for security vulnerabilities
Let's ensure the new dependencies are using secure versions.
✅ Verification successful
Dependencies are secure
No known security vulnerabilities found for the specified dependencies.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for security advisories and latest versions
gh api graphql -f query='
{
securityVulnerabilities(first: 5, ecosystem: MAVEN, package: "com.lihaoyi:cask_2.12") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}'
# Check Maven Central for latest versions
curl -s "https://search.maven.org/solrsearch/select?q=g:%22com.lihaoyi%22+AND+a:%22cask_2.12%22&rows=1&wt=json" | jq '.response.docs[0].latestVersion'
curl -s "https://search.maven.org/solrsearch/select?q=g:%22com.lihaoyi%22+AND+a:%22scalatags_2.12%22&rows=1&wt=json" | jq '.response.docs[0].latestVersion'
Length of output: 684
Script:
#!/bin/bash
# Check for security advisories for scalatags
gh api graphql -f query='
{
securityVulnerabilities(first: 5, ecosystem: MAVEN, package: "com.lihaoyi:scalatags_2.12") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}'
Length of output: 361
online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala (1)
1-11
: LGTM! Well-structured test file setup
The file follows Scala testing best practices with appropriate imports and test framework setup.
spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (3)
6-6
: LGTM: Import change aligns with TileSummary refactoring
The import statement has been updated to reflect the renamed type, maintaining consistency with the broader refactoring effort.
Line range hint 93-101
: LGTM: Type-safe summary population logic
The summary population logic maintains type safety while adapting to the new TileSummary
type. The iterator-based transformation is efficient and maintains referential transparency.
76-76
: Verify all TileRow usages for breaking change
The change from TileSummaries
to TileSummary
in the case class is a breaking change that affects serialization. This needs careful verification of:
- All code that creates or consumes TileRow instances
- Data migration strategy for existing data
✅ Verification successful
TileRow usage verified with no serialization impact found
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for TileRow usage patterns
echo "Searching for TileRow usage patterns..."
rg -l "TileRow" | while read -r file; do
echo "=== $file ==="
rg -A 5 "TileRow" "$file"
done
# Search for potential serialization code
echo "Checking for serialization related code..."
rg -l "TileRow.*(?:serialize|deserialize|fromJson|toJson)"
Length of output: 2262
online/src/main/scala/ai/chronon/online/Api.scala (2)
264-264
: Verify the metadata constant update across the codebase
The change from Constants.ChrononMetadataKey
to Constants.MetadataDataset
appears to be part of a broader standardization of metadata handling. Let's ensure this update is consistent across the codebase.
Also applies to: 275-275
49-49
:
Update comment to reflect parameter rename from afterTsMillis
to startTsMillis
The comment above the case class still references afterTsMillis
but the parameter has been renamed to startTsMillis
. This change affects the semantics of timestamp handling in scan requests.
Let's verify the impact of this breaking change:
spark/src/test/scala/ai/chronon/spark/test/ChainingFetcherTest.scala (2)
21-21
: LGTM: Import statement follows conventions
The import of MetadataDataset
from Constants
is correctly placed and follows Scala's import organization conventions.
247-247
: Verify metadata dataset creation across test suite
The change from ChrononMetadataKey
to MetadataDataset
appears to be part of a broader refactoring effort. While this change is correct, we should ensure consistent usage across all test cases.
✅ Verification successful
Metadata dataset creation across the test suite is consistent
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify consistent usage of MetadataDataset across test files
# Look for any remaining instances of ChrononMetadataKey that might have been missed
# Search for any remaining usage of ChrononMetadataKey
rg "ChrononMetadataKey" --type scala
# Search for consistent usage of MetadataDataset
rg "MetadataDataset" --type scala
Length of output: 3263
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1)
130-130
: Verify the impact of switching from afterTsMillis to startTsMillis
The change in partitioning logic from afterTsMillis
to startTsMillis
could affect query behavior. Ensure all callers have been updated to use startTsMillis
.
✅ Verification successful
Partitioning logic correctly updated to use startTsMillis
All active references to afterTsMillis
have been successfully replaced with startTsMillis
. Remaining instances are limited to comments and do not impact functionality.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for any remaining references to afterTsMillis that might need updating
# Search for afterTsMillis usage in Scala files
echo "Searching for afterTsMillis usage in Scala files..."
rg -t scala "afterTsMillis" -A 5
# Search for KVStore.GetRequest construction to verify parameter usage
echo "Searching for GetRequest construction patterns..."
ast-grep --pattern 'GetRequest($$$)' -l scala
Length of output: 4345
spark/src/test/scala/ai/chronon/spark/test/SchemaEvolutionTest.scala (2)
322-322
: LGTM: MetadataDataset usage in KV store initialization
The change from ChrononMetadataKey
to MetadataDataset
in the in-memory key-value store initialization is consistent with the refactoring pattern.
#!/bin/bash
# Description: Verify consistent initialization pattern across test files
# Expected: All test files should use MetadataDataset for KV store initialization
# Search for KV store initialization patterns in test files
rg "inMemoryKvStore\.create\(" --type scala -g "*Test.scala"
19-19
: Verify consistent usage of MetadataDataset import
The import change from ChrononMetadataKey
to MetadataDataset
appears to be part of a broader refactoring effort.
✅ Verification successful
Consistent usage of MetadataDataset confirmed
No remaining instances of ChrononMetadataKey
found. All usages appropriately updated to MetadataDataset
.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify consistent usage of MetadataDataset across the codebase
# Expected: All occurrences of ChrononMetadataKey should be replaced with MetadataDataset
# Check for any remaining usage of ChrononMetadataKey
rg "ChrononMetadataKey"
# Verify the new MetadataDataset usage pattern
rg "Constants\.MetadataDataset" -A 2
Length of output: 5962
api/thrift/api.thrift (1)
286-306
: Fix typo and standardize field naming as previously suggested.
The issues identified in previous reviews remain:
- Typo: percentilDrift should be percentileDrift
- Inconsistent naming: some fields use 'Change' while others use 'Drift'
Apply this diff to fix the issues:
struct TileDrift {
- 1: optional double percentilDrift
+ 1: optional double percentileDrift
- 3: optional double countChangePercent
- 4: optional double nullRatioChangePercent
+ 3: optional double countDrift
+ 4: optional double nullRatioDrift
- 5: optional double innerCountChangePercent
- 6: optional double innerNullCountChangePercent
+ 5: optional double innerCountDrift
+ 6: optional double innerNullCountDrift
}
spark/src/test/scala/ai/chronon/spark/test/bootstrap/DerivationTest.scala (1)
433-433
: LGTM! Verify constant renaming across the codebase.
The constant renaming from ChrononMetadataKey
to MetadataDataset
is consistent with the changes mentioned in the summary.
Let's verify that this constant renaming has been consistently applied across the codebase:
✅ Verification successful
LGTM! Constant renaming successfully verified across the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the old constant name is not used anywhere in the codebase
# and the new constant name is used consistently.
echo "Checking for any remaining usage of the old constant name..."
rg "ChrononMetadataKey"
echo "Verifying usage of the new constant name..."
rg "MetadataDataset"
Length of output: 3465
online/src/main/scala/ai/chronon/online/FetcherBase.scala (3)
Line range hint 57-61
: LGTM: Constructor parameter update looks good.
The change from ChrononMetadataKey
to MetadataDataset
as the default value for metaDataSet
parameter is consistent with the changes mentioned in the summary.
24-24
: LGTM: Import statement correctly added.
The import for MetadataDataset
from the Constants object is properly placed and aligns with the constructor parameter change.
Line range hint 57-61
: Verify test coverage for metadata handling.
Since this class is critical for data fetching and processing, please ensure that existing tests cover the metadata parameter change. This will help verify that the switch from ChrononMetadataKey
to MetadataDataset
doesn't affect data retrieval operations.
✅ Verification successful
Test Coverage Verified for Metadata Handling
Existing tests in FetcherBaseTest.scala
cover the changes to the metaDataSet
parameter, ensuring data retrieval operations are unaffected.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for test files covering FetcherBase metadata handling
# Look for test files related to FetcherBase
fd -e scala "FetcherBase.*Test.scala|Test.*FetcherBase.scala"
# Search for metadata-related test cases
rg -l "metaDataSet|ChrononMetadataKey|MetadataDataset" $(fd -e scala ".*Test.scala")
Length of output: 5872
spark/src/test/scala/ai/chronon/spark/test/FetcherTest.scala (3)
22-22
: LGTM: Import change aligns with metadata constant renaming.
The import of MetadataDataset
from Constants
is correctly placed and consistent with the refactoring mentioned in the summary.
97-98
: LGTM: Test data setup correctly uses new metadata constant.
The test data generation properly uses the new MetadataDataset
constant for creating test datasets.
625-625
: LGTM: Consistent usage of new metadata constant across test cases.
The metadata store creation consistently uses the new MetadataDataset
constant in both test scenarios, maintaining the same functionality while adhering to the updated naming convention.
Also applies to: 755-755
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (2)
41-44
: Ensure proper handling of null
values in toPutRequest
The toPutRequest
function substitutes null
values with empty arrays for keyBytes
and valueBytes
. While this prevents NullPointerException
, it may introduce issues if empty arrays are not valid inputs for the KVStore.
Verify whether keyBytes
and valueBytes
can be empty arrays. If not, consider modifying the code to handle null
values appropriately, possibly by filtering out such records or throwing an exception to prevent invalid data from being processed.
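If empty arrays turn out to be invalid inputs, one option is to make the conversion partial and drop (or count) such rows before batching. A minimal sketch, assuming the key and value bytes have already been pulled out of the row; the helper name and the skip-instead-of-throw choice are assumptions, not the current behavior:

// returns None for rows that cannot form a valid put request, so callers can flatMap
// over it and skip or instrument invalid records explicitly
def validPutBytes(keyBytes: Array[Byte], valueBytes: Array[Byte]): Option[(Array[Byte], Array[Byte])] =
  for {
    k <- Option(keyBytes) if k.nonEmpty // an empty key could never be read back
    v <- Option(valueBytes)
  } yield (k, v)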
6-22
: 🛠️ Refactor suggestion
Avoid using scala.concurrent.ExecutionContext.Implicits.global
in Spark jobs
Using the global execution context can lead to unpredictable behavior in Spark applications due to potential thread pool exhaustion or conflicts with Spark's own task scheduling.
Define a dedicated ExecutionContext
that's suitable for the workload and integrated with Spark's threading model.
Apply this diff to define a custom execution context:
-import scala.concurrent.ExecutionContext.Implicits.global
+import org.apache.spark.SparkContext
+import org.apache.spark.util.ThreadUtils
...
class SummaryUploader(summaryDF: DataFrame,
api: Api,
statsTableName: String = Constants.DriftStatsTable,
putsPerRequest: Int = 100,
requestsPerSecondPerExecutor: Int = 1000
)(implicit tu: TableUtils)
extends Serializable {
+ // Define an ExecutionContext using Spark's thread pool
+ implicit val executionContext = ExecutionContext.fromExecutorService(
+ SparkContext.getOrCreate().getExecutorMemoryStatus.keys.size match {
+ case 0 => ThreadUtils.newDaemonCachedThreadPool("SummaryUploader", 4)
+ case n => ThreadUtils.newDaemonCachedThreadPool("SummaryUploader", n * 4)
+ }
+ )
def run(): Unit = {
Ensure that you shut down the custom execution context when it's no longer needed to prevent resource leaks.
⛔ Skipped due to learnings
Learnt from: chewy-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:37-40
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Avoid using `Await.result` in production code; prefer handling `Future`s asynchronously when possible to prevent blocking.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Spark applications, when defining serializable classes, passing an implicit `ExecutionContext` parameter can cause serialization issues. In such cases, it's acceptable to use `scala.concurrent.ExecutionContext.Implicits.global`.
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:19-47
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Scala, the `grouped` method on collections returns an iterator, allowing for efficient batch processing without accumulating all records in memory.
online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala (1)
108-109
: Consider handling zero probabilities in klDivergenceTerm
In the klDivergenceTerm
method, when a > 0
and b
is zero, the result should theoretically be infinite. Currently, the method returns zero in this case. Consider whether this behavior aligns with the intended statistical interpretation and whether an adjustment is necessary.
If adjusting is needed, you might handle the case where b == 0.0
separately.
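If the zero-denominator case should register as drift rather than disappear, a common workaround is epsilon smoothing of the denominator. A rough sketch (the eps constant and function name are assumptions, not part of the existing DriftMetrics code):

// epsilon-smoothed KL term: a > 0 with b == 0 now contributes a large finite penalty
// instead of being dropped
def klTermSmoothed(a: Double, b: Double, eps: Double = 1e-9): Double =
  if (a <= 0.0) 0.0
  else a * math.log(a / math.max(b, eps))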
api/src/main/scala/ai/chronon/api/Extensions.scala (1)
915-921: Prevent NullPointerException by checking join.derivations in renamedColumns. Accessing join.derivations without a null check can lead to a NullPointerException if it's null. Ensure you safely handle this possibility.

Adjust the code to include a null check:
-Option(join.derivations)
- .map {
- _.toScala.renameOnlyDerivations.map(_.expression).toSet
- }
- .getOrElse(Set.empty)
+Option(join.derivations)
+ .map { derivations =>
+ derivations.toScala.renameOnlyDerivations.map(_.expression).toSet
+ }
+ .getOrElse(Set.empty)
Likely invalid or redundant comment.
spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (2)
7-8
: LGTM: Necessary imports added
The imports for DriftMetric
and DriftSpec
are appropriately added for the drift analysis enhancements.
134-138
: LGTM: driftSpec
integrated into metaData
The driftSpec
is correctly added to the metaData
in the join operation, enabling drift analysis configurations.
val DriftStatsTable: String = "drift_statistics"

val DefaultDriftTileSize: Window = new Window(30, TimeUnit.MINUTES)
Add missing Window import and document the tile size choice
Two issues to address:
- The Window class import is missing
- The 30-minute default tile size needs documentation explaining the rationale
Add the following import at the top of the file:
+ import ai.chronon.core.Window
Consider adding a comment explaining why 30 minutes was chosen as the default tile size for drift calculations.
Committable suggestion skipped: line range outside the PR's diff.
val driftStore = new DriftStore(api.genKvStore, namespace)
val tileKeys = driftStore.tileKeysForJoin(join)
println(tileKeys)
🛠️ Refactor suggestion
Add assertions for DriftStore results
The test prints tile keys but doesn't validate them. Consider adding assertions to verify the expected behavior.
val driftStore = new DriftStore(api.genKvStore, namespace)
val tileKeys = driftStore.tileKeysForJoin(join)
-println(tileKeys)
+// Add assertions
+assert(tileKeys.nonEmpty, "Tile keys should not be empty")
+assert(tileKeys.forall(_.contains(namespace)), s"All tile keys should contain namespace: $namespace")
📝 Committable suggestion
val driftStore = new DriftStore(api.genKvStore, namespace)
val tileKeys = driftStore.tileKeysForJoin(join)
// Add assertions
assert(tileKeys.nonEmpty, "Tile keys should not be empty")
assert(tileKeys.forall(_.contains(namespace)), s"All tile keys should contain namespace: $namespace")
val kvStoreFunc: () => KVStore = () => {
  // cannot reuse the variable - or serialization error
  val result = InMemoryKvStore.build("drift_test", () => null)
  result
}

val api = new MockApi(kvStoreFunc, namespace)
val kvStore = api.genKvStore
kvStore.create(Constants.MetadataDataset)
kvStore.create(Constants.DriftStatsTable)
🛠️ Refactor suggestion
Add cleanup for KVStore resources
The test creates KVStore tables but doesn't clean them up after the test. Consider adding cleanup in a finally
block or using a test fixture.
val kvStoreFunc: () => KVStore = () => {
val result = InMemoryKvStore.build("drift_test", () => null)
+ // Register cleanup hook
+ sys.addShutdownHook {
+ try {
+ result.drop(Constants.MetadataDataset)
+ result.drop(Constants.DriftStatsTable)
+ } catch {
+ case _: Exception => // Ignore cleanup errors
+ }
+ }
result
}
Committable suggestion skipped: line range outside the PR's diff.
@@ -49,7 +49,7 @@ jobs:
          path: api/py/htmlcov

   other_spark_tests:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-8_cores-32_gb
Critical: Invalid runner label will break CI
The runner label ubuntu-8_cores-32_gb
is not a standard GitHub-hosted runner. This will cause the CI workflow to fail.
For GitHub-hosted runners with more resources, use:
- ubuntu-latest-8-cores for 8 core runner
- ubuntu-latest-16-cores for 16 core runner
Apply this diff to use a valid GitHub-hosted runner:
- runs-on: ubuntu-8_cores-32_gb
+ runs-on: ubuntu-latest-8-cores
If you're using self-hosted runners:
- Ensure the runner label is properly configured
- Add an actionlint.yaml to specify custom runner labels
📝 Committable suggestion
runs-on: ubuntu-latest-8-cores
🧰 Tools
🪛 actionlint
52-52: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
Map(
  DriftMetric.JENSEN_SHANNON -> calculateDrift(DriftMetric.JENSEN_SHANNON),
  DriftMetric.PSI -> calculateDrift(DriftMetric.PSI),
  DriftMetric.HELLINGER -> calculateDrift(DriftMetric.HELLINGER)
)
Add tests for PSI metric.
The PSI (Population Stability Index) metric is calculated but not tested in any test cases.
Add assertions for PSI in test cases similar to JSD and Hellinger.
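A possible shape for the missing assertions, mirroring how the JSD and Hellinger results are checked elsewhere; the 0.1 bound is an assumed low-drift threshold for illustration, not the configured Thresholds value:

// assumes `drifts` is the Map built above via calculateDrift(...)
val psi = drifts(DriftMetric.PSI)
assert(psi >= 0.0, "PSI should be non-negative")
assert(psi < 0.1, s"PSI for near-identical distributions should be small, got $psi")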
def derivedColumns: Array[String] =
  Option(join.getDerivations) match {
    case Some(derivations) =>
      derivations.toScala.flatMap { derivation =>
        derivation.getName match {
          case "*" => None
          case _ => Some(derivation.getName)
        }
      }
        .flatMap(_.toSet))
        .getOrElse(Seq.empty)
      }.toArray
    case None => Array.empty
  }
Handle potential null names in derivedColumns. In the derivedColumns method, derivation.getName could be null, leading to a NullPointerException. Ensure that null names are appropriately handled.

Modify the code to filter out null names:
-derivations.toScala.flatMap { derivation =>
- derivation.getName match {
- case "*" => None
- case _ => Some(derivation.getName)
- }
-}.toArray
+derivations.toScala.flatMap { derivation =>
+ Option(derivation.getName) match {
+ case Some("*") => None
+ case Some(name) => Some(name)
+ case None => None
+ }
+}.toArray
📝 Committable suggestion
def derivedColumns: Array[String] =
  Option(join.getDerivations) match {
    case Some(derivations) =>
      derivations.toScala.flatMap { derivation =>
        Option(derivation.getName) match {
          case Some("*") => None
          case Some(name) => Some(name)
          case None => None
        }
      }.toArray
    case None => Array.empty
  }
def outputValueColumnsMap: Map[String, Array[String]] = {
  val preDeriveCols = (joinPartColumns ++ externalPartColumns)
  val preDerivedWithoutRenamed = preDeriveCols.mapValues(_.filterNot(renamedColumns.contains))
  val derivedColumns: Array[String] = Option(join.derivations) match {
    case Some(derivations) => derivations.toScala.map { _.getName }.filter(_ == "*").toArray
    case None => Array.empty
  }
  preDerivedWithoutRenamed ++ Map("derivations" -> derivedColumns)
}
🛠️ Refactor suggestion
Avoid using mapValues to prevent lazy evaluation issues. The mapValues method in Scala returns a view, which can lead to unexpected behavior if the underlying map changes. It's safer to use map to create a new map instead.

Replace mapValues with map:
-val preDerivedWithoutRenamed = preDeriveCols.mapValues(_.filterNot(renamedColumns.contains))
+val preDerivedWithoutRenamed = preDeriveCols.map { case (k, v) =>
+ k -> v.filterNot(renamedColumns.contains)
+}
📝 Committable suggestion
def outputValueColumnsMap: Map[String, Array[String]] = {
  val preDeriveCols = (joinPartColumns ++ externalPartColumns)
  val preDerivedWithoutRenamed = preDeriveCols.map { case (k, v) =>
    k -> v.filterNot(renamedColumns.contains)
  }
  val derivedColumns: Array[String] = Option(join.derivations) match {
    case Some(derivations) => derivations.toScala.map { _.getName }.filter(_ == "*").toArray
    case None => Array.empty
  }
  preDerivedWithoutRenamed ++ Map("derivations" -> derivedColumns)
}
val joinPartKeys = join.joinParts.toScala.flatMap(_.groupBy.keyColumns.toScala).toSet
val externalKeys = join.onlineExternalParts.toScala.flatMap(_.source.keyNames).toSet
val bootstrapKeys = join.bootstrapParts.toScala.flatMap(_.keyColumns.toScala).toSet
(joinPartKeys ++ externalKeys ++ bootstrapKeys).toArray
Ensure safe access to collections in keyColumns to prevent NullPointerException. Accessing join.onlineExternalParts or join.bootstrapParts without null checks can lead to NullPointerException if they are null. Safely handle these potential null values.
Modify the code to include null checks:
-val externalKeys = join.onlineExternalParts.toScala.flatMap(_.source.keyNames).toSet
-val bootstrapKeys = join.bootstrapParts.toScala.flatMap(_.keyColumns.toScala).toSet
+val externalKeys = Option(join.onlineExternalParts)
+ .map(_.toScala.flatMap(_.source.keyNames).toSet)
+ .getOrElse(Set.empty)
+val bootstrapKeys = Option(join.bootstrapParts)
+ .map(_.toScala.flatMap(_.keyColumns.toScala).toSet)
+ .getOrElse(Set.empty)
📝 Committable suggestion
val joinPartKeys = join.joinParts.toScala.flatMap(_.groupBy.keyColumns.toScala).toSet
val externalKeys = Option(join.onlineExternalParts)
  .map(_.toScala.flatMap(_.source.keyNames).toSet)
  .getOrElse(Set.empty)
val bootstrapKeys = Option(join.bootstrapParts)
  .map(_.toScala.flatMap(_.keyColumns.toScala).toSet)
  .getOrElse(Set.empty)
(joinPartKeys ++ externalKeys ++ bootstrapKeys).toArray
def externalPartColumns: Map[String, Array[String]] =
  Option(join.onlineExternalParts) match {
    case Some(parts) =>
      parts.toScala.map { part =>
        val keys = part.source.getKeySchema.params.toScala.map(_.name)
        val values = part.source.getValueSchema.params.toScala.map(_.name)
        part.fullName -> (keys ++ values).toArray
      }.toMap
    case None => Map.empty
  }
Ensure safe handling of potential null values in externalPartColumns. When accessing part.source.getKeySchema and part.source.getValueSchema, there is a possibility of encountering null values, which could lead to a NullPointerException. It's important to safely handle these potential nulls.

Consider modifying the code to safely handle null values:
-parts.toScala.map { part =>
- val keys = part.source.getKeySchema.params.toScala.map(_.name)
- val values = part.source.getValueSchema.params.toScala.map(_.name)
- part.fullName -> (keys ++ values).toArray
+parts.toScala.map { part =>
+ val keys = Option(part.source.getKeySchema)
+ .map(_.params.toScala.map(_.name))
+ .getOrElse(Seq.empty)
+ val values = Option(part.source.getValueSchema)
+ .map(_.params.toScala.map(_.name))
+ .getOrElse(Seq.empty)
+ part.fullName -> (keys ++ values).toArray
📝 Committable suggestion
def externalPartColumns: Map[String, Array[String]] =
  Option(join.onlineExternalParts) match {
    case Some(parts) =>
      parts.toScala.map { part =>
        val keys = Option(part.source.getKeySchema)
          .map(_.params.toScala.map(_.name))
          .getOrElse(Seq.empty)
        val values = Option(part.source.getValueSchema)
          .map(_.params.toScala.map(_.name))
          .getOrElse(Seq.empty)
        part.fullName -> (keys ++ values).toArray
      }.toMap
    case None => Map.empty
  }
val maxGap = totalDays - windowLengths.sum
val gapDays = RandomUtils.between(2, maxGap)
Handle potential negative maxGap to prevent exceptions. There's a risk that maxGap could be negative if totalDays is less than the sum of windowLengths. This would cause RandomUtils.between(2, maxGap) to throw an exception.

Consider adding a check to ensure maxGap is sufficient:
val windowLengths = List.fill(numWindows)(RandomUtils.between(3, 8))
val maxGap = totalDays - windowLengths.sum
+if (maxGap < 2) {
+ throw new IllegalArgumentException("Insufficient total days to generate non-overlapping windows with the given window lengths.")
+}
val gapDays = RandomUtils.between(2, maxGap)
Committable suggestion skipped: line range outside the PR's diff.
// note: the interval convention is left inclusive, right exclusive
// please see the test case to understand the algorithm better
// at a high level we are converting equally percentile points into histograms / distributions
@nikhil-zlai - dropped some comments based on todos etc. Feel free to bounce over
@@ -267,7 +267,8 @@ object Builders {
       samplePercent: Double = 100,
       consistencySamplePercent: Double = 5,
       tableProperties: Map[String, String] = Map.empty,
-      historicalBackill: Boolean = true
+      historicalBackill: Boolean = true,
nit: s/backill/backfill
done
@@ -37,7 +37,7 @@ object Constants {
   val ChrononDynamicTable = "chronon_dynamic_table"
   val ChrononOOCTable: String = "chronon_ooc_table"
   val ChrononLogTable: String = "chronon_log_table"
-  val ChrononMetadataKey = "ZIPLINE_METADATA"
+  val MetadataDataset = "chronon_metadata"
make uppercase for consistency?
done
@@ -62,5 +62,9 @@ object Constants {
   val LabelViewPropertyFeatureTable: String = "feature_table"
   val LabelViewPropertyKeyLabelTable: String = "label_table"
   val ChrononRunDs: String = "CHRONON_RUN_DS"

+  val TiledSummaryDataset: String = "tiled_summaries"
uppercase?
done
import scalatags.Text.all._
import scalatags.Text.tags2.title

// generates html / js code to serve a tabbed board on the network port
revert?
done
var result = 0.0

// debug only, remove before merging
val deltas = Array.ofDim[Double](aNormalized.length)
remove?
done
val delta = func(ai, bi)

// debug only remove before merging
deltas.update(i, delta)
here too - remove?
done - good catch!
_ match {
  case Success(responseContext) => Some(responseContext)
  // TODO instrument failures
  case Failure(exception) => exception.printStackTrace(); None
should we address in current PR? I can also pick up in my follow up
follow up?
works for me
getJoinConf(join).map { joinConf =>
  // TODO-explore: we might be over fetching if lookBack is much larger than end - start
  // if we can merge tile summary infos, we can call this twice with different ranges and merge
  getSummaries(joinConf.join, Some(startMs - lookBack.millis), Some(endMs)).map { tileSummaryInfos =>
should we also subtract lookback.millis from endMs? IIUC we want the full start to endMs time range shifted by lookback right?
So let's say we want drift between last week and today: we need summaries ranging from two weeks ago to today.
I was thinking we'd need two potentially disjointed timeframes. For example, if the user is looking at say the month of 2024-12-01 to 2025-01-01 and they ask for an offset of 3m. Don't we need one range of: 2024-09-01 to 2024-10-01 and the other range to be 2024-12-01 to 2025-01-01?
I think what you're getting at also works - it sounds like a lookback window to compute drift over, though the semantics / UX might be different. A case where the disjoint approach might showcase things better: say you want to see how payment features drifted over BFCM week last year vs. this year - you could use a 1- or 2-week time range with a 1-year offset, so we only compare the last 2 weeks of Nov last year with the last 2 weeks of Nov this year. That might make issues pop more explicitly than drift over the whole of the last year ending in Nov of this year?
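To make the two framings concrete, a rough sketch (illustrative only - these names are not the actual fetch API):

object FetchRangesSketch {
  // returns (lookback framing, disjoint framing) as lists of (startMs, endMs) fetch ranges
  def fetchRanges(startMs: Long, endMs: Long, offsetMs: Long): (Seq[(Long, Long)], Seq[(Long, Long)]) = {
    // lookback framing: one fetch covering [startMs - offsetMs, endMs]
    val lookback = Seq((startMs - offsetMs, endMs))
    // disjoint framing: the requested window plus the same-length window shifted back by the offset
    val disjoint = Seq((startMs - offsetMs, endMs - offsetMs), (startMs, endMs))
    (lookback, disjoint)
  }
}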
4cc9851 to 2eaced3
Actionable comments posted: 11
🧹 Outside diff range and nitpick comments (28)
hub/conf/routes (4)
6-7: Consider making the algorithm parameter optional and add documentation.
The new model timeseries endpoint looks good, but consider:
- Making algorithm an optional parameter like in the join/feature endpoints
- Adding route comments to document the purpose of each parameter

-# model prediction & model drift
-GET /api/v1/model/:id/timeseries controllers.TimeSeriesController.fetchModel(id: String, startTs: Long, endTs: Long, offset: String, algorithm: String)
+# Fetch model prediction & drift metrics
+# Parameters:
+#   id: Model identifier
+#   startTs: Start timestamp for the time window
+#   endTs: End timestamp for the time window
+#   offset: Time offset for the window
+#   algorithm: (Optional) Drift detection algorithm
+GET /api/v1/model/:id/timeseries controllers.TimeSeriesController.fetchModel(id: String, startTs: Long, endTs: Long, offset: String, algorithm: Option[String])
9-11: Enhance the endpoint documentation.
The join timeseries endpoint is well-structured, but the documentation could be more comprehensive.

-# all timeseries of a given join id
-# when metricType == "drift" - will return time series list of drift values
+# Fetch timeseries data for a specific join
+# Parameters:
+#   name: Join identifier
+#   startTs: Start timestamp for the time window
+#   endTs: End timestamp for the time window
+#   metricType: Type of metric to fetch (e.g., "drift" returns time series of drift values)
+#   metrics: Specific metrics to include
+#   offset: (Optional) Time offset for the window
+#   algorithm: (Optional) Drift detection algorithm
13-15: Clarify the relationship between joins and features.
The feature timeseries endpoint looks good, but:
- The comment "join -> seq(feature)" needs elaboration
- The purpose of the granularity parameter should be documented

-# join -> seq(feature)
-# when metricType == "drift" - will return time series list of drift values
+# Fetch timeseries data for a specific feature
+# Note: Features are components of joins, where one join contains multiple features
+# Parameters:
+#   name: Feature identifier
+#   startTs: Start timestamp for the time window
+#   endTs: End timestamp for the time window
+#   metricType: Type of metric to fetch (e.g., "drift" returns time series of drift values)
+#   metrics: Specific metrics to include
+#   granularity: Time granularity for the data points
+#   offset: (Optional) Time offset for the window
+#   algorithm: (Optional) Drift detection algorithm
18-26: Review the architectural approach for fine-grained endpoints.
The TODO section suggests breaking down the feature endpoint into more specific ones, but there are some concerns:
- All commented endpoints currently map to the same controller method with identical parameters
- Each specialized endpoint (percentiles, values, drift, etc.) likely needs different parameters and return types
Consider:
- Creating dedicated controller methods for each endpoint type
- Defining specific parameter sets appropriate for each operation
- Creating a tracking issue for this architectural change
Would you like me to help create a GitHub issue to track this architectural task?
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (1)
39-45: Consider consolidating schema validation
The schema validation could be more concise and maintainable.
Consider this approach:
private case class ColumnSpec(name: String, expectedType: DataType)

private val requiredColumns = Seq(
  ColumnSpec("keyBytes", BinaryType),
  ColumnSpec("valueBytes", BinaryType),
  ColumnSpec("timestamp", LongType)
)

requiredColumns.foreach { spec =>
  val idx = summaryDF.schema.fieldIndex(spec.name)
  require(summaryDF.schema(idx).dataType == spec.expectedType, s"${spec.name} must be ${spec.expectedType}")
}
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (4)
Line range hint 34-41: Add cleanup for SparkSession and database resources
Consider adding cleanup in an afterAll hook to ensure proper resource cleanup after tests complete.
class DriftTest extends AnyFlatSpec with Matchers {
+  override def afterAll(): Unit = {
+    try {
+      tableUtils.dropDatabase(namespace, cascade = true)
+      spark.stop()
+    } finally {
+      super.afterAll()
+    }
+  }
Line range hint 43-68: Add test assertions and extract constants
The test lacks assertions to verify the behavior and uses hardcoded values.
+ private val DefaultRowCount = 100000 + private val DefaultPartitions = 10 + "TestDataFrameGenerator" should "create a DataFrame with various column types including nulls" in { try { - val df = generateDataFrame(100000, 10, namespace) + val df = generateDataFrame(DefaultRowCount, DefaultPartitions, namespace) // Check if DataFrame is created - df.isEmpty shouldBe false + df.count() shouldBe DefaultRowCount df.show(10, truncate = false) val summarizer = new Summarizer("drift_test_basic") val (result, summaryExprs) = summarizer.computeSummaryDf(df) + result should not be null + summaryExprs should not be empty result.show()
Line range hint 70-81: Consider using a proper logging framework
Instead of using println with color formatting, consider using a proper logging framework like Logback or Log4j for better test output control and consistency across environments.
+ private val logger = org.slf4j.LoggerFactory.getLogger(getClass)
+
  def showTable(name: String)(implicit tableUtils: TableUtils): Unit = {
-   println(s"Showing table $name".yellow)
+   logger.info(s"Showing table: $name")
    val df = tableUtils.loadTable(name)
    val maxColNameLength = df.schema.fieldNames.map(_.length).max
    def pad(s: String): String = s.padTo(maxColNameLength, ' ')
    df.schema.fields.foreach { f =>
-     println(s" ${pad(f.name)} : ${f.dataType.typeName}".yellow)
+     logger.info(s" ${pad(f.name)} : ${f.dataType.typeName}")
    }
85-89: Extract test configuration to constants
Hardcoded values should be extracted to constants for better maintainability and configurability.
+ private object TestConfig {
+   val SampleSize = 100000
+   val StartDate = "2023-01-01"
+   val EndDate = "2023-01-30"
+ }
+
  "end_to_end" should "fetch prepare anomalous data, summarize, upload and fetch without failures" in {
    val prepareData = PrepareData(namespace)
    val join = prepareData.generateAnomalousFraudJoin
-   val df = prepareData.generateFraudSampleData(100000, "2023-01-01", "2023-01-30", join.metaData.loggedTable)
+   val df = prepareData.generateFraudSampleData(
+     TestConfig.SampleSize,
+     TestConfig.StartDate,
+     TestConfig.EndDate,
+     join.metaData.loggedTable
+   )
online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (2)
16-19: Remove outdated TODO comment
The TODO comment about TileDrift => TileDriftSeries can be removed since the implementation is already present in this file.
141-156: Improve readability of long method chains
The long method chains could be more readable with proper formatting. Consider breaking them into multiple lines with proper indentation.
Example formatting:
- .setNullRatioChangePercentSeries(doubleIterator(_.isSetNullRatioChangePercent, _.getNullRatioChangePercent) |> collectDoubles)
+ .setNullRatioChangePercentSeries(
+   doubleIterator(
+     _.isSetNullRatioChangePercent,
+     _.getNullRatioChangePercent
+   ) |> collectDoubles
+ )
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (5)
35-35: Enhance error message for better debugging.
Consider providing more context in the error message to help diagnose configuration issues.
-require(tileSize.nonEmpty, s"Drift tile size not set for join $joinName")
+require(tileSize.nonEmpty, s"Drift tile size configuration missing for join '$joinName'. Please ensure drift_tile_size is set in the join metadata.")
61-65: Add validation for empty summaries array.
The toDriftSeries method should validate the input array to handle edge cases gracefully.
def toDriftSeries(driftMetric: DriftMetric, lookBack: Window, startMs: Long): TileDriftSeries = {
+ require(summaries.nonEmpty, s"Cannot create drift series for empty summaries array with key: ${key}")
  val driftsArray = TileDriftCalculator.toTileDrifts(summaries, driftMetric, startMs, lookBack)
  val result = PivotUtils.pivot(driftsArray)
  result.setKey(key)
}
114-114: Address TODO for failure instrumentation.
The TODO comment indicates missing failure metrics. Consider adding metrics to track:
- Failure rates
- Error types
- Latency distribution
Would you like me to help implement the failure instrumentation using a metrics framework?
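For illustration, a minimal sketch of what that instrumentation could look like (only slf4j logging is assumed to exist; the commented-out counter is a hypothetical metrics hook):

object FetchInstrumentation {
  import org.slf4j.LoggerFactory
  import scala.util.{Failure, Success, Try}

  private val logger = LoggerFactory.getLogger(getClass)

  def toOption[T](result: Try[T]): Option[T] = result match {
    case Success(value) => Some(value)
    case Failure(exception) =>
      // structured log instead of printStackTrace, so failures show up in log aggregation
      logger.error("drift summary fetch failed", exception)
      // failureCounter.increment()  // hypothetical metrics hook
      None
  }
}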
138-140: Consider optimizing data fetching strategy.
The TODO comment raises a valid concern about over-fetching. Consider implementing a strategy to:
- Split the fetch into two ranges
- Merge the results efficiently
- Cache frequently accessed ranges
Would you like help designing an optimized fetching strategy?
165-166: Document the compact protocol bug.
The comment warns about a bug but lacks details. Please document:
- The specific issue with compact protocol
- Any related bug tickets or references
- Potential future migration path
online/src/main/scala/ai/chronon/online/stats/Display.scala (2)
12-25: Consider implementing a consistent color palette
The case class hierarchy is well-designed, but the random color generation in chart rendering (used later) could lead to readability issues or poor color contrast. Consider implementing a predefined color palette for better visualization consistency.
Example implementation:
private val ColorPalette = Array(
  "rgb(66, 133, 244)", // Google Blue
  "rgb(219, 68, 55)",  // Google Red
  "rgb(244, 180, 0)",  // Google Yellow
  "rgb(15, 157, 88)",  // Google Green
  // Add more colors as needed
)
// Then use ColorPalette(index % ColorPalette.length) instead of random colors
1-205: Consider adding error handling and monitoring capabilities
As this is part of the drift metrics implementation, consider these architectural improvements:
- Add error handling for failed chart rendering
- Implement monitoring endpoints for server health
- Consider making the visualization dynamic for real-time drift metrics updates
Would you like me to provide a detailed implementation plan for these improvements?
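As a starting point, a minimal sketch of a health endpoint using cask (route name and port here are illustrative, not part of the current Display.scala):

object HealthServer extends cask.MainRoutes {
  override def port: Int = 8081

  // simple liveness probe for the board server
  @cask.get("/health")
  def health(): String = "ok"

  initialize()
}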
spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (2)
Line range hint
142-156
: Improve robustness of async handling inputStreamingNew
The current implementation has several concerns:
- The hardcoded sleep duration of 5000ms may be insufficient for larger datasets
- No error handling for async operations
- Vague documentation about the async behavior
Consider:
- Making the sleep duration configurable
- Adding proper error handling
- Improving documentation about the async behavior
def putStreamingNew(originalGroupByConf: api.GroupBy, ds: String, namespace: String, kvStoreFunc: () => KVStore, - debug: Boolean)(implicit session: SparkSession): Unit = { + debug: Boolean, + asyncWaitMs: Long = 5000)(implicit session: SparkSession): Unit = { implicit val mockApi = new MockApi(kvStoreFunc, namespace) val groupByConf = originalGroupByConf.deepCopy() val source = groupByConf.streamingSource.get mutateTopicWithDs(source, ds) val groupByStreaming = new JoinSourceRunner(groupByConf, Map.empty, debug = debug, lagMillis = 0) val query = groupByStreaming.chainedStreamingQuery.trigger(Trigger.Once()).start() - // drain work scheduled as futures over the executioncontext - Thread.sleep(5000) - // there is async stuff under the hood of chained streaming query - query.awaitTermination() + try { + // Wait for async operations in the chained streaming query to complete + // These operations include future completions in the execution context + Thread.sleep(asyncWaitMs) + query.awaitTermination() + } catch { + case e: Exception => + query.stop() + throw new RuntimeException(s"Failed to process streaming query: ${e.getMessage}", e) + } }Also, consider adding more context to the TODO comment about deprecating
putStreaming
:-// TODO - deprecate putStreaming +// TODO: Deprecate putStreaming in favor of putStreamingNew which provides better support for join sources +// Timeline: Remove in next major version
Line range hint
1-187
: Enhance code documentation and consistencyThe test utilities would benefit from:
- ScalaDoc comments for public methods explaining their purpose and usage in tests
- Consistent error handling patterns across streaming operations
- Clear documentation of test setup requirements
Consider adding ScalaDoc comments for key methods, for example:
/** Processes streaming data with join source support. * * @param originalGroupByConf The group by configuration to process * @param ds The dataset identifier * @param namespace The namespace for the mock API * @param kvStoreFunc Function to create KVStore instances * @param debug Enable debug logging * @param session Implicit SparkSession */ def putStreamingNew(...)Also consider:
- Creating a common error handling trait/interface for streaming operations
- Adding integration tests for the different streaming scenarios
- Documenting the test data requirements and setup in the README
spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (2)
111-137
: Consider extracting common conversion patternsThe metric population logic has repeated patterns for converting Scala collections to Java collections. Consider extracting these conversions into helper methods for better maintainability.
+ private object Conversions { + def toJavaDoubles(seq: Seq[Double]): java.util.List[java.lang.Double] = + seq.map(java.lang.Double.valueOf).toJava + + def toJavaIntegers(seq: Seq[Int]): java.util.List[java.lang.Integer] = + seq.map(java.lang.Integer.valueOf).toJava + } case MetricName.percentiles => (row: Row, summaries: TileSummary) => - summaries.setPercentiles(row.getSeq[Double](index).map(lang.Double.valueOf).toJava) + summaries.setPercentiles(Conversions.toJavaDoubles(row.getSeq[Double](index))) case MetricName.lengthPercentiles => (row: Row, summaries: TileSummary) => - summaries.setLengthPercentiles(row.getSeq[Int](index).map(lang.Integer.valueOf).toJava) + summaries.setLengthPercentiles(Conversions.toJavaIntegers(row.getSeq[Int](index)))
Line range hint
173-174
: Consider addressing TODOs for map key cardinalityThere are TODO comments about handling map key cardinality and heavy hitters. Since this PR is about drift metrics, would you like me to create GitHub issues to track these improvements?
spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (1)
Line range hint
1-324
: Consider enhancing test coverage and documentationAs this is a WIP PR implementing drift metrics functionality:
The code handles critical time series data processing and serialization. Consider adding:
- Unit tests for edge cases in time window calculations
- Integration tests for the entire drift metrics pipeline
- Performance benchmarks for large datasets
Documentation improvements needed:
- Add scaladoc comments explaining the drift metrics algorithm
- Document the expected format of input/output data
- Include examples of typical usage patterns
Would you like me to help generate:
- Unit test templates for the
Summarizer
andSummaryPacker
classes?- Documentation templates with usage examples?
spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (2)
132-138
: Consider enhancing the DriftSpec configurationThe current configuration has some potential improvements:
- Consider using Scala collections instead of Java's ArrayList
- The single 30-minute lookback window might be limiting for drift detection
- The hard-coded values would benefit from named constants or configuration parameters
- val driftSpec = new DriftSpec() - driftSpec.setTileSize(new Window(30, TimeUnit.MINUTES)) - driftSpec.setDriftMetric(DriftMetric.JENSEN_SHANNON) - val windows = new java.util.ArrayList[Window]() - windows.add(new Window(30, TimeUnit.MINUTES)) - driftSpec.setLookbackWindows(windows) + val DefaultTileSize = 30 + val DefaultLookbackWindows = Seq( + new Window(30, TimeUnit.MINUTES), + new Window(1, TimeUnit.HOURS), + new Window(4, TimeUnit.HOURS) + ) + + val driftSpec = new DriftSpec() + .setTileSize(new Window(DefaultTileSize, TimeUnit.MINUTES)) + .setDriftMetric(DriftMetric.JENSEN_SHANNON) + .setLookbackWindows(DefaultLookbackWindows.asJava)
299-311
: Enhance schema validation error messageWhile the validation is thorough, the error message could be more descriptive.
- require(matching == expected, s"Schema mismatch: expected $expected, got $actual. Missing: $missing, Extra: $extra") + require( + matching == expected, + s"""Schema validation failed: + |Expected columns: ${expected.toSeq.sorted.mkString(", ")} + |Actual columns: ${actual.toSeq.sorted.mkString(", ")} + |Missing columns: ${missing.toSeq.sorted.mkString(", ")} + |Extra columns: ${extra.toSeq.sorted.mkString(", ")} + |""".stripMargin + )api/thrift/api.thrift (3)
Line range hint
253-273
: Add comprehensive documentation for TileSummary and TileSeriesKey structs.The structures would benefit from detailed documentation explaining:
- The purpose and use cases of each field
- The relationship between these structures
- The meaning of "inner" counts vs regular counts
- The significance of percentiles and histograms in the context of drift detection
+/** + * Statistical summary of a data tile, containing various metrics used for drift detection. + * A tile represents a snapshot of data within a specific time window. + */ struct TileSummary { + // Percentile distribution of numeric values + // Frequency distribution of categorical values + // Total number of records + // Number of null values + // Metrics specific to container types (lists, maps, etc.) + // Metrics specific to string values } +/** + * Unique identifier for a tile series, combining column, slice, and grouping information + */ struct TileSeriesKey {
275-292
: Consider restructuring TileSummarySeries for better maintainability.The current structure duplicates the fields from TileSummary. Consider using a list of TileSummary objects instead, which would:
- Reduce code duplication
- Make it easier to maintain (changes to TileSummary automatically reflect here)
- Improve code readability
// array of tuples of (TileSummary, timestamp) ==(pivot)==> TileSummarySeries struct TileSummarySeries { - 1: optional list<list<double>> percentiles - 2: optional map<string, list<i64>> histogram - 3: optional list<i64> count - 4: optional list<i64> nullCount - 5: optional list<i64> innerCount - 6: optional list<i64> innerNullCount - 7: optional list<list<i32>> lengthPercentiles - 8: optional list<list<i32>> stringLengthPercentiles + 1: optional list<TileSummary> summaries }
344-356
: Enhance documentation for DriftSpec configuration.The DriftSpec structure would benefit from more detailed documentation explaining:
- The relationship between
tileSize
andlookbackWindows
- The impact of different drift metrics and their thresholds
- Best practices for setting cardinality hints
struct DriftSpec { + /** + * Additional dimensions for data summarization beyond column & slice. + * Currently limited to one slice. + */ + /** + * Custom expressions for monitoring specific scenarios: + * - Conditional metrics: e.g., "IF(merchant = 'shopify', txn_amount, NULL)" + * - Derived ratios: e.g., "likes_over_dislikes = IF(dislikes > likes, 1, 0)" + */ + /** + * Override automatic cardinality detection for specific columns. + * Useful when the training data doesn't reflect the production cardinality. + */ + /** + * Size of the time window for computing tile summaries. + */ + /** + * Windows for comparing current tile summary with historical summaries. + * Multiple windows allow detection of both short-term and long-term drift. + */ + /** + * Default drift metric with industry-standard thresholds: + * - Moderate drift: 0.05 - 0.1 + * - Severe drift: > 0.1 + */ }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (47)
- .github/workflows/test_scala_and_python.yaml (2 hunks)
- api/src/main/scala/ai/chronon/api/Builders.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/Constants.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/Extensions.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (2 hunks)
- api/thrift/api.thrift (4 hunks)
- build.sbt (1 hunks)
- cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2 hunks)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1 hunks)
- hub/app/controllers/TimeSeriesController.scala (4 hunks)
- hub/app/model/Model.scala (1 hunks)
- hub/app/store/DynamoDBMonitoringStore.scala (1 hunks)
- hub/conf/routes (1 hunks)
- online/src/main/scala/ai/chronon/online/Api.scala (3 hunks)
- online/src/main/scala/ai/chronon/online/Fetcher.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/FetcherBase.scala (2 hunks)
- online/src/main/scala/ai/chronon/online/MetadataStore.scala (3 hunks)
- online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/Display.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/Driver.scala (0 hunks)
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/SummaryJob.scala (0 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala (0 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (4 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/ChainingFetcherTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/FetcherTest.scala (5 hunks)
- spark/src/test/scala/ai/chronon/spark/test/InMemoryKvStore.scala (3 hunks)
- spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala (0 hunks)
- spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/SchemaEvolutionTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/StreamingTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/DerivationTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (3 hunks)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (11 hunks)
💤 Files with no reviewable changes (4)
- spark/src/main/scala/ai/chronon/spark/Driver.scala
- spark/src/main/scala/ai/chronon/spark/stats/SummaryJob.scala
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala
- spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala
🚧 Files skipped from review as they are similar to previous changes (28)
- api/src/main/scala/ai/chronon/api/Builders.scala
- api/src/main/scala/ai/chronon/api/Constants.scala
- api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala
- build.sbt
- cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala
- hub/app/model/Model.scala
- hub/app/store/DynamoDBMonitoringStore.scala
- online/src/main/scala/ai/chronon/online/Api.scala
- online/src/main/scala/ai/chronon/online/Fetcher.scala
- online/src/main/scala/ai/chronon/online/FetcherBase.scala
- online/src/main/scala/ai/chronon/online/MetadataStore.scala
- online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala
- online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala
- online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala
- online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala
- online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala
- spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
- spark/src/test/scala/ai/chronon/spark/test/ChainingFetcherTest.scala
- spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala
- spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala
- spark/src/test/scala/ai/chronon/spark/test/FetcherTest.scala
- spark/src/test/scala/ai/chronon/spark/test/InMemoryKvStore.scala
- spark/src/test/scala/ai/chronon/spark/test/SchemaEvolutionTest.scala
- spark/src/test/scala/ai/chronon/spark/test/StreamingTest.scala
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/DerivationTest.scala
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala
🧰 Additional context used
📓 Learnings (2)
spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (1)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#17
File: spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala:76-102
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In performance-critical sections of the codebase, prefer mutable data structures over immutable ones for better performance.
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (3)
Learnt from: chewy-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:37-40
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Avoid using `Await.result` in production code; prefer handling `Future`s asynchronously when possible to prevent blocking.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Spark applications, when defining serializable classes, passing an implicit `ExecutionContext` parameter can cause serialization issues. In such cases, it's acceptable to use `scala.concurrent.ExecutionContext.Implicits.global`.
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:19-47
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Scala, the `grouped` method on collections returns an iterator, allowing for efficient batch processing without accumulating all records in memory.
🪛 actionlint
.github/workflows/test_scala_and_python.yaml
52-52: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (34)
hub/conf/routes (1)
Line range hint 1-26: Changes align well with PR objectives.
The removal of slice-specific endpoints (as mentioned in the previous review) and the introduction of new timeseries endpoints align well with the PR's drift metrics focus. Since this is a WIP PR, consider:
- Adding integration tests for the new endpoints
- Updating API documentation to reflect these changes
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (2)
16-21: Parameter requestsPerSecondPerExecutor is defined but not used
The rate limiting functionality is still marked as TODO and not implemented, despite having the parameter in place.
21-22: Consider serialization implications of implicit TableUtils
The class extends Serializable but takes an implicit TableUtils parameter. This could lead to serialization issues if TableUtils or its dependencies are not serializable.
.github/workflows/test_scala_and_python.yaml (2)
52-52:
Critical: Invalid runner label will break CI
The runner label ubuntu-8_cores-32_gb is not a standard GitHub-hosted runner. This will cause the CI workflow to fail.
For GitHub-hosted runners with more resources, use:
- ubuntu-latest-8-cores for an 8 core runner
- ubuntu-latest-16-cores for a 16 core runner
Apply this diff to use a valid GitHub-hosted runner:
- runs-on: ubuntu-8_cores-32_gb
+ runs-on: ubuntu-latest-8-cores
If you're using self-hosted runners:
- Ensure the runner label is properly configured
- Add an actionlint.yaml to specify custom runner labels
Likely invalid or redundant comment.
🧰 Tools
🪛 actionlint
52-52: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
67-67: Verify the significant memory increase
The SBT max heap size has been increased from 8G to 24G, which is a significant 3x increase. While this might be necessary for the new drift metrics calculations:
- Please provide justification for this large increase
- Consider if this is the minimum required memory
- Document any observed OOM issues that necessitated this change
✅ Verification successful
Based on the gathered information, I can now provide a final response:
Memory increase appears justified due to data-intensive operations
The codebase shows extensive use of memory-intensive Spark operations across multiple components:
- Large-scale data processing with DataFrames/Datasets in drift metrics calculation (
spark/src/main/scala/ai/chronon/spark/stats/drift/
) - Multiple caching and persistence operations across various components
- Complex join operations and analytics in test suites
The change history shows this was increased from 8G to 24G, and given the extensive data processing evident in the codebase, particularly around drift metrics and analytics, this memory increase appears necessary to handle the workload reliably.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for memory-related issues in recent test runs
# Look for OOM errors in recent workflow runs
gh run list --workflow=test_scala_and_python.yaml --limit 10 --json conclusion,url | \
jq -r '.[] | select(.conclusion=="failure") | .url' | \
while read -r url; do
gh run view "$url" --log | grep -i "out of memory\|java.lang.OutOfMemoryError"
done
# Check test files for memory-intensive operations
rg -l "drift.*metrics" --type scala
Length of output: 334
Script:
#!/bin/bash
# Let's try a different approach to understand memory usage patterns
# Check git history for memory-related changes in the workflow file
git log -p -- .github/workflows/test_scala_and_python.yaml | grep -B5 -A5 "SBT_OPTS.*-Xmx"
# Look for large data processing or memory-intensive operations
rg -l "DataFrame|Dataset|collect|cache|persist" --type scala
# Search for test configurations and memory settings
fd "\.conf$|\.properties$|\.sbt$" --exec grep -l "memory|heap|Xmx|Xms" {} \;
# Look for potential memory-intensive test files
fd "Test\.scala$|Spec\.scala$" --exec grep -l "test\|benchmark\|performance" {} \;
Length of output: 16930
online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1)
90-91: Consider performance optimization for large datasets.
The current implementation creates a map of all summaries, which could be memory-intensive for large datasets. Consider implementing the single-pass solution mentioned in the TODO comment.
Let's check if there are any large datasets being processed:
online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (4)
1-15
: LGTM! Clean import organization
Imports are well-structured with clear aliasing for Java types and proper Scala collection conversions.
77-79: LGTM! Clean pipe operator implementation
The PipeOps implementation provides a clean way to chain operations, improving code readability.
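For context, a generic sketch of the pipe-operator pattern being praised here (the actual PivotUtils implementation may differ in naming and placement):

object PipeOpsSketch {
  // feeds the value on the left into the function on the right: a |> f == f(a)
  implicit class PipeOps[A](private val self: A) extends AnyVal {
    def |>[B](f: A => B): B = f(self)
  }
}

// usage: import PipeOpsSketch._ ; Seq(1.0, 2.0, 3.0) |> (_.sum)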
81-89
: LGTM! Robust input handling and sorting
The implementation properly handles empty input and ensures consistent ordering through timestamp sorting.
122-130
: LGTM! Consistent implementation pattern
The implementation follows the same robust pattern as the TileSummary pivot, ensuring consistent behavior across different types.
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (3)
1-27
: LGTM: Class structure and imports are well-organized.
The class definition is clean with appropriate inheritance and constructor parameters.
148-159
: LGTM: Clean implementation of summary series retrieval.
The method is well-structured and maintains a consistent pattern with getDriftSeries.
140-140: Verify startMs calculation with lookback window.
The current implementation subtracts lookback.millis from startMs. Please verify if this provides the correct time range for drift calculation.
✅ Verification successful
These scripts will help us understand:
- How lookback windows are generally handled
- Time window calculations patterns
- The broader context of startMs usage
- The DriftStore class implementation details
Based on the search results, I can now provide a conclusive response about the lookback window calculation verification:
The lookback window calculation is correctly implemented.
The implementation is verified to be correct based on the following evidence:
- In TileDriftCalculator.scala, we can see the drift calculation logic that confirms this pattern:
  - It uses lookBackMs = lookBackWindow.millis
  - For each timestamp ts, it calculates previousTs = ts - lookBackMs
  - This matches with the DriftStore's approach of fetching data from startMs - lookBack.millis
- The implementation ensures that:
  - The store fetches enough historical data by starting from startMs - lookBack.millis
  - The calculator then uses this data to compute drifts by comparing each point with its corresponding lookback point
  - The final drift series starts from the requested startMs (as seen in the filter condition ts >= startMs in TileDriftCalculator)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for other usages of lookback window calculations
rg -l "lookback.millis" | xargs rg "startMs.*lookback"
Length of output: 54
Script:
#!/bin/bash
# Let's try a broader search for lookback window calculations
rg -A 5 "lookBack"
Length of output: 6329
Script:
#!/bin/bash
# Search for startMs usage patterns
rg -A 5 "startMs.*-"
Length of output: 1418
Script:
#!/bin/bash
# Search for time window calculations in the codebase
ast-grep --pattern 'val $_ = $_ - $_'
Length of output: 1816
Script:
#!/bin/bash
# Let's check the context of the DriftStore class
rg -A 10 "class DriftStore"
Length of output: 1279
online/src/main/scala/ai/chronon/online/stats/Display.scala (1)
1-6
: LGTM: Imports are appropriate
The imports are correctly chosen for web server implementation and HTML generation using cask and scalatags.
spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (1)
159-159: Verify the impact of metadata dataset changes
The constant change from ChrononMetadataKey to MetadataDataset might affect other parts of the system.
Let's verify the usage of these constants:
✅ Verification successful
Constant change from ChrononMetadataKey to MetadataDataset is consistent across the codebase
The search results show:
- No references to the old constant ChrononMetadataKey exist in the codebase
- The new constant MetadataDataset is consistently used across all relevant files
- The constant is properly defined in api/src/main/scala/ai/chronon/api/Constants.scala as "chronon_metadata"
- All KVStore creation patterns using this constant are aligned
Script:
#!/bin/bash
# Search for any remaining references to the old constant
rg "ChrononMetadataKey"
# Search for all usages of the new constant to ensure consistency
rg "MetadataDataset"
# Search for any KVStore creation patterns to ensure alignment
ast-grep --pattern 'inMemoryKvStore.create($CONST)'
Length of output: 4108
spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (2)
Line range hint 74-98: LGTM! Well-implemented mutable collection usage
The implementation correctly uses mutable collections for better performance, as suggested by previous learnings. The getOrElseUpdate pattern is an idiomatic Scala approach for lazy initialization.
73-73: Verify the impact of TileRow's breaking change
The change from TileSummaries to TileSummary in the case class is a breaking change that could affect downstream code.
spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (2)
5-5
: LGTM: Import consolidation looks good
The wildcard import from ai.chronon.api._
simplifies the import section while maintaining access to required entities.
258-258: LGTM: Serializer type simplification
The removal of explicit type declaration for serializer is safe here as Scala's type inference will correctly determine the type from the imported compactSerializer.
Let's verify the imported serializer's usage across the codebase:
✅ Verification successful
Let me analyze the results and generate additional verification scripts to ensure we have a complete understanding.
Type inference for serializer is correct and safe
The verification confirms that:
- The
compactSerializer
is correctly imported fromDriftStore
DriftStore.compactSerializer
returns aSerializableSerializer
type- The serializer is used in a consistent way for serializing tile rows in the RDD transformation
- There's even a documented reason for using
TBinaryProtocol.Factory()
instead of compact protocol
The removal of explicit type declaration is indeed safe as Scala will infer the correct SerializableSerializer
type from DriftStore.compactSerializer
.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of compactSerializer from DriftStore
# Expected: All serializer instantiations should use DriftStore.compactSerializer
# Search for any remaining SerializableSerializer declarations
rg "SerializableSerializer"
# Search for compactSerializer usage
rg "compactSerializer"
# Search for any custom serializer implementations
ast-grep --pattern 'class $name extends Serializer { $$$ }'
Length of output: 1353
Script:
#!/bin/bash
# Let's verify the actual usage context of serializer in Summarizer.scala
rg -B 5 -A 5 "val serializer = compactSerializer" spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala
# Check if there are any other serializer-related imports in Summarizer
rg "import.*[Ss]erializer" spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala
# Verify the type signature of compactSerializer in DriftStore
rg -B 2 -A 2 "def compactSerializer" online/src/main/scala/ai/chronon/online/stats/DriftStore.scala
Length of output: 1043
spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (4)
7-10
: LGTM: Clean imports and well-structured field declarations
The new imports and field declarations are well-organized and follow Scala naming conventions.
Also applies to: 43-46
142-151
: LGTM: Well-structured join configuration
The join configuration is clean and properly integrates the drift metrics functionality through metadata.
Line range hint 241-293
: LGTM: Clean schema implementation with clear categorization
The schema implementation is well-structured with:
- Reusable field prefixing through implicit class
- Clear categorization of fields with helpful comments
- Consistent naming conventions
176-177
: Previous concern about negative maxGap is still valid
The potential for negative maxGap leading to runtime exceptions hasn't been addressed yet.
api/thrift/api.thrift (1)
295-314
:
Fix typos and standardize naming pattern.
Issues found:
- Typo in field name:
percentilDrift
should bepercentileDrift
- Inconsistent naming pattern: some fields use 'Change' while others use 'Drift'
Apply this diff to fix the issues:
struct TileDrift {
- 1: optional double percentilDrift
+ 1: optional double percentileDrift
- 3: optional double countChangePercent
- 4: optional double nullRatioChangePercent
+ 3: optional double countDrift
+ 4: optional double nullRatioDrift
- 5: optional double innerCountChangePercent
- 6: optional double innerNullCountChangePercent
+ 5: optional double innerCountDrift
+ 6: optional double innerNullCountDrift
}
api/src/main/scala/ai/chronon/api/Extensions.scala (7)
161-169
: LGTM! Well-structured null handling.
The implementation uses idiomatic Scala patterns for safe null handling and provides a sensible default through Constants.
891-901
: Reiterating past review comment: Handle potential null values.
902-912
: Reiterating past review comment: Handle potential null names.
914-921
: LGTM! Clean functional implementation.
The method follows Scala best practices with proper null handling and functional composition.
922-937
: LGTM! Well-structured column name generation.
The implementation properly handles null cases and follows a clear pattern for generating column names.
939-947
: Reiterating past review comment: Avoid mapValues.
949-953
: Reiterating past review comment: Ensure safe collection access.
hub/app/controllers/TimeSeriesController.scala (3)
146-146: Previous comment still valid: Inconsistent error messages for drift algorithms
The error message here has been updated, but similar error messages elsewhere in the file remain outdated (e.g., line 178). This inconsistency could confuse users about available drift algorithm options. Please update all similar error messages throughout the file to consistently show "Expect JSD, PSI, or Hellinger".
237-240: Previous comment still valid: Improve JSON structure documentation
The comments describing the JSON structure are not properly formatted and lack context. Consider replacing them with properly formatted and detailed documentation to enhance code readability and maintainability.
283-290: Previous comment still valid: Improve error handling in parseAlgorithm method
The current implementation has several issues:
- Using throw in a parse method is not idiomatic Scala.
- Case matching could be simplified.
- Error message is inconsistent with other messages in the file.
try {
  api.genKvStore.create(datasetName)
} catch {
  // swallows all exceptions right now
  // TODO: swallow only already existing exception - move this into the kvstore.createIfNotExists
  case e: Exception => e.printStackTrace()
}
Improve exception handling in dataset creation
The current implementation swallows all exceptions, which could hide critical issues. The TODO comment suggests this needs to be refined.
Consider this approach:
- try {
- api.genKvStore.create(datasetName)
- } catch {
- // swallows all exceptions right now
- // TODO: swallow only already existing exception - move this into the kvstore.createIfNotExists
- case e: Exception => e.printStackTrace()
- }
+ try {
+ api.genKvStore.create(datasetName)
+ } catch {
+ case e: KVStoreException if e.getMessage.contains("already exists") => // Dataset already exists, ignore
+ case e: Exception =>
+ throw new RuntimeException(s"Failed to create dataset $datasetName: ${e.getMessage}", e)
+ }
Committable suggestion skipped: line range outside the PR's diff.
summaryDF.rdd.foreachPartition((rows: Iterator[Row]) => { | ||
val kvStore: KVStore = api.genKvStore | ||
|
||
val putRequests = new scala.collection.mutable.ArrayBuffer[PutRequest] | ||
for (row <- rows) { | ||
putRequests += PutRequest( | ||
Option(row.getAs[Array[Byte]]("keyBytes")).getOrElse(Array.empty[Byte]), | ||
Option(row.getAs[Array[Byte]]("valueBytes")).getOrElse(Array.empty[Byte]), | ||
statsTableName, | ||
Option(row.getAs[Long]("timestamp")) | ||
) | ||
def toPutRequest(row: Row): PutRequest = { | ||
if (row.isNullAt(keyIndex)) return null | ||
val keyBytes = row.getAs[Array[Byte]](keyIndex) | ||
val valueBytes = if (row.isNullAt(valueIndex)) Array.empty[Byte] else row.getAs[Array[Byte]](valueIndex) | ||
val timestamp = | ||
if (timestampIndex < 0 || row.isNullAt(timestampIndex)) None else Some(row.getAs[Long](timestampIndex)) | ||
PutRequest(keyBytes, valueBytes, datasetName, timestamp) | ||
} | ||
|
||
val futureResults = putRequests.grouped(putsPerRequest).map { batch => | ||
kvStore | ||
.multiPut(batch.toList) | ||
.map { result => | ||
if (!result.forall(identity)) { | ||
throw new RuntimeException(s"Failed to put ${result.count(!_)} records") | ||
} | ||
} | ||
.recover { | ||
case e => | ||
throw new RuntimeException(s"Failed to put batch: ${e.getMessage}", e) | ||
} | ||
} | ||
// TODO implement rate limiting | ||
val putResponses: Iterator[Future[Seq[Boolean]]] = rows | ||
.grouped(putsPerRequest) | ||
.map { _.map(toPutRequest).filter(_ != null).toArray } | ||
.map { requests => kvStore.multiPut(requests) } | ||
|
||
val aggregatedFuture = Future.sequence(futureResults.toSeq) | ||
val aggregatedFuture = Future.sequence(putResponses.toSeq).map(_.flatten) | ||
aggregatedFuture.onComplete { | ||
case scala.util.Success(_) => // All operations completed successfully | ||
case scala.util.Failure(e: IllegalArgumentException) => | ||
throw new IllegalArgumentException(s"Invalid request data: ${e.getMessage}", e) | ||
case scala.util.Failure(e: java.io.IOException) => | ||
throw new RuntimeException(s"KVStore I/O error: ${e.getMessage}", e) | ||
case scala.util.Failure(e) => | ||
case Success(_) => // All operations completed successfully | ||
case Failure(e) => | ||
throw new RuntimeException(s"Failed to upload summary statistics: ${e.getMessage}", e) | ||
} | ||
}) |
💡 Codebase verification
Async operations are safe, but rate limiting should be implemented
The review comment's concerns about async operations and rate limiting can be addressed based on the codebase analysis:
-
The async operations within
foreachPartition
are actually safe in the current implementation. TheFuture.sequence
followed byonComplete
ensures proper handling of failures, and the task won't complete until all futures are resolved. -
Rate limiting should indeed be implemented, following the pattern used in
DynamoDBKVStoreImpl
which demonstrates a proper implementation usingRateLimiter
with separate limiters for read and write operations. -
The error handling in the current implementation is sufficient for the use case, as it captures and propagates all failures with proper context.
Suggested implementation for rate limiting:
private val rateLimiter = RateLimiter.create(requestsPerSecondPerExecutor)
// In the existing code, before kvStore.multiPut:
rateLimiter.acquire(1)
🔗 Analysis chain
Ensure completion of async operations and implement rate limiting
The current implementation has several concerns:
- Async operations within
foreachPartition
might not complete before the task ends - Rate limiting is not implemented despite having the parameter
- Error handling could be more specific
Consider this implementation:
import scala.concurrent.duration._
import com.google.common.util.concurrent.RateLimiter
summaryDF.rdd.foreachPartition((rows: Iterator[Row]) => {
val kvStore: KVStore = api.genKvStore
val rateLimiter = RateLimiter.create(requestsPerSecondPerExecutor)
def toPutRequest(row: Row): PutRequest = {
// ... existing toPutRequest implementation ...
}
try {
val putResponses = rows
.grouped(putsPerRequest)
.map { batch =>
// Rate limit the requests
rateLimiter.acquire(1)
val requests = batch.map(toPutRequest).filter(_ != null).toArray
val future = kvStore.multiPut(requests)
// Wait for each batch to complete with timeout
Await.result(future, 30.seconds)
}
.toSeq
// Ensure all batches completed successfully
if (putResponses.exists(_.exists(!_))) {
throw new RuntimeException("Some put requests failed")
}
} catch {
case e: TimeoutException =>
throw new RuntimeException(s"Request timed out: ${e.getMessage}", e)
case e: IllegalArgumentException =>
throw new IllegalArgumentException(s"Invalid request data: ${e.getMessage}", e)
case e: Exception =>
throw new RuntimeException(s"Failed to upload summary statistics: ${e.getMessage}", e)
}
})
This implementation:
- Uses rate limiting to control request rate
- Ensures each batch completes before processing the next
- Provides more specific error handling
- Sets a timeout for each batch to prevent indefinite waiting
Note: While we generally avoid Await.result
, it's necessary here to ensure completion within foreachPartition
.
Run this script to check for potential rate limiting implementations in similar classes:
Let me check the rate limiting implementation in the DynamoDB store to better understand the existing patterns.
Let me check one more thing to understand how the async operations are handled in the DynamoDB implementation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for rate limiting implementations in similar classes
rg -l "RateLimiter" --type scala
ast-grep --pattern 'class $_ extends $_ { $$$
val rateLimiter = RateLimiter.create($_)
$$$
}'
Length of output: 211
Script:
#!/bin/bash
# Check the rate limiting implementation in DynamoDBKVStoreImpl
rg -A 10 "RateLimiter" cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala
# Also check if there are any other rate limiting related configurations
rg -i "rate.*limit|throttl" --type scala
Length of output: 5167
Script:
#!/bin/bash
# Check how futures and async operations are handled in DynamoDBKVStoreImpl
rg -B 5 -A 5 "Future.sequence" cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala
# Also check if there are any await patterns used
rg "Await" --type scala
Length of output: 7960
if (a.isSetCount && b.isSetCount && b.getCount != 0.0) {
  // preserving sign, since direction of change is usually relevant
  val change = (a.getCount - b.getCount).toDouble / b.getCount
  result.setCountChangePercent(change * 100)
}
🛠️ Refactor suggestion
Enhance division by zero protection and error handling.
The code checks for zero in some denominators but not all. Also, consider adding logging for skipped calculations.
if (a.isSetCount && b.isSetCount && b.getCount != 0.0) {
val change = (a.getCount - b.getCount).toDouble / b.getCount
result.setCountChangePercent(change * 100)
+} else if (a.isSetCount && b.isSetCount) {
+ log.warn(s"Skipping count change calculation due to zero count in before summary")
}
if (a.isSetNullCount && b.isSetNullCount && b.getNullCount != 0) {
+ if (a.getCount == 0.0 || b.getCount == 0.0) {
+ log.warn("Skipping null ratio calculation due to zero count")
+ } else {
val aNullRatio = a.getNullCount.toDouble / a.getCount.toDouble
val bNullRatio = b.getNullCount.toDouble / b.getCount.toDouble
val change = (aNullRatio - bNullRatio) / bNullRatio
result.setNullRatioChangePercent(change * 100)
+ }
}
Also applies to: 45-51, 53-57, 59-65
@inline
private def toArray(l: java.util.List[java.lang.Double]): Array[Double] = {
  l.iterator().toScala.map(_.toDouble).toArray
}

@inline
private def toArray(l: java.util.List[java.lang.Integer]): Array[Int] = {
  l.iterator().toScala.map(_.toInt).toArray
}
🛠️ Refactor suggestion
Add null checks to array conversion methods.
The toArray
methods should handle null input gracefully to prevent NullPointerException
.
@inline
private def toArray(l: java.util.List[java.lang.Double]): Array[Double] = {
+ if (l == null) return Array.empty[Double]
l.iterator().toScala.map(_.toDouble).toArray
}
@inline
private def toArray(l: java.util.List[java.lang.Integer]): Array[Int] = {
+ if (l == null) return Array.empty[Int]
l.iterator().toScala.map(_.toInt).toArray
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@inline
private def toArray(l: java.util.List[java.lang.Double]): Array[Double] = {
  l.iterator().toScala.map(_.toDouble).toArray
}
@inline
private def toArray(l: java.util.List[java.lang.Integer]): Array[Int] = {
  l.iterator().toScala.map(_.toInt).toArray
}
@inline
private def toArray(l: java.util.List[java.lang.Double]): Array[Double] = {
  if (l == null) return Array.empty[Double]
  l.iterator().toScala.map(_.toDouble).toArray
}
@inline
private def toArray(l: java.util.List[java.lang.Integer]): Array[Int] = {
  if (l == null) return Array.empty[Int]
  l.iterator().toScala.map(_.toInt).toArray
}
}

// a is after, b is before
def between(a: TileSummary, b: TileSummary, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = {
🛠️ Refactor suggestion
Add comprehensive input validation.
The method should validate input parameters at the start.
def between(a: TileSummary, b: TileSummary, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = {
+ require(a != null, "After TileSummary (a) cannot be null")
+ require(b != null, "Before TileSummary (b) cannot be null")
+ require(metric != null, "DriftMetric cannot be null")
driftSeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")}

println("Drift series fetched successfully".green)
🛠️ Refactor suggestion
Add assertions for drift calculation results
The test prints drift series results but doesn't validate them. Add assertions to verify the expected behavior.
- driftSeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")}
+ driftSeries.foreach { s =>
+ logger.info(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")
+ // Add assertions
+ s.getKey.getColumn should not be null
+ s.getPercentilDriftSeries should not be empty
+ // Verify drift values are within expected range
+ s.getPercentilDriftSeries.toScala.foreach { drift =>
+ drift should be >= 0.0
+ drift should be <= 1.0
+ }
+ }
- println("Drift series fetched successfully".green)
+ logger.info("Drift series fetched successfully")
val startMs = PartitionSpec.daily.epochMillis("2023-01-01")
val endMs = PartitionSpec.daily.epochMillis("2023-01-29")
val summariesFuture = driftStore.getSummaries(join, Some(startMs), Some(endMs))
val summaries = Await.result(summariesFuture, Duration.create(10, TimeUnit.SECONDS))
println(summaries)

// fetch drift series
val driftSeriesFuture = driftStore.getDriftSeries(
join.metaData.nameToFilePath,
DriftMetric.JENSEN_SHANNON,
lookBack = new Window(7, chronon.api.TimeUnit.DAYS),
startMs,
endMs
)
val driftSeries = Await.result(driftSeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))
Improve async operation handling
The current implementation has several issues with async operations:
- Hardcoded timeout duration
- No error handling for failed futures
- Missing validation of results
+ private val DefaultTimeout = Duration(10, TimeUnit.SECONDS)
+
- val startMs = PartitionSpec.daily.epochMillis("2023-01-01")
- val endMs = PartitionSpec.daily.epochMillis("2023-01-29")
+ val startMs = PartitionSpec.daily.epochMillis(TestConfig.StartDate)
+ val endMs = PartitionSpec.daily.epochMillis(TestConfig.EndDate)
val summariesFuture = driftStore.getSummaries(join, Some(startMs), Some(endMs))
- val summaries = Await.result(summariesFuture, Duration.create(10, TimeUnit.SECONDS))
+ val summaries = try {
+ Await.result(summariesFuture, DefaultTimeout)
+ } catch {
+ case e: Exception =>
+ fail(s"Failed to fetch summaries: ${e.getMessage}")
+ }
+ summaries should not be null
+ summaries should not be empty
val driftSeriesFuture = driftStore.getDriftSeries(
join.metaData.nameToFilePath,
DriftMetric.JENSEN_SHANNON,
lookBack = new Window(7, chronon.api.TimeUnit.DAYS),
startMs,
endMs
)
- val driftSeries = Await.result(driftSeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))
+ val driftSeries = try {
+ Await.result(driftSeriesFuture.get, DefaultTimeout)
+ } catch {
+ case e: Exception =>
+ fail(s"Failed to fetch drift series: ${e.getMessage}")
+ }
+ driftSeries should not be empty
private def generateChartJs(chart: Chart, chartId: String): String = {
val data = chart.seriesList.map(_.series)
val xData = chart.x.map(_.toString)
chart.seriesList.map(_.name)

val seriesConfig = chart.seriesList.map(s => s"""{
| label: "${s.name}",
| stroke: "rgb(${scala.util.Random.nextInt(255)}, ${scala.util.Random.nextInt(255)}, ${scala.util.Random.nextInt(255)})"
|
|}""".stripMargin).mkString(",\n")

val thresholdLines = (chart.moderateThreshold.map(t => s"""
|{
| label: "Moderate Threshold",
| value: $t,
| stroke: "#ff9800",
| style: [2, 2]
|}""".stripMargin) ++
chart.severeThreshold.map(t => s"""
|{
| label: "Severe Threshold",
| value: $t,
| stroke: "#f44336",
| style: [2, 2]
|}""".stripMargin)).mkString(",")

s"""
|new uPlot({
| title: "${chart.name}",
| id: "$chartId",
| class: "chart",
| width: 800,
| height: 400,
| scales: {
| x: {
| time: false,
| }
| },
| series: [
| {},
| $seriesConfig
| ],
| axes: [
| {},
| {
| label: "Value",
| grid: true,
| }
| ],
| plugins: [
| {
| hooks: {
| draw: u => {
| ${if (thresholdLines.nonEmpty)
s"""const lines = [$thresholdLines];
| for (const line of lines) {
| const scale = u.scales.y;
| const y = scale.getPos(line.value);
|
| u.ctx.save();
| u.ctx.strokeStyle = line.stroke;
| u.ctx.setLineDash(line.style);
|
| u.ctx.beginPath();
| u.ctx.moveTo(u.bbox.left, y);
| u.ctx.lineTo(u.bbox.left + u.bbox.width, y);
| u.ctx.stroke();
|
| u.ctx.restore();
| }""".stripMargin
else ""}
| }
| }
| }
| ]
|}, [${xData.mkString("\"", "\",\"", "\"")}, ${data
.map(_.mkString(","))
.mkString("[", "],[", "]")}], document.getElementById("$chartId"));
|""".stripMargin
}
Add input validation and escape special characters
The chart generation logic has potential issues:
- Missing validation for array lengths between x-axis and series data
- Names used in JS string interpolation aren't escaped, potentially leading to XSS vulnerabilities
Add input validation and escaping:
private def generateChartJs(chart: Chart, chartId: String): String = {
// Validate data
require(chart.x.length == chart.seriesList.head.series.length,
"X-axis length must match series length")
chart.seriesList.foreach(s =>
require(s.series.length == chart.x.length,
s"Series ${s.name} length must match x-axis length"))
// Escape special characters
def escapeJs(s: String): String =
s.replace("\"", "\\\"").replace("'", "\\'")
val escapedChartName = escapeJs(chart.name)
val escapedSeriesNames = chart.seriesList.map(s => escapeJs(s.name))
// Rest of the implementation...
}
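If escaping is the route taken, a slightly more defensive variant is sketched below. It is only an illustration (escapeForInlineScript is not part of the PR); it additionally handles backslashes and the "</" sequence, which the quote-only replacement above would let through.
// Hypothetical helper, not the PR's code: escapes values before they are interpolated
// into the inline <script> body generated by generateChartJs.
def escapeForInlineScript(s: String): String =
  s.replace("\\", "\\\\")
    .replace("\"", "\\\"")
    .replace("'", "\\'")
    .replace("</", "<\\/") // prevents premature termination of the surrounding <script> tag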
def serve(board: Board, portVal: Int = 9032): Unit = {

object Server extends cask.MainRoutes {
@get("/")
def index() = {
val page = html(
head(
title(board.name),
script(src := "https://unpkg.com/[email protected]/dist/uPlot.iife.min.js"),
link(rel := "stylesheet", href := "https://unpkg.com/[email protected]/dist/uPlot.min.css"),
tag("style")("""
|body { font-family: Arial, sans-serif; margin: 20px; }
|.tab { display: none; }
|.tab.active { display: block; }
|.tab-button { padding: 10px 20px; margin-right: 5px; cursor: pointer; }
|.tab-button.active { background-color: #ddd; }
|.section { margin: 20px 0; }
|.chart { margin: 20px 0; }
""".stripMargin)
),
body(
h1(board.name),
div(cls := "tabs")(
board.tabList.map(tab =>
button(
cls := "tab-button",
onclick := s"showTab('${tab.name}')",
tab.name
))
),
board.tabList.map(tab =>
div(cls := "tab", id := tab.name)(
tab.sectionList.map(section =>
div(cls := "section")(
h2(section.name),
section.charts.map(chart =>
div(cls := "chart")(
div(id := s"${tab.name}-${section.name}-${chart.name}".replaceAll("\\s+", "-"))
))
))
)),
script(raw("""
|function showTab(tabName) {
| document.querySelectorAll('.tab').forEach(tab => {
| tab.style.display = tab.id === tabName ? 'block' : 'none';
| });
| document.querySelectorAll('.tab-button').forEach(button => {
| button.classList.toggle('active', button.textContent === tabName);
| });
|}
|
|// Show first tab by default
|document.querySelector('.tab-button').click();
""".stripMargin)),
script(
raw(
board.tabList
.flatMap(tab =>
tab.sectionList.flatMap(section =>
section.charts.map(chart =>
generateChartJs(chart, s"${tab.name}-${section.name}-${chart.name}".replaceAll("\\s+", "-")))))
.mkString("\n")
))
)
)

// page.render

cask.Response(
page.render,
headers = Seq("Content-Type" -> "text/html")
)
}

override def host: String = "0.0.0.0"
override def port: Int = portVal

initialize()
}

Server.main(Array())
}
Add security headers and restrict binding address
The server implementation has security concerns:
- Binding to 0.0.0.0 exposes the server on all network interfaces
- Missing security headers like Content-Security-Policy
Consider these security improvements:
def serve(board: Board, portVal: Int = 9032, bindHost: String = "127.0.0.1"): Unit = {
object Server extends cask.MainRoutes {
override def host: String = bindHost // renamed from `host` so the override doesn't shadow the parameter and call itself recursively
@get("/")
def index() = {
cask.Response(
page.render,
headers = Seq(
"Content-Type" -> "text/html",
"Content-Security-Policy" -> "default-src 'self' unpkg.com; script-src 'self' 'unsafe-inline' unpkg.com",
"X-Content-Type-Options" -> "nosniff",
"X-Frame-Options" -> "DENY"
)
)
}
// ...
}
}
api/thrift/api.thrift
Outdated
// PivotUtils.pivot(Array[(Long, TileDrift)]) = TileDriftSeries
// used in front end after this is computed
struct TileDriftSeries {
1: optional list<double> percentilDriftSeries
2: optional list<double> histogramDriftSeries
3: optional list<double> countChangePercentSeries
4: optional list<double> nullRatioChangePercentSeries

5: optional list<double> innerCountChangePercentSeries
6: optional list<double> innerNullCountChangePercentSeries
7: optional list<double> lengthPercentilesDriftSeries
8: optional list<double> stringLengthPercentilesDriftSeries

200: optional list<i64> timestamps

300: optional TileSeriesKey key
}
🛠️ Refactor suggestion
Restructure TileDriftSeries and fix naming inconsistencies.
- Consider using a list of TileDrift objects for better maintainability
- Fix the naming inconsistencies to match the changes in TileDrift
-// PivotUtils.pivot(Array[(Long, TileDrift)]) = TileDriftSeries
-// used in front end after this is computed
+/**
+ * Time series of drift metrics, created by pivoting (TileDrift, timestamp) tuples
+ * using PivotUtils. Used by the frontend for visualization.
+ */
struct TileDriftSeries {
- 1: optional list<double> percentilDriftSeries
- 2: optional list<double> histogramDriftSeries
- 3: optional list<double> countChangePercentSeries
- 4: optional list<double> nullRatioChangePercentSeries
- 5: optional list<double> innerCountChangePercentSeries
- 6: optional list<double> innerNullCountChangePercentSeries
- 7: optional list<double> lengthPercentilesDriftSeries
- 8: optional list<double> stringLengthPercentilesDriftSeries
+ 1: optional list<TileDrift> drifts
}
Committable suggestion skipped: line range outside the PR's diff.
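To make the pivot relationship in the struct comment concrete, here is a minimal sketch with plain case classes. SimpleDrift and SimpleDriftSeries are stand-ins invented for the example, not the Thrift-generated TileDrift/TileDriftSeries types, and this is not the actual PivotUtils code.
// Illustration only: an array of (timestamp, drift) rows becomes one columnar series per field.
case class SimpleDrift(percentileDrift: Double, histogramDrift: Double)
case class SimpleDriftSeries(timestamps: Array[Long],
                             percentileDriftSeries: Array[Double],
                             histogramDriftSeries: Array[Double])

def pivot(rows: Array[(Long, SimpleDrift)]): SimpleDriftSeries = {
  val sorted = rows.sortBy(_._1) // series are keyed by ascending timestamp
  SimpleDriftSeries(
    timestamps = sorted.map(_._1),
    percentileDriftSeries = sorted.map(_._2.percentileDrift),
    histogramDriftSeries = sorted.map(_._2.histogramDrift)
  )
}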
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (10)
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (1)
Line range hint
35-44
: Replace println with proper test loggingThe
showTable
helper method usesprintln
for debugging output. Consider using ScalaTest's built-in logging or a proper logging framework for better test output management.def showTable(name: String)(implicit tableUtils: TableUtils): Unit = { - println(s"Showing table $name".yellow) + info(s"Showing table $name") val df = tableUtils.loadTable(name) val maxColNameLength = df.schema.fieldNames.map(_.length).max def pad(s: String): String = s.padTo(maxColNameLength, ' ') df.schema.fields.foreach { f => - println(s" ${pad(f.name)} : ${f.dataType.typeName}".yellow) + info(s" ${pad(f.name)} : ${f.dataType.typeName}") } df.show(10, truncate = false) }spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (2)
5-5
: Consider using explicit imports instead of wildcard importWhile
import ai.chronon.api._
consolidates multiple imports, explicit imports make dependencies clearer and help prevent naming conflicts.-import ai.chronon.api._ +import ai.chronon.api.{ColorString, Extensions, /* other specific imports */}
Line range hint
143-143
: Consider implementing persistence for cardinality calculationsThe TODO comment indicates missing persistence for cardinality calculations. This could lead to unnecessary recalculations and performance impact, especially with large datasets.
Consider:
- Implementing a caching mechanism using the kvstore
- Adding cache invalidation strategy based on data changes
- Including versioning for cached cardinality maps
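A minimal sketch of what such a cache could look like, assuming a generic key-value interface. SimpleStore and CardinalityCache below are stand-ins invented for the example and are not Chronon's KVStore API.
import scala.collection.concurrent.TrieMap

trait SimpleStore {
  def get(key: String): Option[Array[Byte]]
  def put(key: String, value: Array[Byte]): Unit
}

class InMemorySimpleStore extends SimpleStore {
  private val data = TrieMap.empty[String, Array[Byte]]
  override def get(key: String): Option[Array[Byte]] = data.get(key)
  override def put(key: String, value: Array[Byte]): Unit = data.update(key, value)
}

class CardinalityCache(store: SimpleStore, version: String) {
  // The key embeds a version so that a schema or logic change invalidates old entries.
  private def key(table: String, column: String): String = s"cardinality/$version/$table/$column"

  def getOrCompute(table: String, column: String)(compute: => Long): Long =
    store.get(key(table, column)) match {
      case Some(bytes) => new String(bytes, "UTF-8").toLong
      case None =>
        val value = compute // the expensive scan happens only on a cache miss
        store.put(key(table, column), value.toString.getBytes("UTF-8"))
        value
    }
}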
api/thrift/api.thrift (3)
253-256
: Enhance documentation for the TileSummary struct.The comments are too brief and don't fully explain the purpose and usage of this important struct. Consider adding comprehensive documentation that explains:
- The relationship between categorical and numerical types
- The meaning of inner values and container types
- Examples of when to use different fields
319-335
: Consider restructuring TileDriftSeries for better maintainability.The current implementation duplicates fields from TileDrift, which could lead to maintenance issues. Additionally, the pivoting process needs better documentation.
Consider:
- Using a list of TileDrift objects instead of individual series
- Adding comprehensive documentation about the pivoting process
- Including examples of how the pivoting affects the data structure
struct TileDriftSeries { - 1: optional list<double> percentilDriftSeries - 2: optional list<double> histogramDriftSeries - 3: optional list<double> countChangePercentSeries - 4: optional list<double> nullRatioChangePercentSeries - 5: optional list<double> innerCountChangePercentSeries - 6: optional list<double> innerNullCountChangePercentSeries - 7: optional list<double> lengthPercentilesDriftSeries - 8: optional list<double> stringLengthPercentilesDriftSeries + 1: optional list<TileDrift> drifts }
354-359
: Move threshold documentation closer to implementation.The threshold documentation currently exists in the DriftMetric enum comment but would be more useful here where the drift metric is actually used.
Add documentation about thresholds:
// the current tile summary will be compared with older summaries using the metric // if the drift is more than the threshold, we will raise an alert + // Thresholds: + // - JENSEN_SHANNON: moderate (0.05-0.1), severe (>0.1) + // - HELLINGER: moderate (0.1-0.25), severe (>0.25) + // - PSI: moderate (0.1-0.2), severe (>0.2) // default drift metric to usespark/src/main/scala/ai/chronon/spark/Driver.scala (4)
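For intuition about what these thresholds measure, here is a small, self-contained Jensen-Shannon computation. It assumes natural-log KL divergence and is not the repository's DriftMetrics implementation, which may normalize differently.
object JsdSketch {
  // KL(p || q), skipping zero-probability buckets
  private def klDivergence(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).collect { case (pi, qi) if pi > 0 && qi > 0 => pi * math.log(pi / qi) }.sum

  def jensenShannon(p: Array[Double], q: Array[Double]): Double = {
    val m = p.zip(q).map { case (pi, qi) => (pi + qi) / 2 }
    0.5 * klDivergence(p, m) + 0.5 * klDivergence(q, m)
  }

  def main(args: Array[String]): Unit = {
    val before = Array(0.25, 0.25, 0.25, 0.25)
    val after  = Array(0.30, 0.30, 0.20, 0.20)
    val drift = jensenShannon(after, before)
    // roughly 0.005 for this pair, well under the 0.05 moderate threshold quoted above
    println(f"JSD drift: $drift%.4f")
  }
}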
868-871
: Enhance logging message with additional context.The current log message could be more informative by including the properties being set.
- logger.info(s"Creating table '${Constants.TiledSummaryDataset}'") + logger.info(s"Creating table '${Constants.TiledSummaryDataset}' with properties: $props")
882-883
: Track TODO item for configuration management.The TODO comment indicates a need to refactor the configuration handling to be more automated.
Would you like me to create a GitHub issue to track this technical debt item?
911-913
: Enhance error logging with complete context.The error message could be more helpful by including all relevant parameters.
- logger.error(s"Failed to summarize and upload data for conf: $confPath", e) + logger.error(s"Failed to summarize and upload data. Conf: $confPath, Parquet: ${args.parquetPath()}", e)
900-901
: Reduce code duplication by extracting confPath.The
confPath
variable is used in multiple places and could be extracted to avoid repetition.- val confPath = args.confPath() - val summarizer = new Summarizer(confPath, timeColumn = args.timeColumn.toOption) + val confPath = args.confPath() + val summarizer = new Summarizer(confPath, timeColumn = args.timeColumn.toOption) try { val df = sparkSession.read.parquet(args.parquetPath()) val (result, summaryExprs) = summarizer.computeSummaryDf(df) - val packer = new SummaryPacker(confPath, summaryExprs, summarizer.tileSize, summarizer.sliceColumns) + val packer = new SummaryPacker(summarizer.confPath, summaryExprs, summarizer.tileSize, summarizer.sliceColumns)Also applies to: 905-905
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (9)
api/src/main/scala/ai/chronon/api/Builders.scala
(3 hunks)api/src/main/scala/ai/chronon/api/Constants.scala
(2 hunks)api/thrift/api.thrift
(4 hunks)build.sbt
(1 hunks)online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/Driver.scala
(4 hunks)spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala
(6 hunks)spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala
(1 hunks)spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- api/src/main/scala/ai/chronon/api/Constants.scala
- build.sbt
- online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala
- spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala
🔇 Additional comments (8)
api/src/main/scala/ai/chronon/api/Builders.scala (3)
270-271
: Past review comment is still applicable
The previous review comment about using Option[DriftSpec]
instead of null is still valid and hasn't been addressed.
287-287
: LGTM!
The historicalBackfill setter is correctly implemented and follows the established pattern.
296-297
: Verify drift metrics integration and add tests
Since this PR is focused on drift metrics and is marked as WIP:
- Please ensure proper integration with other drift-related components
- Add unit tests to verify the drift functionality
Let's check for related drift metric tests and implementations:
✅ Verification successful
Let's gather more information about the test coverage and implementation details:
Based on the comprehensive test and implementation evidence gathered, I can now generate the final response:
Drift functionality is well-tested and integrated
The codebase shows robust drift metric implementation with:
- Comprehensive unit tests covering different drift scenarios (low, moderate, severe) in
DriftMetricsTest.scala
- End-to-end integration tests in
DriftTest.scala
- Well-implemented drift metrics (Jensen-Shannon, PSI, Hellinger) with proper thresholds
- Complete integration with the DriftStore and other drift-related components
The null check in the builder is consistent with the codebase patterns and the DriftSpec
Thrift definition.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for drift-related tests and implementations
# Check for drift-related test files
echo "Searching for drift-related test files..."
fd -e scala -e java test.*Drift
# Check for drift metric implementations
echo "Searching for drift metric implementations..."
rg -l "class.*Drift|trait.*Drift|object.*Drift"
# Check for DriftSpec usage
echo "Checking DriftSpec usage..."
rg "DriftSpec"
Length of output: 1312
Script:
#!/bin/bash
# Check content of drift-related test files
echo "=== Content of DriftTest.scala ==="
cat "spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala"
echo -e "\n=== Content of DriftMetricsTest.scala ==="
cat "online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala"
# Check DriftSpec structure
echo -e "\n=== DriftSpec structure from Thrift ==="
rg -A 10 "struct DriftSpec" api/thrift/api.thrift
# Check implementation details
echo -e "\n=== DriftMetrics implementation ==="
cat "online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala"
Length of output: 13484
spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (3)
302-302
: Verify nameToFilePath implementation
The change uses metadata.nameToFilePath
for configuration path conversion. Ensure this method properly handles path conversion and exists in the MetaData class.
✅ Verification successful
Based on the results, I can now generate the final response:
nameToFilePath implementation is properly defined and consistently used
The implementation is verified in api/src/main/scala/ai/chronon/api/Extensions.scala
:
def nameToFilePath: String = metaData.name.replaceFirst("\\.", "/")
The method correctly converts dot-notation names to file paths by replacing the first dot with a forward slash. This implementation is used consistently across the codebase for path conversion in various components including:
- Spark jobs (LogFlattenerJob, ConsistencyJob)
- Test suites (SchemaEvolutionTest, FetcherTest)
- Online components (MetadataStore, Fetcher)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check the implementation of nameToFilePath
rg "def\s+nameToFilePath|val\s+nameToFilePath" --type scala
# Check usage patterns
rg "\.nameToFilePath" --type scala -A 2
Length of output: 6241
30-30
: Parameter rename improves clarity but verify usage
The rename from name
to confPath
better reflects the parameter's purpose as a configuration path.
✅ Verification successful
Parameter rename is safe and all usages are consistent
The verification shows that all instances of Summarizer
instantiation are already using the parameter as a path:
- In
Driver.scala
:new Summarizer(confPath, ...)
- In
Summarizer.scala
:new Summarizer(metadata.nameToFilePath, ...)
- No instances found using the old parameter name 'name'
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any remaining references to the old parameter name
rg "new Summarizer\([^)]*name\s*=" --type scala
Length of output: 49
Script:
#!/bin/bash
# Search for Summarizer instantiations to check parameter usage
rg "new Summarizer\(" -A 2 --type scala
# Also search for references to Summarizer class to understand usage
rg "Summarizer" --type scala
Length of output: 1719
229-229
: Verify thread safety of shared serializer
While the changes maintain consistency with the Summarizer
class, using a shared compactSerializer
requires careful consideration of thread safety in a Spark environment.
Also applies to: 246-246, 256-256
✅ Verification successful
Let me check the TSerializer implementation to understand its thread safety guarantees.
Based on the search results, I can now generate the final response since I have enough information about the serializer implementation:
Shared serializer is thread-safe by design
The compactSerializer
is implemented in DriftStore
as a factory method that creates a new instance for each call:
def compactSerializer: SerializableSerializer = new SerializableSerializer(new TBinaryProtocol.Factory())
This ensures thread safety as each thread gets its own serializer instance. The change from local definition to imported serializer is safe to use in Spark's distributed environment.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check the implementation of compactSerializer for thread safety
rg -A 5 "compactSerializer\s*=|def\s+compactSerializer" --type scala
Length of output: 1161
Script:
#!/bin/bash
# Check if TSerializer is instantiated per request or shared
rg -B 3 -A 3 "TSerializer|compactSerializer" --type scala
# Check if there are any comments or documentation about thread safety
rg -B 3 -A 3 "thread.*safe|concurrent" --type scala
Length of output: 46104
api/thrift/api.thrift (1)
302-302
: Previous feedback remains unaddressed.
spark/src/main/scala/ai/chronon/spark/Driver.scala (1)
Line range hint 949-953
: LGTM: Command registration changes are consistent.
The command registration and handling changes correctly reflect the object renaming from CreateStatsTable to CreateSummaryDataset.
Also applies to: 990-991
// mock api impl for online fetching and uploading
val kvStoreFunc: () => KVStore = () => {
// cannot reuse the variable - or serialization error
val result = InMemoryKvStore.build("drift_test", () => null)
result
}
🛠️ Refactor suggestion
Add resource cleanup for KVStore
The KVStore instance should be properly cleaned up after the test to prevent resource leaks.
val kvStoreFunc: () => KVStore = () => {
val result = InMemoryKvStore.build("drift_test", () => null)
+ // Register cleanup hook
+ sys.addShutdownHook {
+ try {
+ result.drop(Constants.MetadataDataset)
+ result.drop(Constants.TiledSummaryDataset)
+ } catch {
+ case e: Exception =>
+ info(s"Failed to cleanup KVStore: ${e.getMessage}")
+ }
+ }
result
}
Committable suggestion skipped: line range outside the PR's diff.
// fetch keys
val tileKeys = driftStore.tileKeysForJoin(join)
println(tileKeys)
🛠️ Refactor suggestion
Add assertions for DriftStore tile keys
The test prints tile keys but doesn't validate them.
// fetch keys
val tileKeys = driftStore.tileKeysForJoin(join)
- println(tileKeys)
+ tileKeys should not be empty
+ // Verify key format
+ tileKeys.foreach { key =>
+ key should include(namespace)
+ key should include(join.metaData.nameToFilePath)
+ }
+ info(s"Fetched ${tileKeys.size} tile keys")
val startMs = PartitionSpec.daily.epochMillis("2023-01-01")
val endMs = PartitionSpec.daily.epochMillis("2023-01-29")
val summariesFuture = driftStore.getSummaries(join, Some(startMs), Some(endMs))
val summaries = Await.result(summariesFuture, Duration.create(10, TimeUnit.SECONDS))
println(summaries)
🛠️ Refactor suggestion
Improve async operation handling
The code uses hardcoded timeout duration and lacks proper error handling for async operations.
+ private val DefaultTimeout = Duration(10, TimeUnit.SECONDS)
+
val startMs = PartitionSpec.daily.epochMillis("2023-01-01")
val endMs = PartitionSpec.daily.epochMillis("2023-01-29")
val summariesFuture = driftStore.getSummaries(join, Some(startMs), Some(endMs))
- val summaries = Await.result(summariesFuture, Duration.create(10, TimeUnit.SECONDS))
+ val summaries = try {
+ Await.result(summariesFuture, DefaultTimeout)
+ } catch {
+ case e: Exception =>
+ fail(s"Failed to fetch summaries: ${e.getMessage}")
+ }
- println(summaries)
+ summaries should not be empty
+ info(s"Fetched ${summaries.size} summaries")
// fetch drift series
val driftSeriesFuture = driftStore.getDriftSeries(
join.metaData.nameToFilePath,
DriftMetric.JENSEN_SHANNON,
lookBack = new Window(7, chronon.api.TimeUnit.DAYS),
startMs,
endMs
)
val driftSeries = Await.result(driftSeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))

val txnsTable = s"$namespace.txns"
spark.sql(s"DROP TABLE IF EXISTS $txnsTable")
driftSeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")}

DataFrameGen.events(spark, dollarTransactions, numRows, partitions = partitions).save(txnsTable)
TableUtils(spark).loadTable(txnsTable)
println("Drift series fetched successfully".green)
🛠️ Refactor suggestion
Add assertions for drift calculation results
The test prints drift series results but doesn't validate them. Add proper assertions to verify the expected behavior.
val driftSeriesFuture = driftStore.getDriftSeries(
join.metaData.nameToFilePath,
DriftMetric.JENSEN_SHANNON,
lookBack = new Window(7, chronon.api.TimeUnit.DAYS),
startMs,
endMs
)
- val driftSeries = Await.result(driftSeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))
+ val driftSeries = try {
+ Await.result(driftSeriesFuture.get, DefaultTimeout)
+ } catch {
+ case e: Exception =>
+ fail(s"Failed to fetch drift series: ${e.getMessage}")
+ }
- driftSeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")}
+ driftSeries should not be empty
+ driftSeries.foreach { s =>
+ info(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")
+ // Verify drift values
+ s.getKey.getColumn should not be null
+ s.getPercentilDriftSeries should not be empty
+ // Verify drift values are within expected range
+ s.getPercentilDriftSeries.toScala.foreach { drift =>
+ drift should be >= 0.0
+ drift should be <= 1.0
+ }
+ }
- println("Drift series fetched successfully".green)
+ info("Drift series fetched successfully")
@nikhil-zlai largely looks good. I flagged a couple of code changes that I'm not sure are intended, as they seem to be yanking out some parts of the upstream base (and not directly linked to drift).
A question - are your offset update changes part of this PR? I still saw the old behavior (start-offset, end), so I'm confirming whether I either missed it elsewhere or it needs to come as a follow-up.
_ match {
case Success(responseContext) => Some(responseContext)
// TODO instrument failures
case Failure(exception) => exception.printStackTrace(); None
works for me
@@ -491,45 +490,6 @@ object Driver {
}
}

object DailyStats {
revert? We shouldn't need to drop these, right?
this is the half-finished version from Cristian - might as well kill it?
yeah I guess, it would be nice to have this in a follow-up since this PR is super large as is, but now that it's here we can go with this.
@@ -65,126 +47,7 @@ class FetchStatsTest extends TestCase {
val namespace = "fetch_stats"
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
private val today = tableUtils.partitionSpec.at(System.currentTimeMillis())
private val yesterday = tableUtils.partitionSpec.before(today)

def testFetchStats(): Unit = {
revert?
same as above
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (14)
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (2)
3-26
: Consider organizing imports for better readabilityConsider organizing imports into logical groups:
- Scala/Java standard libraries
- External dependencies
- Internal chronon packages
package ai.chronon.spark.test.stats.drift -import ai.chronon -import ai.chronon.api.ColorPrinter.ColorString -import ai.chronon.api.Constants -import ai.chronon.api.DriftMetric -import ai.chronon.api.Extensions.MetadataOps -import ai.chronon.api.PartitionSpec -import ai.chronon.api.Window -import ai.chronon.online.KVStore -import ai.chronon.online.stats.DriftStore -import ai.chronon.spark.SparkSessionBuilder -import ai.chronon.spark.TableUtils -import ai.chronon.spark.stats.drift.Summarizer -import ai.chronon.spark.stats.drift.SummaryUploader -import ai.chronon.spark.test.InMemoryKvStore -import ai.chronon.spark.test.MockApi -import org.apache.spark.sql.SparkSession -import org.scalatest.flatspec.AnyFlatSpec -import org.scalatest.matchers.should.Matchers -import java.util.concurrent.TimeUnit -import scala.concurrent.Await -import scala.concurrent.duration.Duration -import scala.util.ScalaJavaConversions.ListOps +// Scala/Java standard libraries +import java.util.concurrent.TimeUnit +import scala.concurrent.{Await, duration} +import scala.util.ScalaJavaConversions.ListOps + +// External dependencies +import org.apache.spark.sql.SparkSession +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +// Internal chronon packages +import ai.chronon +import ai.chronon.api.{ + ColorPrinter, Constants, DriftMetric, + Extensions, PartitionSpec, Window +} +import ai.chronon.api.ColorPrinter.ColorString +import ai.chronon.api.Extensions.MetadataOps +import ai.chronon.online.{KVStore, stats} +import ai.chronon.online.stats.DriftStore +import ai.chronon.spark.{SparkSessionBuilder, TableUtils} +import ai.chronon.spark.stats.drift.{Summarizer, SummaryUploader} +import ai.chronon.spark.test.{InMemoryKvStore, MockApi}
96-102
: Review drift analysis time window configurationThe lookback window is hardcoded to 7 days. Consider:
- Making this configurable via test parameters
- Verifying if 7 days is sufficient for detecting drift patterns
- Adding test cases with different window sizes to validate boundary conditions
+ private val DefaultLookbackDays = 7 + val driftSeriesFuture = driftStore.getDriftSeries( join.metaData.nameToFilePath, DriftMetric.JENSEN_SHANNON, - lookBack = new Window(7, chronon.api.TimeUnit.DAYS), + lookBack = new Window(DefaultLookbackDays, chronon.api.TimeUnit.DAYS), startMs, endMs )online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (3)
26-82
: Consider breaking down the large method into smaller, focused methods.The
between
method has grown quite large with multiple similar calculation blocks. Consider extracting each drift calculation into a separate private method to improve readability and maintainability.Example refactoring:
+private def calculatePercentileDrift(a: TileSummary, b: TileSummary, metric: DriftMetric, result: TileDrift): Unit = { + if (a.isSetPercentiles && b.isSetPercentiles) { + val drift = percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric) + result.setPercentileDrift(drift) + } +} +private def calculateHistogramDrift(a: TileSummary, b: TileSummary, metric: DriftMetric, result: TileDrift): Unit = { + if (a.isSetHistogram && b.isSetHistogram) { + val drift = histogramDistance(a.getHistogram, b.getHistogram, metric) + result.setHistogramDrift(drift) + } +} def between(a: TileSummary, b: TileSummary, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = { val result = new TileDrift() - if (a.isSetPercentiles && b.isSetPercentiles) { - val drift = percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric) - result.setPercentileDrift(drift) - } - if (a.isSetHistogram && b.isSetHistogram) { - val drift = histogramDistance(a.getHistogram, b.getHistogram, metric) - result.setHistogramDrift(drift) - } + calculatePercentileDrift(a, b, metric, result) + calculateHistogramDrift(a, b, metric, result) // ... similar refactoring for other drift calculations result }
90-91
: Track optimization TODO with an issue.The TODO comment indicates a potential performance optimization. This should be tracked properly.
Would you like me to create a GitHub issue to track this optimization task? The issue would include:
- Current implementation details
- Performance implications
- Proposed optimization approach using a single pass
86-104
: Add detailed method documentation.The method would benefit from more comprehensive documentation explaining:
- The purpose and expected behavior
- Parameter requirements and constraints
- Return value format and meaning
- Edge cases handling
Example documentation:
/** * Calculates drift metrics between pairs of tile summaries based on a time window. * * @param summariesWithTimestamps Array of (TileSummary, timestamp) pairs to analyze * @param metric The drift metric to use for calculations * @param startMs The starting timestamp for drift calculations * @param lookBackWindow The time window to look back for previous summaries * @return Array of (TileDrift, timestamp) pairs containing drift metrics * * Note: For each summary with timestamp >= startMs, the method finds a previous * summary using lookBackWindow and calculates drift between them. */online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (3)
16-19
: Remove outdated TODO commentThe TODO comment about implementing
TileDrift => TileDriftSeries
is outdated since this functionality is already implemented in this file.-// TODO: TileDrift => TileDriftSeries
77-79
: Consider adding documentation for the pipe operatorThe pipe operator (
|>
) is a powerful feature, but its purpose might not be immediately clear to other developers. Consider adding ScalaDoc to explain its usage and benefits.+ /** + * Enables function chaining using the |> operator. + * Example: value |> function1 |> function2 + * is equivalent to: function2(function1(value)) + */ implicit class PipeOps[A](val value: A) extends AnyVal {
81-112
: Consider performance optimization for large datasetsThe current implementation creates multiple intermediate arrays and performs multiple iterations. For large datasets, this could impact performance. Consider:
- Using a single pass to collect all required data
- Preallocating collections with known sizes
- Using parallel collections for large datasets
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (5)
35-35
: Enhance error message for better debuggingThe error message could be more descriptive by including the expected format or requirements for the drift tile size.
- require(tileSize.nonEmpty, s"Drift tile size not set for join $joinName") + require(tileSize.nonEmpty, s"Drift tile size not set for join '$joinName'. Please ensure drift tile size is configured in join metadata.")
111-117
: Enhance error handling with context informationWhen handling failures, include relevant context information such as the tile key and group name to aid in debugging.
case Failure(exception) => - exception.printStackTrace(); None + val errorContext = s"Failed to process response for tile key: ${requestContext.tileKey}, group: ${requestContext.groupName}" + logger.error(errorContext, exception) + None
163-178
: Document method behavior and address TODOThe method would benefit from ScalaDoc documentation explaining its behavior and parameters. Additionally, the TODO about over-fetching should be tracked.
Would you like me to:
- Generate ScalaDoc documentation for this method?
- Create a GitHub issue to track the over-fetching optimization?
180-191
: Maintain documentation consistencyFor consistency with getDriftSeries, this method should also have ScalaDoc documentation explaining its behavior and parameters.
197-199
: Enhance protocol warning documentationThe warning about the compact protocol could be more specific about the nature of the bug and why binary protocol is preferred.
- // crazy bug in compact protocol - do not change to compact + // WARNING: Using binary protocol instead of compact protocol due to known serialization issues + // with complex nested structures in Thrift compact protocol. + // See: [Add link to issue/documentation if available]api/thrift/api.thrift (1)
253-256
: Enhance documentation for TileSummary struct.The current documentation provides a basic overview, but could be enhanced with:
- Examples of what constitutes a "slice"
- Explanation of how categorical vs continuous types are handled
- Description of container type handling
-// summary of distribution & coverage etc for a given (table, column, slice, tileWindow) -// for categorical types, distribution is histogram, otherwise percentiles -// we also handle container types by counting inner value distribution and inner value coverage +/** + * Summary of distribution and coverage metrics for a given (table, column, slice, tileWindow). + * + * @param percentiles For continuous types (numeric), stores distribution as percentiles + * @param histogram For categorical types, stores distribution as value frequencies + * @param count Total number of non-null values + * @param nullCount Total number of null values + * + * Container type handling: + * - innerCount: Total elements across all containers (e.g., total items in all arrays) + * - innerNullCount: Total null elements across all containers + * - lengthPercentiles: Distribution of container lengths + * + * String type handling: + * - stringLengthPercentiles: Distribution of string lengths for high cardinality fields + */
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
api/thrift/api.thrift
(4 hunks)online/src/main/scala/ai/chronon/online/stats/DriftStore.scala
(1 hunks)online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala
(1 hunks)online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
(1 hunks)online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala
(1 hunks)spark/src/test/scala/ai/chronon/spark/test/FeatureWithLabelJoinTest.scala
(1 hunks)spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala
(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- spark/src/test/scala/ai/chronon/spark/test/FeatureWithLabelJoinTest.scala
🚧 Files skipped from review as they are similar to previous changes (1)
- online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala
🔇 Additional comments (9)
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (1)
46-55
: Verify impact of reduced sample size
The sample size has been reduced from 1,000,000 to 100,000 records. While this improves test execution time, please verify that:
- The smaller dataset still provides adequate coverage for drift detection
- Edge cases are still represented in the reduced dataset
- Performance characteristics can still be evaluated effectively
✅ Verification successful
Let me check the content of these files to understand if there are any specific performance requirements or benchmarks that might be affected by the reduced dataset size.
Let me check the default sample size in PrepareData
and look for any drift detection thresholds or requirements that might be affected by the sample size.
Let me check if there are any test assertions or validations in the drift test that might depend on the sample size.
After analyzing the complete context, I can now provide the final response:
Sample size reduction is acceptable for this end-to-end test
The reduced sample size (100K) is sufficient for this test case because:
- This is an end-to-end test that verifies the workflow functionality (data preparation, summarization, upload, and fetch) rather than statistical accuracy
- The test focuses on validating the drift detection pipeline's operational aspects, including tile key generation, summary series, and drift series computation
- No specific statistical thresholds or performance benchmarks are tied to the sample size in this test
The date range change (2023-01-01 to 2023-01-30) is intentional and appropriate as it aligns with the drift series validation period used in the test assertions.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if there are any performance-related tests or benchmarks that might be affected
rg -l "performance|benchmark|stress" --type scala
Length of output: 433
Script:
#!/bin/bash
# Check the content of test files for performance-related assertions or requirements
rg -A 5 -B 5 "performance|benchmark|stress" spark/src/test/scala/ai/chronon/spark/test/FetcherTest.scala
rg -A 5 -B 5 "performance|benchmark|stress" aggregator/src/test/scala/ai/chronon/aggregator/test/SawtoothAggregatorTest.scala
# Also check if there are any other drift tests with specific dataset size requirements
rg -A 5 -B 5 "generateFraudSampleData" spark/src/test/scala/ai/chronon/spark/test/stats/drift/
Length of output: 3702
Script:
#!/bin/bash
# Check for drift detection thresholds or requirements
rg -A 10 "class PrepareData" spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala
# Look for any drift-related configurations or constants
rg "val.*Threshold|val.*Sample|val.*Size" spark/src/main/scala/ai/chronon/spark/stats/drift/
Length of output: 898
Script:
#!/bin/bash
# Check the complete DriftTest file for assertions and validations
cat spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala
# Also check for any drift-related test utilities
rg -l "DriftTest" spark/src/test/scala/ai/chronon/spark/test/
Length of output: 4582
online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (1)
1-15
: LGTM! Well-organized imports
The package structure and imports are properly organized, with clear separation between API and Java types.
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (4)
1-27
: LGTM! Well-structured class definition with appropriate imports
The class is well-organized with clear inheritance from MetadataStore and proper parameter defaults.
57-71
: LGTM! Well-designed helper classes with clear responsibilities
The case classes and conversion methods are well-structured and maintain immutability.
131-143
: LGTM! Well-documented time range handling
The Range class implementation is clear, with excellent documentation explaining the different scenarios. The logic correctly handles both small and large ranges with respect to lookback periods.
156-160
: Consider handling empty sequence edge case
The sequence matching could fail if the Future.sequence returns an empty or differently sized sequence.
- Future.sequence(Seq(currentSummaries, oldSummaries)).map {
- case Seq(current, old) =>
- old ++ current
+ Future.sequence(Seq(currentSummaries, oldSummaries)).map {
+ case Seq(current, old) =>
+ old ++ current
+ case other =>
+ logger.warn(s"Unexpected response sequence size: ${other.size}")
+ Seq.empty
}
api/thrift/api.thrift (3)
297-318
: Previous review comment about naming inconsistencies is still valid.
319-335
: Previous review comment about restructuring is still valid.
393-395
: LGTM!
The addition of driftSpec to MetaData is clean and well-documented.
val summarySeriesFuture = driftStore.getSummarySeries(
join.metaData.nameToFilePath,
startMs,
endMs
)
val summarySeries = Await.result(summarySeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))
summarySeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentiles.toScala}")}
println("Summary series fetched successfully".green)
🛠️ Refactor suggestion
Add validation for summary series data
The test retrieves summary series but doesn't validate the data structure or content. Add assertions to verify:
- The structure of the summary series
- The completeness of the data
- The validity of percentile values
val summarySeries = Await.result(summarySeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))
- summarySeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentiles.toScala}")}
- println("Summary series fetched successfully".green)
+ summarySeries should not be empty
+ summarySeries.foreach { s =>
+ withClue(s"Summary series for column ${s.getKey.getColumn}") {
+ s.getKey.getColumn should not be null
+ val percentiles = s.getPercentiles.toScala
+ percentiles should not be empty
+ // Verify percentile values are within valid range [0,1]
+ percentiles.foreach { p =>
+ p should be >= 0.0
+ p should be <= 1.0
+ }
+ }
+ }
+ info("Summary series validated successfully")
will do in a later PR
@nikhil-zlai, understood. Would you like me to open a new GitHub issue to track this?
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
spark/src/main/scala/ai/chronon/spark/Driver.scala (2)
841-850: Consider adding more detailed logging.
While the implementation is correct, consider logging the properties being set for better observability:
  def run(args: Args): Unit = {
    logger.info(s"Creating table '${Constants.TiledSummaryDataset}'")
    val store = args.api.genKvStore
    val props = Map("is-time-sorted" -> "true")
+   logger.info(s"Setting properties: $props")
    store.create(Constants.TiledSummaryDataset, props)
  }
Line range hint 857-892: Enhance error handling with more context and cleanup.
While the error handling is good, it could be improved to provide more context and handle cleanup.
  try {
    val df = sparkSession.read.parquet(args.parquetPath())
    val (result, summaryExprs) = summarizer.computeSummaryDf(df)
    val packer = new SummaryPacker(confPath, summaryExprs, summarizer.tileSize, summarizer.sliceColumns)
    val (packed, _) = packer.packSummaryDf(result)
    val uploader = new SummaryUploader(packed, args.api)
    uploader.run()
  } catch {
    case e: Exception =>
-     logger.error(s"Failed to summarize and upload data for conf: $confPath", e)
+     logger.error(
+       s"""Failed to summarize and upload data:
+          |Conf: $confPath
+          |Parquet: ${args.parquetPath()}
+          |Time Column: ${args.timeColumn.getOrElse("default")}""".stripMargin, e)
+     // Consider cleanup of any temporary resources here
      throw e
  }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
online/src/main/java/ai/chronon/online/JavaFetcher.java (0 hunks)
online/src/main/scala/ai/chronon/online/Fetcher.scala (1 hunks)
spark/src/main/scala/ai/chronon/spark/Driver.scala (5 hunks)
spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala (0 hunks)
💤 Files with no reviewable changes (2)
- online/src/main/java/ai/chronon/online/JavaFetcher.java
- spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala
🔇 Additional comments (3)
spark/src/main/scala/ai/chronon/spark/Driver.scala (3)
642-679: LGTM! Well-structured error handling and result formatting.
The changes implement proper error handling, sorted output using TreeMap, and clear logging of results with timing information.
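As a side note for readers unfamiliar with the pattern, here is a tiny self-contained sketch of the TreeMap-based sorted output mentioned above; the map keys and values are made up for illustration and are not the actual Driver.scala results.

import java.util.TreeMap

object SortedReport {
  def main(args: Array[String]): Unit = {
    // Hypothetical timing results; a TreeMap keeps keys sorted, so log output is deterministic.
    val results = new TreeMap[String, Long]()
    results.put("summary_series_fetch_ms", 17L)
    results.put("drift_series_fetch_ms", 42L)

    val it = results.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      println(s"${e.getKey} took ${e.getValue}ms") // iterates in sorted key order
    }
  }
}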
927-930: LGTM! Consistent with the naming changes.
The argument class renaming maintains consistency with the object renaming from CreateStatsTable to CreateSummaryDataset.
968-969: LGTM! Consistent command handling.
The case matches are correctly updated to use the new names while maintaining the consistent pattern of command handling.
// e.g. if we have a range of 1 day and lookBack of 1 week, we will make two queries (8 days ago, 7 days ago) + (1 day ago, today)
// if the range is too large and the lookBack is too small, we will make single query
// e.g. if we have a range of 1 week and lookBack of 1 day, we will make single query - (8 days ago, today)
def lookBack(lookBackMs: Long): (Range, Option[Range]) = {
@piyush-zlai this logic does the lookback scanning
ah thanks for flagging!
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
docs/source/setup/Online_Integration.md (1 hunks)
🔇 Additional comments (1)
docs/source/setup/Online_Integration.md (1)
21-22: LGTM! Parameter rename improves clarity.
The rename from afterTsMillis to startTsMillis better reflects that this is the start of a timestamp range scan. The documentation clearly explains the parameter's purpose and implementation considerations.
// `startTsMillis` implies that this is a range scan of all values with `timestamp` >= to the specified one. This can be implemented efficiently, if `timestamp` can be a secondary key. Some databases have a native version id concept which also can map to timestamp.
case class GetRequest(keyBytes: Array[Byte], dataset: String, startTsMillis: Option[Long] = None)
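For illustration only, a sketch of how a caller might exercise this contract; the key, dataset name, and time window below are placeholders, and only the GetRequest shape comes from the snippet above (its definition is copied here so the sketch compiles standalone).

// Copied from the snippet above so this example is self-contained.
case class GetRequest(keyBytes: Array[Byte], dataset: String, startTsMillis: Option[Long] = None)

object RangeScanExample {
  def main(args: Array[String]): Unit = {
    val keyBytes = "example-entity-key".getBytes("UTF-8") // placeholder key
    val sevenDaysMs = 7L * 24 * 60 * 60 * 1000
    val request = GetRequest(
      keyBytes = keyBytes,
      dataset = "EXAMPLE_SUMMARY_DATASET",                // placeholder dataset name
      startTsMillis = Some(System.currentTimeMillis() - sevenDaysMs)
    )
    // The store is expected to return every value with timestamp >= startTsMillis,
    // ideally via a time-sorted secondary key rather than a full scan.
    println(request)
  }
}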
💡 Codebase verification
Update documentation and code comments to reflect the parameter name change from afterTsMillis to startTsMillis
The verification revealed inconsistencies in parameter naming across the codebase:
- online/src/main/scala/ai/chronon/online/Api.scala: Contains outdated comments referring to afterTsMillis
- cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala: Uses afterTsMillis in comments
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala: Contains outdated reference to afterTsMillis
🔗 Analysis chain
Verify documentation consistency across the codebase.
Since this is a breaking API change, ensure that all related documentation and examples are updated to use startTsMillis.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to afterTsMillis in documentation and examples
# Search for afterTsMillis in markdown files
echo "Checking markdown files for outdated parameter name..."
rg --type md "afterTsMillis"
# Search for afterTsMillis in code examples
echo "Checking code examples for outdated parameter name..."
rg --type scala "afterTsMillis"
Length of output: 891
Summary
Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
Summary by CodeRabbit
New Features
- Introduced a method for converting objects to a pretty-printed JSON string format.
- Added functionality for calculating drift metrics between `TileSummary` instances.
- Enhanced drift analysis capabilities with new metrics and structures.
- New endpoints for model prediction and model drift in the API.
- Introduced utility functions for transforming and aggregating data related to `TileSummary` and `TileDrift`.
- Enhanced metadata handling with new constants and improved dataset references.
- Added a method for processing percentiles and breakpoints to generate interval assignments.
Bug Fixes
- Improved error handling in various methods for better clarity and logging.
Refactor
- Renamed variables and methods for clarity and consistency.
- Updated method signatures to accommodate new features and improve usability.
- Consolidated import statements for better organization.
- Removed deprecated objects and methods to streamline functionality.
Tests
- Added comprehensive unit tests for drift metrics and pivot functionality.
- Enhanced test coverage for new and modified features.
- Removed outdated tests and added new tests for handling key mappings in joins.