[WIP] Drift metrics #59
Conversation
Walkthrough
The changes in this pull request introduce new methods and classes aimed at enhancing data processing and analysis, particularly focusing on JSON handling and drift analysis. Key modifications include the addition of pretty-printing JSON functionality in ThriftJsonCodec.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (7)
api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (1)
52-57: Consider optimizing the pretty-printing implementation.
While the implementation is functional, there are a few suggestions for improvement:
- The current implementation performs double conversion (string → JsonElement → pretty string). Consider creating a dedicated pretty-printing serializer to avoid this overhead.
- Add error handling for potential JSON parsing failures during the pretty-printing process.
Here's a suggested improvement:
```diff
 @transient private lazy val prettyGson = new GsonBuilder().setPrettyPrinting().create()
+@transient private lazy val prettySerializer: ThreadLocal[TSerializer] = new ThreadLocal[TSerializer] {
+  override def initialValue(): TSerializer = new TSerializer(new TSimpleJSONProtocol.Factory())
+}

 def toPrettyJsonStr[T <: TBase[_, _]: Manifest](obj: T): String = {
-  val raw = toJsonStr(obj)
-  val je = prettyGson.fromJson(raw, classOf[com.google.gson.JsonElement])
-  prettyGson.toJson(je)
+  try {
+    val raw = new String(prettySerializer.get().serialize(obj), Constants.UTF8)
+    val je = prettyGson.fromJson(raw, classOf[com.google.gson.JsonElement])
+    prettyGson.toJson(je)
+  } catch {
+    case e: Exception =>
+      logger.error(s"Failed to pretty-print object: ${e.getMessage}")
+      toJsonStr(obj) // Fallback to regular JSON string
+  }
 }
```

spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala (3)
Line range hint 39-41: Remove hardcoded normalization value in KS distance calculation.
The hardcoded value of 200.0 for normalization makes assumptions about the data range. This could lead to incorrect normalization if the actual data range differs.
Consider deriving the normalization factor from the actual data range:

```diff
-def kolmogorovSmirnovDistance(p: Seq[Double], q: Seq[Double]): Double = {
-  p.zip(q).map { case (pi, qi) => abs(pi - qi) }.max / 200.0 // Divide by range (200) to normalize
+def kolmogorovSmirnovDistance(p: Seq[Double], q: Seq[Double]): Double = {
+  val range = (p ++ q).max - (p ++ q).min
+  p.zip(q).map { case (pi, qi) => abs(pi - qi) }.max / range
 }
```
124-139: Improve documentation and variable naming in pattern matching.
While the pattern matching improves readability, consider these enhancements:

```diff
-      equalityHits match {
-        case eh if eh == 0 => interval / (percentiles(i) - percentiles(i - 1))
+      // Handle different cases of matching percentile values at break points
+      equalityHits match {
+        case 0 => // No exact matches at break point
+          interval / (percentiles(i) - percentiles(i - 1))
-        case eh if eh == 1 =>
+        case 1 => // Single exact match at break point
           if (i == percentiles.length) {
             interval / (percentiles(i - 1) - percentiles(i - 2))
           } else {
             interval / (percentiles(i) - percentiles(i - 1))
           }
-        case eh if eh > 1 =>
+        case hits if hits > 1 => // Multiple exact matches (point-mass)
           // point-mass case: when two or more continuous percentile points, pct(i), pct(i+1), pct(i+2)
           // have the same value as the break
-          (equalityHits - 1) * interval
+          (hits - 1) * interval
       }
```
Line range hint 71-82: Move test code from main method to proper unit tests.
The TODO comment indicates this code should be moved to unit tests. This would improve test organization and coverage.
Would you like me to help create proper unit tests for these drift metrics? I can:
- Generate test cases for each metric
- Include edge cases and boundary conditions
- Add proper assertions and test documentation
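As a rough starting point, a sketch along these lines could seed that work. It assumes the metric functions are exposed on the renamed DriftMetrics object with the Seq[Double] signatures used in this file; the class name, sample inputs, and tolerance are illustrative only, not the final test plan.

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class DriftMetricsTest extends AnyFlatSpec with Matchers {

  // Evenly spaced percentiles of a hypothetical feature, used as a baseline distribution.
  private val baseline: Seq[Double] = (0 to 20).map(_ * 10.0)

  "kolmogorovSmirnovDistance" should "be 0 for identical percentile arrays" in {
    DriftMetrics.kolmogorovSmirnovDistance(baseline, baseline) shouldBe 0.0 +- 1e-9
  }

  it should "be positive when one distribution is shifted" in {
    val shifted = baseline.map(_ + 50.0)
    DriftMetrics.kolmogorovSmirnovDistance(baseline, shifted) should be > 0.0
  }
}
```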
spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala (2)
81-81: LGTM! Consider some improvements for robustness.
The addition of warehouseId to prevent directory conflicts is good. However, consider these improvements:
- Sanitize the name parameter to prevent path issues
- Implement directory cleanup on session termination
- Consider using java.nio.file.Paths.get() for robust path construction

Here's a suggested improvement:

```diff
-      .config("spark.local.dir", s"/tmp/$userName/${name}_$warehouseId")
+      .config("spark.local.dir", Paths.get("/tmp", userName, s"${name.replaceAll("[^a-zA-Z0-9-_]", "_")}_$warehouseId").toString)
```

Also, consider adding a cleanup mechanism:

```scala
// Add to the build method before returning the spark session
spark.sparkContext.addSparkListener(new SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    val localDir = new File(s"/tmp/$userName/${name}_$warehouseId")
    if (localDir.exists()) {
      org.apache.commons.io.FileUtils.deleteDirectory(localDir)
    }
  }
})
```
Line range hint 28-28: Consider session-specific warehouse IDs.
Currently, warehouseId is defined at the object level, meaning all sessions created by this builder instance will share the same ID. Consider moving it to the build method for true session isolation.

```diff
-  private val warehouseId = java.util.UUID.randomUUID().toString.takeRight(6)

   def build(...): SparkSession = {
+    val sessionId = java.util.UUID.randomUUID().toString.takeRight(6)
     ...
-      .config("spark.local.dir", s"/tmp/$userName/${name}_$warehouseId")
+      .config("spark.local.dir", s"/tmp/$userName/${name}_$sessionId")
```

spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (1)
100-103: LGTM! Variable rename improves code clarity.
The rename from labelTable to labelDf better reflects that we're working with Spark DataFrames. This change enhances code readability by using more precise terminology.
Consider adding a brief comment above the labelDf declaration to document what this DataFrame represents and its schema structure, which would help future maintainers understand the data flow.
Also applies to: 122-122
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala (2 hunks)
🔇 Additional comments (5)
api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (2)
28-28
: LGTM: Import addition is appropriate
The addition of GsonBuilder import is well-placed and necessary for the new pretty-printing functionality.
52-57
: Verify the necessity of pretty-printing for drift metrics
Given that this PR is titled "Drift metrics", let's verify if this pretty-printing functionality is actually needed for the drift metrics implementation.
spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala (2)
Line range hint 1-7
: LGTM! Good package organization.
The move to a dedicated drift package and renaming to DriftMetrics improves code organization and better reflects the component's purpose.
111-112
: LGTM! Good input validation.
The added requirement check helps prevent invalid input data.
spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala (1)
81-81
: Verify the impact on concurrent Spark sessions.
Let's verify that this change properly handles concurrent Spark sessions and doesn't cause any directory conflicts.
✅ Verification successful
Let me gather more specific information about the temp directory handling.
Concurrent Spark sessions are properly handled with unique directory paths
The implementation safely handles concurrent sessions through:
- Unique directory paths using UUID for each warehouse: warehouseId = java.util.UUID.randomUUID().toString.takeRight(6)
- User-specific directories: /tmp/$userName/
- Session-specific subdirectories with unique names: ${name}_$warehouseId
These mechanisms ensure that:
- Each Spark session gets its own isolated local directory
- Different users have separate spaces
- Multiple sessions from the same user don't conflict
- Temporary files are properly segregated
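For illustration, the pieces above combine roughly as follows (a standalone sketch, not the builder's actual code; the job name is a placeholder):

```scala
// Mirrors the path scheme described above: per-user root plus a per-builder unique suffix.
val userName    = System.getProperty("user.name")
val warehouseId = java.util.UUID.randomUUID().toString.takeRight(6)
val name        = "example_job" // placeholder session name

val sparkLocalDir = s"/tmp/$userName/${name}_$warehouseId"
println(sparkLocalDir) // e.g. /tmp/alice/example_job_3fa8c1
```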
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential directory conflicts or issues with concurrent sessions
# Test 1: Look for any hardcoded spark.local.dir configurations that might conflict
rg -l 'spark\.local\.dir'
# Test 2: Check for any cleanup mechanisms or related SparkListener implementations
ast-grep --pattern 'class $_ extends SparkListener {
$$$
}'
# Test 3: Look for any existing temp directory cleanup logic
rg -l 'deleteDirectory|cleanUp|cleanup|delete.*Directory'
Length of output: 581
Script:
#!/bin/bash
# Check the actual implementation of spark.local.dir configuration and related cleanup
rg -A 10 'spark\.local\.dir' spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
# Check cleanup implementations in LocalTableExporter
rg -A 10 'deleteDirectory|cleanUp|cleanup|delete.*Directory' spark/src/main/scala/ai/chronon/spark/LocalTableExporter.scala
# Check if there are any other session management or cleanup related code
rg -A 5 'session|Session' spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
# Look for any concurrent session handling mechanisms
ast-grep --pattern 'def $name($$$) = {
$$$
SparkSession
$$$
}'
Length of output: 3774
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- .github/workflows/test_scala_and_python.yaml (2 hunks)
- api/src/main/scala/ai/chronon/api/ColorPrinter.scala (0 hunks)
💤 Files with no reviewable changes (1)
- api/src/main/scala/ai/chronon/api/ColorPrinter.scala
🧰 Additional context used
🪛 actionlint
.github/workflows/test_scala_and_python.yaml
52-52: label "large" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
71-71: label "large" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (1)
.github/workflows/test_scala_and_python.yaml (1)
Line range hint 52-90
: Verify resource requirements for Spark tests
Both modified jobs use significant JVM memory (-Xmx8G). Please ensure the selected runner has sufficient resources to handle these memory requirements.
Consider:
- Documenting the resource requirements in the workflow file
- Setting up monitoring to track memory usage during test runs
- Implementing test segmentation if memory usage is consistently high
Also applies to: 71-90
✅ Verification successful
The previous output showed consistent memory settings (-Xmx8G) across multiple test jobs. Let's verify the runner specifications and any available documentation about resource requirements before making a final determination.
Memory requirements aligned with runner specifications
The verification shows that jobs with high memory requirements (-Xmx8G) are appropriately configured to run on the large runner, while other jobs use ubuntu-latest. This indicates proper resource allocation awareness in the workflow design:
- Jobs with intensive Spark tests use the large runner
- Other test jobs use the standard ubuntu-latest runner
- Memory settings are consistently set to 8GB max heap across all jobs
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check memory settings across Spark test jobs
# Look for memory-related configurations in the workflow
rg -A 2 "SBT_OPTS.*Xmx" .github/workflows/test_scala_and_python.yaml
# Look for Spark memory configurations in Scala tests
rg "spark\..*memory" --type scala
Length of output: 983
Script:
#!/bin/bash
# Check the runner specifications in the workflow file
rg "runs-on:" -A 2 .github/workflows/test_scala_and_python.yaml
# Check if there are any resource-related configurations or documentation
rg -i "memory|resource|runner" .github/workflows/README.md || true
# Look for any GitHub Actions workflow documentation in the repository
fd README.md .github/
Length of output: 911
🧰 Tools
🪛 actionlint
71-71: label "large" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- .github/workflows/test_scala_and_python.yaml (2 hunks)
- api/src/main/scala/ai/chronon/api/ColorPrinter.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- api/src/main/scala/ai/chronon/api/ColorPrinter.scala
🧰 Additional context used
🪛 actionlint
.github/workflows/test_scala_and_python.yaml
52-52: label "large" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
71-71: label "large2" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (1)
.github/workflows/test_scala_and_python.yaml (1)
52-52
: Previous review comment is still applicable.
The issue with the invalid runner label 'large' has been previously identified and the suggested fix remains valid.
🧰 Tools
🪛 actionlint
52-52: label "large" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- .github/workflows/test_scala_and_python.yaml (2 hunks)
🧰 Additional context used
🪛 actionlint
.github/workflows/test_scala_and_python.yaml
52-52: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (1)
.github/workflows/test_scala_and_python.yaml (1)
67-67
: Verify memory requirements for Spark tests
The SBT memory settings have been significantly increased from 8G to 24G. While this aligns with the intention to use more resources, please verify:
- Whether these tests actually require such high memory
- If this could cause issues when running locally
- If all CI environments can support this memory allocation
✅ Verification successful
Based on the workflow file content, I can now provide a final response:
Memory settings are appropriately configured for different test suites
The increased memory (24G) is specifically allocated to the other_spark_tests job running on a dedicated high-resource runner (ubuntu-8_cores-32_gb). Other Spark test jobs (join, mutation, fetcher) use a lower 8G configuration on standard runners. This split configuration suggests:
- The main Spark test suite requires more resources and runs on appropriate infrastructure
- Other specialized test suites are optimized with lower memory requirements
- The CI environment explicitly supports these allocations with the high-resource runner
Local development isn't impacted as these settings are CI-specific. The configuration is well-balanced across different test categories based on their resource needs.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check memory usage patterns in Spark tests
# Look for memory-intensive test configurations
rg -A 5 "spark.executor.memory|spark.driver.memory|spark.memory" --type scala
# Find potential memory-intensive operations in test files
ast-grep --pattern 'class $_ extends $_ {
$$$
test($_, $_) {
$$$
collect($$$)
$$$
}
$$$
}'
Length of output: 192
Script:
#!/bin/bash
# Let's try a different approach to analyze memory usage patterns
# Check for large test data files or data generation in tests
rg -g "*/test/*" -i "generate.*data|large.*dataset" --type scala
# Look for test configurations and settings
fd "application.conf|reference.conf" --exec cat {}
# Check commit history for memory-related changes
git log -p --grep="memory\|heap\|OOM" .github/workflows/test_scala_and_python.yaml
# Look for spark session configurations in test files
rg "SparkSession|sparkContext" --type scala -A 5 -g "*/test/*"
Length of output: 1413
Script:
#!/bin/bash
# Let's check the workflow file history and current state
# Show the current workflow file content
cat .github/workflows/test_scala_and_python.yaml
# Check recent changes to the workflow file
git log -p .github/workflows/test_scala_and_python.yaml -n 3
# List all scala test files to understand the test suite size
fd ".scala" --type f --glob "**/test/**"
# Check for any OOM or memory-related issues in recent commits
git log --grep="OOM\|memory\|heap" -i --since="6 months ago"
Length of output: 5150
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (14)
online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala (5)
7-9
: Remove extra empty line.
There's an unnecessary extra empty line after the class declaration.

```diff
 class AssignIntervalsTest extends AnyFlatSpec with Matchers {
-
```
10-10
: Improve test description for clarity.
The current description "assign weights into intervals between breaks" is too vague. Consider making it more specific about what's being tested and the expected behavior.

```diff
-  "assignment" should "assign weights into intervals between breaks" in {
+  "AssignIntervals.on" should "correctly distribute weights between breakpoints including edge cases with repeated values" in {
```
11-13
: Improve test data readability.
The input arrays could be formatted more clearly to show the relationship between values and their expected grouping.

```diff
-    val ptiles = Array( 1, 4, 6,6,6, 8, 9 )
-    val breaks = Array(0, 1, 2, 3, 5, 6, 7, 8, 9, 10)
+    val ptiles = Array(
+      1,       // falls in first interval
+      4,       // falls in middle
+      6, 6, 6, // repeated values
+      8, 9     // upper range
+    )
+    val breaks = Array(
+      0, 1, 2, 3, 5, 6, 7, 8, 9, 10 // breakpoints defining intervals
+    )
```
14-14
: Remove commented-out code.
The unused interval variable should be removed if it's not needed.

```diff
-    //val interval = 0.25
```
10-23
: Add test documentation.
The test lacks documentation explaining the purpose and the mathematical reasoning behind the expected values. This would help maintainers understand the test case better.
Add a comment block explaining:
- The purpose of the test
- How the expected values are calculated
- The significance of edge cases (repeated values)
```diff
+  /**
+    * Tests the interval assignment logic with various edge cases:
+    * - Single value in an interval (1 in [0,1])
+    * - Multiple repeated values (6,6,6)
+    * - Values near interval boundaries
+    *
+    * Expected values represent the weighted distribution of points
+    * between consecutive breakpoints. For example:
+    * - 1/3 represents one point distributed across 3 possible intervals
+    * - 2.5 represents multiple points (the three 6's) distributed
+    *   between two intervals
+    */
   "assignment" should "assign weights into intervals between breaks" in {
```

online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala (4)
16-16
: Consider renaming the methodon
to a more descriptive nameThe method name
on
may not clearly convey its purpose. Renaming it to something more descriptive, such asassignIntervals
orcomputeIntervalAssignments
, would enhance readability and maintainability.
17-21
: Use more descriptive variable names for clarityVariables like
i
,bIndex
,bLen
, andpLen
could be renamed to more descriptive names to improve code readability. For example:
i
→ptileIndex
bIndex
→breakIndex
bLen
→breaksLength
pLen
→ptilesLength
Apply this diff to rename the variables:
-var i = 0 +var ptileIndex = 0 var prev = breaks.head -var bIndex = 1 +var breakIndex = 1 -var bLen = breaks.length +val breaksLength = breaks.length -var pLen = ptiles.length +val ptilesLength = ptiles.length val arr = Array.fill(bLen - 1)(0.0)Ensure to update all occurrences of these variables throughout the method.
16-57
: Refactor to use immutable variables and functional programming styleThe current implementation uses mutable variables (
var
). Refactoring to use immutable variables and functional constructs can improve code clarity and reduce side effects.For example, consider using recursive functions or higher-order functions like
foldLeft
to eliminate mutable state.
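For illustration of that pattern only (not a drop-in rewrite of AssignIntervals.on, whose exact semantics the existing tests define), a foldLeft threads the loop state explicitly instead of mutating vars:

```scala
// Sketch: a var-based running accumulation expressed with foldLeft.
// The accumulator carries (values so far, previous break) as an immutable pair.
val breaks = Array(0.0, 1.0, 2.0, 5.0)

val (widths, _) = breaks.tail.foldLeft((Vector.empty[Double], breaks.head)) {
  case ((acc, prev), b) => (acc :+ (b - prev), b)
}
// widths == Vector(1.0, 1.0, 3.0): interval widths between consecutive breaks
```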
16-57
: Add comprehensive unit tests to cover edge casesTo ensure robustness, add unit tests that cover various edge cases, such as:
- Empty
ptiles
orbreaks
arraysptiles
with consecutive equal valuesbreaks
outside the range ofptiles
- Single-element arrays
- Extremely large or small values
Would you like assistance in writing these unit tests?
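A rough sketch of what such edge-case tests could look like follows; it assumes the empty-input validation proposed elsewhere in this review (an IllegalArgumentException on empty arrays) is adopted, and the second expectation only pins down output length rather than exact values:

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import ai.chronon.online.stats.AssignIntervals

class AssignIntervalsEdgeCasesTest extends AnyFlatSpec with Matchers {

  "AssignIntervals.on" should "reject empty inputs" in {
    an[IllegalArgumentException] should be thrownBy
      AssignIntervals.on(ptiles = Array.empty[Double], breaks = Array(0.0, 1.0))
  }

  it should "produce one weight per interval between breaks" in {
    val breaks = Array(0.0, 1.0, 2.0, 3.0)
    val result = AssignIntervals.on(ptiles = Array(0.5, 1.5, 2.5), breaks = breaks)
    result should have length (breaks.length - 1)
  }
}
```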
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (5)
119-127
: Refactor to avoid mutable variables for better functional style.The use of mutable variables
i
andequalityHits
in thecomputePDF
method can be refactored to a more functional style, enhancing readability and aligning with Scala best practices.Consider refactoring using immutable data structures and functional constructs:
def computePDF(percentiles: Array[Double], breaks: Array[Double]): Array[Mass] = { val n = percentiles.length require(n > 2, "Need at least 3 percentiles to plot a distribution") val interval: Double = 1.0 / (n - 1) breaks.map { break => val indices = percentiles.indices.filter(i => percentiles(i) <= break) val equalityHits = indices.count(i => percentiles(i) == break) val i = indices.lastOption.getOrElse(0) val massValue = (i, equalityHits) match { case (0, _) => 0.0 // before range case (x, 0) if x >= n => 0.0 // after range case (_, e) if e > 1 => (e - 1) * interval // point mass case (x, 1) if x == n => interval / (percentiles(n - 1) - percentiles(n - 2)) // exactly at end case (x, _) => interval / (percentiles(x) - percentiles(x - 1)) // in between } Mass(massValue, equalityHits > 1) } }
42-47
: Enhance documentation forfunctionBuilder
method.The
functionBuilder
method is a higher-order function combining binning and distance functions. Adding scaladoc comments with usage examples would improve understandability.Consider adding documentation like:
/** * Constructs a distance computation function from a binning function and a distance function. * * @param binningFunc Function to bin the comparison data into distributions. * @param distanceFunc Function to compute the distance between distributions. * @tparam T Type of the data being compared. * @return A function that computes the distance given a Comparison[T]. */ def functionBuilder[T](binningFunc: Comparison[T] => Distributions, distanceFunc: Distributions => Double): Comparison[T] => Double = { c => val dists = binningFunc(c) distanceFunc(dists) }
40-41
: Clarify generic type constraints forComparison[T]
.The
Comparison[T]
case class uses a generic typeT
without constraints. Specifying bounds or adding type documentation would help prevent misuse.Consider specifying type requirements or documenting expected types.
case class Comparison[T](previous: T, current: T, timeDelta: Window)
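For example, a scaladoc block along these lines could make the expectations explicit (the field descriptions are inferred from the surrounding code and should be checked against the intended usage):

```scala
/** A pair of snapshots of the same summary, used to compute drift across a window.
  *
  * @param previous  the summary captured at the start of the window
  * @param current   the summary captured at the end of the window
  * @param timeDelta the window separating the two snapshots
  * @tparam T the summary representation being compared, e.g. an Array[Double] of
  *           percentiles or a categorical histogram (illustrative examples)
  */
case class Comparison[T](previous: T, current: T, timeDelta: Window)
```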
49-140
: Add input validation for empty or unequal length arrays.The distance functions assume that input arrays are non-empty and of equal length. Adding validation checks will prevent unexpected runtime errors.
Apply this diff to add validation:
def hellingerDistance(p: Array[Double], q: Array[Double]): Double = { + require(p.nonEmpty && q.nonEmpty, "Input arrays must be non-empty.") + require(p.length == q.length, s"Input arrays must have the same length: p.length=${p.length}, q.length=${q.length}") val (pProbs, qProbs) = computePDFs(p, q) // ... }Repeat similar checks for
populationStabilityIndex
andjensenShannonDivergence
.
6-141
: Implement unit tests for distance metric functions.Currently, there are no unit tests for the distance metric functions, which are critical for ensuring correctness.
Do you want assistance in creating unit tests for these functions? Proper unit tests will help in catching edge cases and validating the implementation.
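If useful, a first sketch could start from basic invariants — assuming the functions are exposed on the DistanceMetrics object in this package with Array[Double] inputs; the tolerance and sample arrays are illustrative:

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import ai.chronon.online.stats.DistanceMetrics._

class DistanceMetricsTest extends AnyFlatSpec with Matchers {

  // Strictly increasing percentile arrays describing two distributions.
  private val p = Array(0.0, 10.0, 20.0, 30.0, 40.0, 50.0)
  private val q = Array(5.0, 15.0, 25.0, 35.0, 45.0, 55.0)

  "jensenShannonDivergence" should "be ~0 when both inputs are the same distribution" in {
    jensenShannonDivergence(p, p) shouldBe 0.0 +- 1e-9
  }

  "hellingerDistance" should "be ~0 for identical inputs and non-negative otherwise" in {
    hellingerDistance(p, p) shouldBe 0.0 +- 1e-9
    hellingerDistance(p, q) should be >= 0.0
  }

  "populationStabilityIndex" should "be non-negative" in {
    populationStabilityIndex(p, q) should be >= 0.0
  }
}
```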
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala
(1 hunks)online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala
(1 hunks)online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala
(0 hunks)
💤 Files with no reviewable changes (1)
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala
🔇 Additional comments (1)
online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala (1)
1-6
: LGTM! Clean and minimal imports.
The package structure and imports are well-organized and contain only the necessary dependencies.
```scala
    expected.zip(result).foreach{
      case (e, r) => println(s"exp: $e res: $r")
        r shouldEqual e
    }
```
🛠️ Refactor suggestion
Remove debug print statements and improve assertion readability.
Production tests should not contain debug print statements. Also, the assertion could be more readable using ScalaTest's built-in collection matchers.
- expected.zip(result).foreach{
- case (e, r) => println(s"exp: $e res: $r")
- r shouldEqual e
- }
+ result should have length expected.length
+ result shouldBe expected
```scala
  def on(ptiles: Array[Double], breaks: Array[Double]): Array[Double] = {
    var i = 0
    var prev = breaks.head
    var bIndex = 1
    val bLen = breaks.length
    val pLen = ptiles.length
    val arr = Array.fill(bLen - 1)(0.0)
```
Add input validation for empty arrays
Currently, the method assumes that ptiles and breaks arrays are non-empty. Accessing breaks.head and breaks(0) without checks can lead to exceptions if breaks is empty. Similarly, operations on ptiles may fail if it's empty. Add input validation to handle empty arrays gracefully.
Apply this diff to add input validation:
def on(ptiles: Array[Double], breaks: Array[Double]): Array[Double] = {
+ if (breaks.isEmpty || ptiles.isEmpty) {
+ throw new IllegalArgumentException("Input arrays 'breaks' and 'ptiles' must not be empty")
+ }
var i = 0
var prev = breaks.head
var bIndex = 1
val bLen = breaks.length
val pLen = ptiles.length
val arr = Array.fill(bLen - 1)(0.0)
```scala
      // fraction of the equally spaced interval between p(i-1) to p(i) to assign to this break
      val fraction =
        if (i == 0) 0.0
        else if (ptiles(i) == ptiles(i - 1)) 1.0
        else (ptiles(i) - prev) / (ptiles(i) - ptiles(i - 1))

      result += fraction
      prev = ptiles(i)
      i += 1
    }

    // residual fraction to assign
    val fraction =
      if (i <= 0 || i >= pLen) 0.0
      else (b - prev) / (ptiles(i) - ptiles(i - 1))
```
Protect against potential division by zero in fractional calculations
In the calculations at lines 39 and 49, there's a risk of division by zero if ptiles(i) - ptiles(i - 1) equals zero due to floating-point precision errors. Even though you check for exact equality at line 38, using a small epsilon value for comparison accounts for floating-point inaccuracies.
Modify the comparisons to include an epsilon:
val fraction =
if (i == 0) 0.0
- else if (ptiles(i) == ptiles(i - 1)) 1.0
+ else if (math.abs(ptiles(i) - ptiles(i - 1)) < 1e-10) 1.0
else (ptiles(i) - prev) / (ptiles(i) - ptiles(i - 1))
Similarly, adjust the calculation at line 49:
val fraction =
if (i <= 0 || i >= pLen) 0.0
+ else if (math.abs(ptiles(i) - ptiles(i - 1)) < 1e-10) 1.0
else (b - prev) / (ptiles(i) - ptiles(i - 1))
```scala
  def klDivergence(p: Array[Double], q: Array[Double]): Double = {
    require(p.length == q.length, s"Inputs are of different length ${p.length}, ${q.length}")
    var i = 0
    var result = 0.0
    while (i < p.length) {
      val inc = if (p(i) > 0 && q(i) > 0) p(i) * math.log(p(i) / q(i)) else 0
      result += inc
      i += 1
    }
    result
  }
```
Correct the handling of zero probabilities in KL divergence calculation.
In the klDivergence
function, when q(i)
is zero and p(i)
is not zero, the divergence should approach infinity, as logarithm of zero is undefined. The current implementation sets the increment to zero in such cases, which is mathematically inaccurate.
Apply this diff to properly handle zero probabilities:
def klDivergence(p: Array[Double], q: Array[Double]): Double = {
require(p.length == q.length, s"Inputs are of different length ${p.length}, ${q.length}")
var i = 0
var result = 0.0
while (i < p.length) {
- val inc = if (p(i) > 0 && q(i) > 0) p(i) * math.log(p(i) / q(i)) else 0
+ val inc = (p(i), q(i)) match {
+ case (pi, qi) if pi > 0 && qi > 0 =>
+ pi * math.log(pi / qi)
+ case (pi, 0) if pi > 0 =>
+ Double.PositiveInfinity
+ case _ => 0.0
+ }
result += inc
i += 1
}
result
}
Alternatively, consider adding a small epsilon to q(i)
to prevent division by zero:
val epsilon = 1e-10
val adjustedQi = q(i) + epsilon
```scala
  // TODO move this to unit test
  def main(args: Array[String]): Unit = {
    val A = Array(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200).map(
      _.toDouble)
    val B = Array(5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 115, 115, 115, 135, 145, 155, 165, 1175, 1205, 1205, 1205).map(
      _.toDouble)

    val jsd = jensenShannonDivergence(A, B)
    println(f"The Jensen-Shannon Divergence between distributions A and B is: $jsd%.5f")
    val psi = populationStabilityIndex(A, B)
    println(f"The Population Stability Index between distributions A and B is: $psi%.5f")
    val hd = hellingerDistance(A, B)
    println(f"The Hellinger Distance between distributions A and B is: $hd%.5f")

    // format: off
    // aligned vertically for easier reasoning
    val ptiles = Array( 1, 4, 6,6,6, 8, 9 )
    val breaks = Array(0, 1, 2, 3, 5, 6, 7, 8, 9, 10)
    // format: on

    //val interval = 0.25
    val expected = Array(0.0, 1.0 / 3.0, 1.0 / 3.0, (1.0) / (3.0) + (1.0) / (2.0), (1.0) / (2.0), 2.5, 0.5, 1, 0)

    val result = AssignIntervals.on(ptiles = ptiles.map(_.toDouble), breaks = breaks.map(_.toDouble))

    expected.zip(result).foreach { case (e, r) => println(s"exp: $e res: $r") }

  }
```
🛠️ Refactor suggestion
Move test code from main
method to dedicated unit tests.
The main
method is being used for testing purposes, as indicated by the TODO comment on line 8. It's best practice to move test code into unit tests to enhance maintainability and allow automated testing frameworks to pick them up.
Apply this diff to remove the main
method and consider implementing unit tests instead:
- // TODO move this to unit test
- def main(args: Array[String]): Unit = {
- val A = Array(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200).map(
- _.toDouble)
- val B = Array(5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 115, 115, 115, 135, 145, 155, 165, 1175, 1205, 1205, 1205).map(
- _.toDouble)
-
- val jsd = jensenShannonDivergence(A, B)
- println(f"The Jensen-Shannon Divergence between distributions A and B is: $jsd%.5f")
- val psi = populationStabilityIndex(A, B)
- println(f"The Population Stability Index between distributions A and B is: $psi%.5f")
- val hd = hellingerDistance(A, B)
- println(f"The Hellinger Distance between distributions A and B is: $hd%.5f")
-
- // Additional testing code...
- }
Committable suggestion skipped: line range outside the PR's diff.
```scala
      def indexMass(i: Int): Double = interval / (percentiles(i) - percentiles(i - 1))
      val isPointMass = eh > 1
      val m = (i, eh) match {
        case (0, _)           => 0.0 // before range
        case (x, 0) if x >= n => 0.0 // after range
        case (_, e) if e > 1  => (e - 1) * interval // point mass
        case (x, 1) if x == n => indexMass(n - 1) // exactly at end of range
        case (x, _)           => indexMass(x) // somewhere in between
      }
```
Prevent division by zero in indexMass
function.
In indexMass
, there's a potential for division by zero when percentiles(i) == percentiles(i - 1)
. This would cause a ArithmeticException
.
Apply this diff to handle the edge case:
def indexMass(i: Int): Double = {
val denominator = percentiles(i) - percentiles(i - 1)
- interval / denominator
+ if (denominator != 0) interval / denominator else 0.0
}
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (1)
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (1)
74-101
: Optimize categorical distance calculation.
The current implementation iterates over both histograms separately. Consider using a more efficient approach with a single pass over the combined keyset.
Apply this diff:
```diff
 def categoricalDistance(a: Histogram, b: Histogram, metric: DriftMetric): Double = {
-  val aIt = a.entrySet().iterator()
   var result = 0.0
   val func = termFunc(metric)
-  while (aIt.hasNext) {
-    val entry = aIt.next()
-    val key = entry.getKey
-    val aVal = entry.getValue.toDouble
-    val bValOpt = b.get(key)
-    val bVal = if (bValOpt == null) 0.0 else bValOpt.toDouble
-    val term = func(aVal, bVal)
-    result += term
-  }
-
-  val bIt = b.entrySet().iterator()
-  while (bIt.hasNext) {
-    val entry = bIt.next()
-    val key = entry.getKey
-    val bVal = entry.getValue.toDouble
-    val aValOpt = a.get(key)
-    if (aValOpt == null) {
-      result += func(0.0, bVal)
-    }
-  }
+  // note: converting each java Set separately; requires import scala.jdk.CollectionConverters._
+  val allKeys = a.keySet().asScala ++ b.keySet().asScala
+  allKeys.foreach { key =>
+    val aVal = Option(a.get(key)).map(_.toDouble).getOrElse(0.0)
+    val bVal = Option(b.get(key)).map(_.toDouble).getOrElse(0.0)
+    result += func(aVal, bVal)
+  }
   result
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala
(1 hunks)
🔇 Additional comments (2)
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (2)
10-23
: Move test code to dedicated unit tests.
The existing TODO comment and test code in the main method should be moved to proper unit tests.
104-106
: Correct the handling of zero probabilities in KL divergence calculation.
The current implementation of klDivergenceTerm doesn't properly handle the case where q is zero and p is non-zero.
```scala
      val key = entry.getKey
      val aVal = entry.getValue.toDouble
      val bValOpt = b.get(key)
      val bVal = if (bValOpt == null) bValOpt.toDouble else 0.0
```
Fix incorrect bVal calculation logic.
The bVal calculation is incorrect. It's using the null value instead of the actual value.
Apply this diff:
- val bVal = if (bValOpt == null) bValOpt.toDouble else 0.0
+ val bVal = if (bValOpt == null) 0.0 else bValOpt.toDouble
```scala
  private def normalizeInplace(arr: Array[Double]): Array[Double] = {
    val sum = arr.sum
    var i = 0
    while (i < arr.length) {
      arr.update(i, arr(i) / sum)
      i += 1
    }
    arr
  }
```
Handle edge cases in normalizeInplace function.
The function has two potential issues:
- No handling of zero sum case which could lead to division by zero
- Modifies array in-place but this isn't clear from the function name
Apply this diff to improve the implementation:
- private def normalizeInplace(arr: Array[Double]): Array[Double] = {
+ private def normalizeArrayInplace(arr: Array[Double]): Array[Double] = {
val sum = arr.sum
+ if (sum == 0.0) return arr
var i = 0
while (i < arr.length) {
arr.update(i, arr(i) / sum)
i += 1
}
arr
}
```scala
  def percentileDistance(a: Array[Double], b: Array[Double], metric: DriftMetric): Double = {
    val breaks = (a ++ b).sorted.distinct
    val aProjected = AssignIntervals.on(a, breaks)
    val bProjected = AssignIntervals.on(b, breaks)

    val aNormalized = normalizeInplace(aProjected)
    val bNormalized = normalizeInplace(bProjected)

    val func = termFunc(metric)

    var i = 0
    var result = 0.0
    while (i < aNormalized.length) {
      result += func(aNormalized(i), bNormalized(i))
      i += 1
    }
    result
  }
```
Add input validation in percentileDistance.
The method should validate input arrays before processing.
Apply this diff:
def percentileDistance(a: Array[Double], b: Array[Double], metric: DriftMetric): Double = {
+ require(a != null && b != null, "Input arrays cannot be null")
+ require(a.nonEmpty && b.nonEmpty, "Input arrays cannot be empty")
val breaks = (a ++ b).sorted.distinct
val aProjected = AssignIntervals.on(a, breaks)
val bProjected = AssignIntervals.on(b, breaks)
```scala
  def distance(a: TileSummaries, b: TileSummaries, metric: DriftMetric): java.lang.Double = {
    require(a.isSetPercentiles == b.isSetPercentiles, "Percentiles should be either set or unset together")
    require(a.isSetHistogram == b.isSetHistogram, "Histograms should be either set or unset together")

    val isContinuous = a.isSetPercentiles && b.isSetPercentiles
    val isCategorical = a.isSetHistogram && b.isSetHistogram
    if (isContinuous)
      percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric)
    else if (isCategorical)
      categoricalDistance(a.getHistogram, b.getHistogram, metric)
    else
      null
  }
```
Replace null with Option in distance method.
Returning null is unsafe in Scala. Consider using Option instead.
Apply this diff:
- def distance(a: TileSummaries, b: TileSummaries, metric: DriftMetric): java.lang.Double = {
+ def distance(a: TileSummaries, b: TileSummaries, metric: DriftMetric): Option[Double] = {
require(a.isSetPercentiles == b.isSetPercentiles, "Percentiles should be either set or unset together")
require(a.isSetHistogram == b.isSetHistogram, "Histograms should be either set or unset together")
val isContinuous = a.isSetPercentiles && b.isSetPercentiles
val isCategorical = a.isSetHistogram && b.isSetHistogram
if (isContinuous)
- percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric)
+ Some(percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric))
else if (isCategorical)
- categoricalDistance(a.getHistogram, b.getHistogram, metric)
+ Some(categoricalDistance(a.getHistogram, b.getHistogram, metric))
else
- null
+ None
}
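If the Option-returning signature is adopted, call sites would then handle the missing case explicitly, roughly like this (the enclosing object of distance and the metric choice are placeholders):

```scala
// Hypothetical call site after the change.
def reportDrift(previous: TileSummaries, current: TileSummaries): Unit =
  distance(previous, current, DriftMetric.JENSEN_SHANNON) match {
    case Some(d) => println(f"drift = $d%.4f")
    case None    => println("summaries not comparable: neither percentiles nor histograms set on both")
  }
```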
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (4)
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (4)

235-240: Ensure the match in termFunc is exhaustive. The termFunc function does not handle unexpected DriftMetric values, which could lead to a MatchError if a new metric is introduced. Consider adding a default case to handle unsupported metrics.

Apply this diff to make the match exhaustive:
def termFunc(d: DriftMetric): (Double, Double) => Double = d match {
  case DriftMetric.PSI => psiTerm
  case DriftMetric.HELLINGER => hellingerTerm
  case DriftMetric.JENSEN_SHANNON => jsdTerm
+ case _ => throw new IllegalArgumentException(s"Unsupported DriftMetric: $d")
}

138-138: Manage debug code appropriately before merging. There are debug statements and comments indicating they should be removed before merging. Instead of removing them outright, consider using a proper logging framework with configurable log levels to manage debug output effectively.

Replace the println statements with logger calls:
+ import org.slf4j.LoggerFactory
object DistanceMetrics {
+ private val logger = LoggerFactory.getLogger(DistanceMetrics.getClass)
  // TODO move this to unit test
  def main(args: Array[String]): Unit = {
    // ...
    if (debug) {
-     println(f"""
+     logger.debug(f"""
        |aProjected : ${printArr(aProjected)}
        |bProjected : ${printArr(bProjected)}
        |aNormalized: ${printArr(aNormalized)}
        |bNormalized: ${printArr(bNormalized)}
        |deltas : ${printArr(deltas)}
        |result : $result%.4f
        |""".stripMargin)
    }
    // ...
  }
}
Ensure you have the appropriate logging library included in your project's dependencies.

Also applies to: 146-146, 153-164

19-22: Parameterize magic numbers for clarity in buildPercentiles. The hardcoded probability points 0.01 and 0.99 could be defined as constants or parameters to improve code readability and maintainability.

Apply this diff to define constants:
def buildPercentiles(mean: Double, variance: Double, breaks: Int = 20): Array[Double] = {
+ val MIN_PROB = 0.01
+ val MAX_PROB = 0.99
  val stdDev = math.sqrt(variance)
  // Create probability points from 0.01 to 0.99 instead of 0 to 1
  val probPoints = (0 to breaks).map { i =>
-   if (i == 0) 0.01 // p1 instead of p0
-   else if (i == breaks) 0.99 // p99 instead of p100
+   if (i == 0) MIN_PROB // p1 instead of p0
+   else if (i == breaks) MAX_PROB // p99 instead of p100
    else i.toDouble / breaks
  }.toArray
  // ...
}

11-12: Consider moving testing code to dedicated unit tests. The main method and compareDistributions function are being used for testing purposes. It's best practice to move such code into unit tests to enhance maintainability and allow automated testing frameworks to execute them. Would you like assistance in creating unit tests for these functions? A minimal skeleton is sketched right after this list of nitpicks.

Also applies to: 45-90
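To make the last nitpick concrete, here is a minimal ScalaTest sketch, not a drop-in test: it assumes compareDistributions keeps the signature shown later in this review (returning Map[DriftMetric, (Double, Double)]), that it lives on a DistanceMetrics object in ai.chronon.online.stats, and that DriftMetric comes from the generated ai.chronon.api package. Adjust names to the final API.

import ai.chronon.api.DriftMetric
import ai.chronon.online.stats.DistanceMetrics
import org.scalatest.funsuite.AnyFunSuite
import org.scalatest.matchers.should.Matchers

class DistanceMetricsSpec extends AnyFunSuite with Matchers {

  test("identical distributions produce near-zero drift") {
    // same mean and variance on both sides => percentile and histogram drift should vanish
    val drifts = DistanceMetrics.compareDistributions(meanA = 50.0, varianceA = 100.0,
                                                      meanB = 50.0, varianceB = 100.0)
    val (jsdPercentile, jsdHisto) = drifts(DriftMetric.JENSEN_SHANNON)
    jsdPercentile shouldBe 0.0 +- 1e-6
    jsdHisto shouldBe 0.0 +- 1e-6
  }

  test("a shifted mean produces more drift than identical distributions") {
    val same = DistanceMetrics.compareDistributions(100.0, 225.0, 100.0, 225.0)(DriftMetric.JENSEN_SHANNON)._1
    val shifted = DistanceMetrics.compareDistributions(100.0, 225.0, 110.0, 225.0)(DriftMetric.JENSEN_SHANNON)._1
    shifted should be > same
  }
}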
val aSum = sumValues(a)
val bSum = sumValues(b)
Prevent division by zero in the histogramDistance function

In the histogramDistance function, if aSum or bSum is zero, a division by zero will occur when normalizing histogram values. Add checks to ensure that aSum and bSum are not zero before performing divisions, to prevent potential runtime exceptions.

Apply this diff to handle zero sums:
val aSum = sumValues(a)
val bSum = sumValues(b)
+ if (aSum == 0.0 || bSum == 0.0) {
+ // Handle zero sum, e.g., return maximum distance or raise an exception
+ return Double.PositiveInfinity
+ }
val aIt = a.entrySet().iterator()
var result = 0.0
val func = termFunc(metric)
while (aIt.hasNext) {
// ...
}
val bIt = b.entrySet().iterator()
while (bIt.hasNext) {
// ...
}
Also applies to: 192-205
private def normalize(arr: Array[Double]): Array[Double] = {
  // TODO-OPTIMIZATION: normalize in place instead if this is a hotspot
  val result = Array.ofDim[Double](arr.length)
  val sum = arr.sum
  var i = 0
  while (i < arr.length) {
    result.update(i, arr(i) / sum)
    i += 1
  }
  result
}
Handle division by zero in the normalize function

In the normalize function, if the sum of the array elements is zero, a division by zero will occur when dividing each element by sum. Consider adding a check to handle the case when sum is zero, to avoid potential runtime exceptions.

Apply this diff to handle the zero-sum case:
def normalize(arr: Array[Double]): Array[Double] = {
// TODO-OPTIMIZATION: normalize in place instead if this is a hotspot
val result = Array.ofDim[Double](arr.length)
val sum = arr.sum
+ if (sum == 0.0) {
+ // Handle zero sum, e.g., return zeroed array or raise an exception
+ return result
+ }
var i = 0
while (i < arr.length) {
result.update(i, arr(i) / sum)
i += 1
}
result
}
Actionable comments posted: 7
🧹 Outside diff range and nitpick comments (8)
online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (3)
13-21
: Consider adding null checks for Java List conversions.The conversion methods could throw NullPointerException if the input List or any of its elements are null. Consider adding null checks or using Option for safer handling.
@inline private def toArray(l: java.util.List[java.lang.Double]): Array[Double] = { + if (l == null) return Array.empty[Double] l.iterator().toScala.map(_.toDouble).toArray } @inline private def toArray(l: java.util.List[java.lang.Integer]): Array[Int] = { + if (l == null) return Array.empty[Int] l.iterator().toScala.map(_.toInt).toArray }
23-24
: Consider enforcing temporal order through parameter naming.The comment indicates 'a is after, b is before', but this convention isn't enforced through parameter names. Consider renaming parameters to make the temporal relationship explicit.
-def between(a: TileSummaries, b: TileSummaries, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = { +def between(after: TileSummaries, before: TileSummaries, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = {
23-80
: Add documentation for the between method.The method would benefit from comprehensive ScalaDoc documentation explaining:
- Parameter descriptions and requirements
- Return value contents
- Calculation methodology
- Error conditions
/** Calculates drift metrics between two TileSummaries instances. * * @param a The "after" TileSummaries instance * @param b The "before" TileSummaries instance * @param metric The drift metric to use for calculations (defaults to JENSEN_SHANNON) * @return A TileDrift instance containing all calculated drift metrics * @throws IllegalArgumentException if input parameters are null or invalid */online/src/test/scala/ai/chronon/online/test/stats/DistanceMetricsTest.scala (2)
13-26
: Add documentation for the statistical approach.The
buildPercentiles
method uses a statistical approach to generate percentile points. Consider adding documentation to explain:
- Why 0.01 and 0.99 are used as boundary probabilities
- The statistical theory behind the normal distribution transformation
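For reference, the transformation those two points are asking to document can be written down compactly. The sketch below is illustrative rather than a proposed change, and assumes the inverseErf helper already defined in this test file:

// Quantile of a normal distribution N(mean, variance) at probability p:
//   q(p) = mean + sqrt(variance) * sqrt(2) * erfInv(2p - 1)
// q(0.0) and q(1.0) are -inf/+inf, which is why the endpoints are clamped to 0.01 and 0.99.
def normalQuantile(mean: Double, variance: Double, p: Double): Double = {
  require(p > 0.0 && p < 1.0, "p must lie strictly inside (0, 1)")
  mean + math.sqrt(variance) * math.sqrt(2.0) * inverseErf(2.0 * p - 1.0)
}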
28-37
: Consider improving the inverse error function implementation.The current implementation:
- Uses a magic number (0.147) without explanation
- Could benefit from a more robust implementation or using an established library
Consider:
- Adding documentation explaining the approximation method
- Using Apache Commons Math or other statistical libraries for a more robust implementation
- Adding validity checks for input range (-1 < x < 1)
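If the Apache Commons Math route is taken, the replacement could be as small as the sketch below. It assumes a commons-math3 dependency, which is not confirmed to be in build.sbt by anything shown in this review:

import org.apache.commons.math3.special.Erf

// Drop-in replacement for the hand-rolled approximation; erfInv is accurate to near
// machine precision and removes the unexplained 0.147 constant.
def inverseErf(x: Double): Double = {
  require(x > -1.0 && x < 1.0, "inverseErf is only defined on (-1, 1)")
  Erf.erfInv(x)
}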
api/thrift/api.thrift (2)
268-274
: Enhance struct documentation.The current comment could be more descriptive. Consider adding:
- Purpose and use cases of DriftParams
- Description of each field
- Example usage
Add documentation like this:
-// service takes this, along with column & slice and produces a series of tileDrift objects +/** + * Parameters for calculating drift metrics between two time periods. + * + * @field metric: The statistical method used to measure drift (see DriftMetric enum for thresholds) + * @field lookbacks: List of time windows to analyze drift over + * @field startMillis: Start timestamp in milliseconds for the analysis period + * @field endMillis: End timestamp in milliseconds for the analysis period + * + * Example: + * { + * metric: JENSEN_SHANNON, + * lookbacks: [{length: 7, timeUnit: DAYS}], + * startMillis: 1635724800000, // 2021-11-01 + * endMillis: 1638316800000 // 2021-12-01 + * } + */
276-287
: Add comprehensive documentation.The struct lacks documentation explaining the purpose and interpretation of each drift metric.
Add documentation like this:
+/** + * Represents various drift metrics computed between two TileSummaries. + * Each metric captures a different aspect of distribution change: + * + * Basic Distribution Metrics: + * @field percentileDrift: Drift in the distribution of values based on percentile comparison + * @field histogramDrift: Drift in the frequency distribution of values + * @field countDrift: Relative change in the total count of records + * @field nullRatioDrift: Change in the proportion of null values + * + * Container Type Metrics: + * @field innerCountDrift: Drift in the total number of entries within container types + * @field innerNullCountDrift: Drift in the number of null entries within container types + * @field lengthPercentilesDrift: Drift in the distribution of container lengths + * + * String Type Metrics: + * @field stringLengthPercentilesDrift: Drift in the distribution of string lengths + * + * Note: All drift values are between 0 and 1, where: + * - 0 indicates no drift (distributions are identical) + * - 1 indicates maximum drift (distributions are completely different) + */online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (1)
95-104
: Optimizenormalize
function by normalizing in place.The comment in the code suggests optimizing the
normalize
function by normalizing in place if it's a performance hotspot. Modifying the array in place can reduce memory usage and improve performance.Consider applying this change:
- private def normalize(arr: Array[Double]): Array[Double] = { - // TODO-OPTIMIZATION: normalize in place instead if this is a hotspot - val result = Array.ofDim[Double](arr.length) + private def normalizeInPlace(arr: Array[Double]): Array[Double] = { + val sum = arr.sum + if (sum == 0.0) return arr var i = 0 while (i < arr.length) { - result.update(i, arr(i) / sum) + arr.update(i, arr(i) / sum) i += 1 } - result + arr }Note: Ensure that modifying the input array in place does not have unintended side effects elsewhere in the codebase.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
api/thrift/api.thrift
(1 hunks)online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala
(1 hunks)online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
(1 hunks)online/src/test/scala/ai/chronon/online/test/stats/DistanceMetricsTest.scala
(1 hunks)
🔇 Additional comments (6)
online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1)
1-10
: LGTM! Well-organized imports and package structure.
The imports are properly organized and specific to the requirements. Good use of ScalaJavaConversions for Java-Scala interoperability.
online/src/test/scala/ai/chronon/online/test/stats/DistanceMetricsTest.scala (2)
1-10
: LGTM! Package structure and imports are well organized.
The imports are specific and properly structured for the test requirements.
1-128
: Verify consistency with other drift metric implementations.
As this is part of a WIP PR for drift metrics, let's verify the consistency of implementations across the codebase.
✅ Verification successful
Let me analyze the implementations further by checking the actual code in the related files.
Test implementation aligns with production code
The verification shows that the test implementation in DistanceMetricsTest
correctly validates the behavior of the production code:
- The thresholds used in tests (0.05 for low-moderate, 0.15 for moderate-severe) match the defined thresholds in
DistanceMetrics.thresholds
- The test covers both
percentileDistance
andhistogramDistance
methods which are used in production byTileDriftCalculator
- The test cases validate different drift scenarios (low, moderate, severe) that align with the production thresholds
- The implementation consistently uses the same drift metrics (JENSEN_SHANNON, HELLINGER, PSI) across test and production code
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for other drift metric implementations and thresholds
echo "Searching for drift metric implementations..."
rg -l "DriftMetric" --type scala
echo "\nSearching for threshold definitions..."
rg "val.*threshold.*=.*0\.(05|15)" --type scala
echo "\nSearching for other statistical distance calculations..."
rg "(percentileDistance|histogramDistance)" --type scala
Length of output: 2533
Script:
#!/bin/bash
# Check the implementation of DistanceMetrics
echo "=== DistanceMetrics.scala ==="
cat online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala
echo -e "\n=== TileDriftCalculator.scala ==="
cat online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
Length of output: 7789
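As a side note, the 0.05 / 0.15 thresholds referenced above lend themselves to a small severity helper. The Severity labels below are hypothetical and only illustrate how those thresholds bucket a drift score; they are not part of the PR:

sealed trait Severity
object Severity {
  case object Low extends Severity
  case object Moderate extends Severity
  case object Severe extends Severity

  // lowToModerate / moderateToSevere mirror Thresholds(0.05, 0.15) used for JSD and Hellinger
  def of(drift: Double, lowToModerate: Double = 0.05, moderateToSevere: Double = 0.15): Severity =
    if (drift < lowToModerate) Low
    else if (drift < moderateToSevere) Moderate
    else Severe
}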
online/src/main/scala/ai/chronon/online/stats/DistanceMetrics.scala (3)

94-104: Handle potential division by zero in the normalize function.
In the normalize function, if the sum of the array elements is zero, dividing by sum will result in a division by zero error. Please add a check to handle the case when sum is zero, to prevent potential runtime exceptions.

62-63: Prevent division by zero in the histogramDistance function.
In the histogramDistance function, if aSum or bSum is zero, a division by zero will occur when normalizing histogram values. Ensure that aSum and bSum are checked for zero before performing divisions, to avoid potential errors.

107-109: Correct the handling of zero probabilities in the KL divergence calculation.
In the klDivergenceTerm function, when b is zero and a is not zero, the divergence should approach infinity, as the logarithm of zero is undefined. The current implementation sets the term to zero in such cases, which is mathematically inaccurate.
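One way to address this, sketched below, is to smooth zero probabilities with a small epsilon (the same trick the PSI term already uses) rather than silently returning zero; treating the term as infinite is the other mathematically faithful option. This is a suggestion, not what the PR currently implements:

// Sketch only: epsilon smoothing keeps the KL term finite but non-zero when b == 0.
def klDivergenceTerm(a: Double, b: Double, epsilon: Double = 1e-5): Double = {
  if (a == 0.0) 0.0 // lim_{a -> 0} a * log(a / b) = 0, so zero is correct here
  else {
    val bSafe = math.max(b, epsilon) // avoids log of a division by zero
    a * math.log(a / bSafe)
  }
}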
if (a.isSetPercentiles && b.isSetPercentiles) {
  val drift = percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric)
  result.setPercentilDrift(drift)
}
Add validation for percentile array lengths.
When calculating percentile distance, there's no validation that both arrays have matching lengths, which could lead to incorrect calculations or index out of bounds errors.
if (a.isSetPercentiles && b.isSetPercentiles) {
+ val aPercentiles = toArray(a.getPercentiles)
+ val bPercentiles = toArray(b.getPercentiles)
+ require(aPercentiles.length == bPercentiles.length, "Percentile arrays must have matching lengths")
- val drift = percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric)
+ val drift = percentileDistance(aPercentiles, bPercentiles, metric)
result.setPercentilDrift(drift)
}
if (a.isSetCount && b.isSetCount && b.getCount != 0.0) {
  // preserving sign, since direction of change is usually relevant
  val change = (a.getCount - b.getCount).toDouble / b.getCount
  result.setCountChangePercent(change * 100)
}

if (a.isSetNullCount && b.isSetNullCount && b.getNullCount != 0) {
  // preserving sign, since direction of change is usually relevant
  val aNullRatio = a.getNullCount.toDouble / a.getCount.toDouble
  val bNullRatio = b.getNullCount.toDouble / b.getCount.toDouble
  val change = (aNullRatio - bNullRatio) / bNullRatio
  result.setNullRatioChangePercent(change * 100)
}

if (a.isSetInnerCount && b.isSetInnerCount && b.getInnerCount != 0.0) {
  // preserving sign, since direction of change is usually relevant
  val change = (a.getInnerCount - b.getInnerCount).toDouble / b.getInnerCount
  result.setInnerCountChangePercent(change * 100)
}

if (a.isSetInnerNullCount && b.isSetInnerNullCount && b.getInnerNullCount != 0.0) {
  // preserving sign, since direction of change is usually relevant
  val aNullRatio = a.getInnerNullCount.toDouble / a.getInnerCount.toDouble
  val bNullRatio = b.getInnerNullCount.toDouble / b.getInnerCount.toDouble
  val change = (aNullRatio - bNullRatio) / bNullRatio
  result.setInnerNullCountChangePercent(change * 100)
}

if (a.isSetLengthPercentiles && b.isSetLengthPercentiles) {
  val drift = percentileDistance(toArray(a.getLengthPercentiles).map(_.toDouble),
                                 toArray(b.getLengthPercentiles).map(_.toDouble),
                                 metric)
  result.setLengthPercentilesDrift(drift)
}

if (a.isSetStringLengthPercentiles && b.isSetStringLengthPercentiles) {
  val drift = percentileDistance(toArray(a.getStringLengthPercentiles).map(_.toDouble),
                                 toArray(b.getStringLengthPercentiles).map(_.toDouble),
                                 metric)
  result.setStringLengthPercentilesDrift(drift)
}
🛠️ Refactor suggestion
Refactor repeated percentage calculation pattern.
The code repeats similar percentage calculation patterns across multiple metrics. Consider extracting this logic into a reusable helper method and introducing a constant for the percentage multiplier.
+private val PERCENTAGE_MULTIPLIER = 100.0
+
+private def calculatePercentageChange(newValue: Double, oldValue: Double): Double = {
+ require(oldValue != 0.0, "Old value cannot be zero")
+ ((newValue - oldValue) / oldValue) * PERCENTAGE_MULTIPLIER
+}
+
+private def calculateRatioChange(
+ newNumerator: Double,
+ newDenominator: Double,
+ oldNumerator: Double,
+ oldDenominator: Double
+): Double = {
+ val newRatio = newNumerator / newDenominator
+ val oldRatio = oldNumerator / oldDenominator
+ calculatePercentageChange(newRatio, oldRatio)
+}
Then use these helper methods:
if (a.isSetCount && b.isSetCount && b.getCount != 0.0) {
- val change = (a.getCount - b.getCount).toDouble / b.getCount
- result.setCountChangePercent(change * 100)
+ result.setCountChangePercent(calculatePercentageChange(a.getCount, b.getCount))
}
if (a.isSetNullCount && b.isSetNullCount && b.getNullCount != 0) {
- val aNullRatio = a.getNullCount.toDouble / a.getCount.toDouble
- val bNullRatio = b.getNullCount.toDouble / b.getCount.toDouble
- val change = (aNullRatio - bNullRatio) / bNullRatio
- result.setNullRatioChangePercent(change * 100)
+ result.setNullRatioChangePercent(calculateRatioChange(
+ a.getNullCount.toDouble, a.getCount.toDouble,
+ b.getNullCount.toDouble, b.getCount.toDouble
+ ))
}
// a is after, b is before
def between(a: TileSummaries, b: TileSummaries, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = {
  val result = new TileDrift()

  if (a.isSetPercentiles && b.isSetPercentiles) {
    val drift = percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric)
    result.setPercentilDrift(drift)
  }

  if (a.isSetHistogram && b.isSetHistogram) {
    val drift = histogramDistance(a.getHistogram, b.getHistogram, metric)
    result.setHistogramDrift(drift)
  }

  if (a.isSetCount && b.isSetCount && b.getCount != 0.0) {
    // preserving sign, since direction of change is usually relevant
    val change = (a.getCount - b.getCount).toDouble / b.getCount
    result.setCountChangePercent(change * 100)
  }

  if (a.isSetNullCount && b.isSetNullCount && b.getNullCount != 0) {
    // preserving sign, since direction of change is usually relevant
    val aNullRatio = a.getNullCount.toDouble / a.getCount.toDouble
    val bNullRatio = b.getNullCount.toDouble / b.getCount.toDouble
    val change = (aNullRatio - bNullRatio) / bNullRatio
    result.setNullRatioChangePercent(change * 100)
  }

  if (a.isSetInnerCount && b.isSetInnerCount && b.getInnerCount != 0.0) {
    // preserving sign, since direction of change is usually relevant
    val change = (a.getInnerCount - b.getInnerCount).toDouble / b.getInnerCount
    result.setInnerCountChangePercent(change * 100)
  }

  if (a.isSetInnerNullCount && b.isSetInnerNullCount && b.getInnerNullCount != 0.0) {
    // preserving sign, since direction of change is usually relevant
    val aNullRatio = a.getInnerNullCount.toDouble / a.getInnerCount.toDouble
    val bNullRatio = b.getInnerNullCount.toDouble / b.getInnerCount.toDouble
    val change = (aNullRatio - bNullRatio) / bNullRatio
    result.setInnerNullCountChangePercent(change * 100)
  }

  if (a.isSetLengthPercentiles && b.isSetLengthPercentiles) {
    val drift = percentileDistance(toArray(a.getLengthPercentiles).map(_.toDouble),
                                   toArray(b.getLengthPercentiles).map(_.toDouble),
                                   metric)
    result.setLengthPercentilesDrift(drift)
  }

  if (a.isSetStringLengthPercentiles && b.isSetStringLengthPercentiles) {
    val drift = percentileDistance(toArray(a.getStringLengthPercentiles).map(_.toDouble),
                                   toArray(b.getStringLengthPercentiles).map(_.toDouble),
                                   metric)
    result.setStringLengthPercentilesDrift(drift)
  }

  result
}
Add input validation and error handling.
The method lacks comprehensive input validation. Consider adding checks for:
- Null input parameters
- Zero denominators in ratio calculations
- Invalid metric types
def between(a: TileSummaries, b: TileSummaries, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = {
+ require(a != null, "After TileSummaries cannot be null")
+ require(b != null, "Before TileSummaries cannot be null")
+ require(metric != null, "DriftMetric cannot be null")
+
val result = new TileDrift()
type Histogram = java.util.Map[String, java.lang.Long]

def compareDistributions(meanA: Double,
                         varianceA: Double,
                         meanB: Double,
                         varianceB: Double,
                         breaks: Int = 20,
                         debug: Boolean = false): Map[DriftMetric, (Double, Double)] = {

  val aPercentiles = buildPercentiles(meanA, varianceA, breaks)
  val bPercentiles = buildPercentiles(meanB, varianceB, breaks)

  val aHistogram: Histogram = (0 to breaks)
    .map { i =>
      val value = java.lang.Long.valueOf((math.abs(aPercentiles(i)) * 100).toLong)
      i.toString -> value
    }
    .toMap
    .toJava

  val bHistogram: Histogram = (0 to breaks)
    .map { i =>
      val value = java.lang.Long.valueOf((math.abs(bPercentiles(i)) * 100).toLong)
      i.toString -> value
    }
    .toMap
    .toJava

  def calculateDrift(metric: DriftMetric): (Double, Double) = {
    val pDrift = percentileDistance(aPercentiles, bPercentiles, metric, debug = debug)
    val histoDrift = histogramDistance(aHistogram, bHistogram, metric)
    (pDrift, histoDrift)
  }

  Map(
    DriftMetric.JENSEN_SHANNON -> calculateDrift(DriftMetric.JENSEN_SHANNON),
    DriftMetric.PSI -> calculateDrift(DriftMetric.PSI),
    DriftMetric.HELLINGER -> calculateDrift(DriftMetric.HELLINGER)
  )
}
🛠️ Refactor suggestion
Add input validation and optimize Java-Scala conversions.
Several improvements could enhance the robustness and performance:
- Input Validation:
require(breaks > 0, "breaks must be positive")
require(varianceA > 0 && varianceB > 0, "variance must be positive")
- Performance:
- Consider using Scala collections throughout instead of converting to Java
- Cache the histogram calculations if used multiple times
test("Low drift - similar distributions") { | ||
val drifts = compareDistributions(meanA = 100.0, varianceA = 225.0, meanB = 101.0, varianceB = 225.0) | ||
|
||
// JSD assertions | ||
val (jsdPercentile, jsdHisto) = drifts(DriftMetric.JENSEN_SHANNON) | ||
jsdPercentile should be < 0.05 | ||
jsdHisto should be < 0.05 | ||
|
||
// Hellinger assertions | ||
val (hellingerPercentile, hellingerHisto) = drifts(DriftMetric.HELLINGER) | ||
hellingerPercentile should be < 0.05 | ||
hellingerHisto should be < 0.05 | ||
} | ||
|
||
test("Moderate drift - slightly different distributions") { | ||
val drifts = compareDistributions(meanA = 100.0, varianceA = 225.0, meanB = 105.0, varianceB = 256.0) | ||
|
||
// JSD assertions | ||
val (jsdPercentile, _) = drifts(DriftMetric.JENSEN_SHANNON) | ||
jsdPercentile should (be >= 0.05 and be <= 0.15) | ||
|
||
// Hellinger assertions | ||
val (hellingerPercentile, _) = drifts(DriftMetric.HELLINGER) | ||
hellingerPercentile should (be >= 0.05 and be <= 0.15) | ||
} | ||
|
||
test("Severe drift - different means") { | ||
val drifts = compareDistributions(meanA = 100.0, varianceA = 225.0, meanB = 110.0, varianceB = 225.0) | ||
|
||
// JSD assertions | ||
val (jsdPercentile, _) = drifts(DriftMetric.JENSEN_SHANNON) | ||
jsdPercentile should be > 0.15 | ||
|
||
// Hellinger assertions | ||
val (hellingerPercentile, _) = drifts(DriftMetric.HELLINGER) | ||
hellingerPercentile should be > 0.15 | ||
} | ||
|
||
test("Severe drift - different variances") { | ||
val drifts = compareDistributions(meanA = 100.0, varianceA = 225.0, meanB = 105.0, varianceB = 100.0) | ||
|
||
// JSD assertions | ||
val (jsdPercentile, _) = drifts(DriftMetric.JENSEN_SHANNON) | ||
jsdPercentile should be > 0.15 | ||
|
||
// Hellinger assertions | ||
val (hellingerPercentile, _) = drifts(DriftMetric.HELLINGER) | ||
hellingerPercentile should be > 0.15 | ||
} |
🛠️ Refactor suggestion
Enhance test coverage and document thresholds.
Current gaps in test coverage:
- Missing test cases for:
- Empty or single-value distributions
- Extreme value scenarios
- PSI metric assertions
- Some tests ignore histogram results without explanation
- Threshold values (0.05, 0.15) should be documented
Consider adding:
test("Edge cases - empty and single-value distributions") {
// Add tests for edge cases
}
test("PSI metric validation") {
val (psiPercentile, psiHisto) = drifts(DriftMetric.PSI)
// Add assertions for PSI
}
api/thrift/api.thrift
Outdated
struct TileDrift {
  1: optional double percentilDrift
  2: optional double histogramDrift
  3: optional double countChangePercent
  4: optional double nullRatioChangePercent

  5: optional double innerCountChangePercent // total of number of entries within all containers of this column
  6: optional double innerNullCountChangePercent
  7: optional double lengthPercentilesDrift

  8: optional double stringLengthPercentilesDrift
}
Fix typo and standardize field naming.
Issues found:
- Typo in field name: percentilDrift should be percentileDrift
- Inconsistent naming pattern: some fields use 'Change' while others use 'Drift'
Apply this diff to fix the issues:
struct TileDrift {
- 1: optional double percentilDrift
+ 1: optional double percentileDrift
- 3: optional double countChangePercent
- 4: optional double nullRatioChangePercent
+ 3: optional double countDrift
+ 4: optional double nullRatioDrift
- 5: optional double innerCountChangePercent
- 6: optional double innerNullCountChangePercent
+ 5: optional double innerCountDrift
+ 6: optional double innerNullCountDrift
}
// debug only, remove before merging
val deltas = Array.ofDim[Double](aNormalized.length)

while (i < aNormalized.length) {
  val ai = aNormalized(i)
  val bi = bNormalized(i)
  val delta = func(ai, bi)

  // debug only remove before merging
  deltas.update(i, delta)

  result += delta
  i += 1
}

if (debug) {
  def printArr(arr: Array[Double]): String =
    arr.map(v => f"$v%.3f").mkString(", ")
  println(f"""
             |aProjected : ${printArr(aProjected)}
             |bProjected : ${printArr(bProjected)}
             |aNormalized: ${printArr(aNormalized)}
             |bNormalized: ${printArr(bNormalized)}
             |deltas : ${printArr(deltas)}
             |result : $result%.4f
             |""".stripMargin)
}
Remove debug code or implement proper logging before merging.
The percentileDistance function contains debug code with println statements and comments indicating removal before merging. It's important to remove this debug code or replace it with appropriate logging, to maintain code cleanliness and prevent unintended console output in production. A minimal logging sketch follows below.
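A possible shape for the logging option, assuming an slf4j-style logger is available in this module (an assumption; swap in whatever logging facade the project standardizes on):

import org.slf4j.LoggerFactory

object DriftDebug {
  private val logger = LoggerFactory.getLogger(getClass)

  // Replaces ad-hoc println debugging: the message is only built when debug logging is enabled.
  def logDeltas(deltas: Array[Double], result: Double): Unit =
    if (logger.isDebugEnabled)
      logger.debug(f"deltas: ${deltas.mkString(", ")} result: $result%.4f")
}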
Largely looks good. Flagging a couple of qs / clean up items
api/thrift/api.thrift
Outdated
}

struct TileDrift {
  1: optional double percentilDrift
nit: s/percentil/percentile ?
var i = 0
var result = 0.0

// debug only, remove before merging
flagging to remove

signX * math.sqrt(term1 - term2)
}
type Histogram = java.util.Map[String, java.lang.Long]
isn't this type already declared in the DistanceMetrics class?
@@ -97,10 +97,10 @@ class LabelJoin(joinConf: api.Join, tableUtils: TableUtils, labelDS: String) {
   }

   labelJoinConf.setups.foreach(tableUtils.sql)
-  val labelTable = compute(PartitionRange(leftStart, leftEnd), stepDays, Option(labelDS))
+  val labelDf = compute(PartitionRange(leftStart, leftEnd), stepDays, Option(labelDS))
can revert?
@@ -78,7 +78,7 @@ object SparkSessionBuilder {
   // use all threads - or the tests will be slow
   .master("local[*]")
   .config("spark.kryo.registrationRequired", s"${localWarehouseLocation.isEmpty}")
-  .config("spark.local.dir", s"/tmp/$userName/$name")
+  .config("spark.local.dir", s"/tmp/$userName/${name}_$warehouseId")
for test flakiness?
Force-pushed from acee9b2 to 4cc9851 (Compare)
Actionable comments posted: 40
🧹 Outside diff range and nitpick comments (41)
hub/app/model/Model.scala (1)
8-14
: Consider performance implications of the proposed architectureThe proposed changes suggest a complex pipeline for processing time series data and calculating drift metrics. While the design appears sound, consider these architectural aspects:
Data Flow Optimization:
- Consider batching the tile key generation and fetching operations
- Implement streaming processing for large time series datasets
- Design the metadata storage for efficient group-by lookups
Scalability Considerations:
- Ensure the 1:n tile fetching can handle large-scale data efficiently
- Consider partitioning strategies for the tile storage
- Plan for horizontal scaling of the drift calculations
Integration Points:
- Design clear boundaries between metadata, tile processing, and controller layers
- Consider implementing a facade pattern to simplify the TimeSeriesController interface
- Plan for versioning of the TileSummaries and TileDrift formats
hub/conf/routes (2)
9-12
: Consider separating drift metrics into a dedicated endpoint.The current design combines different metric types in a single endpoint, which could lead to:
- Unclear parameter requirements for different metric types
- Potential confusion about when
algorithm
is required- Mixed concerns in the controller implementation
Consider splitting into dedicated endpoints:
+ GET /api/v1/join/:name/metrics # for regular metrics + GET /api/v1/join/:name/drift # specifically for drift analysis
18-26
: Review architectural approach for planned endpoints.The current TODO and commented routes raise several concerns:
- All routes point to the same controller method, which could become a complex catch-all handler
- Some endpoints appear to have overlapping functionality (e.g.,
/values
vs/distribution
)- The granularity of endpoints aligns better with Single Responsibility Principle
Would you like help with:
- Designing separate controller methods for each endpoint?
- Creating an architectural document outlining the responsibilities of each endpoint?
- Implementing proper parameter validation for each endpoint type?
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (2)
50-53
: Consider adding error handling for KVStore initializationWhile the switch to
InMemoryKVStore
is good, consider adding error handling for the table creation operation.val kvStore: () => KVStore = () => { val result = OnlineUtils.buildInMemoryKVStore(namespace) - result.create(Constants.DriftStatsTable, props) + try { + result.create(Constants.DriftStatsTable, props) + } catch { + case e: Exception => + throw new RuntimeException(s"Failed to create DriftStatsTable: ${e.getMessage}", e) + } result }
Line range hint
26-110
: Consider restructuring tests using ScalaTest fixturesThe test class could benefit from using ScalaTest fixtures to manage test resources and cleanup. This would provide a more structured approach to handling KVStore initialization and cleanup.
Consider refactoring to use
BeforeAndAfterEach
:class DriftTest extends AnyFlatSpec with Matchers with BeforeAndAfterEach { private var kvStore: KVStore = _ override def beforeEach(): Unit = { super.beforeEach() kvStore = InMemoryKvStore.build(namespace, () => null) kvStore.create(Constants.MetadataDataset) kvStore.create(Constants.DriftStatsTable) } override def afterEach(): Unit = { if (kvStore != null) { kvStore.drop(Constants.MetadataDataset) kvStore.drop(Constants.DriftStatsTable) } super.afterEach() } // ... existing test cases ... }online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala (2)
28-37
: Add documentation for the inverse error function implementation.The
inverseErf
method uses an approximation but lacks documentation about its accuracy and source.Consider adding a documentation comment:
+ /** + * Approximates the inverse error function using the method described in + * [Add citation or reference to the approximation method] + * + * @param x Input value in range [-1, 1] + * @return Approximated inverse error function value + * @note Approximation accuracy: [Add details about accuracy] + */ def inverseErf(x: Double): Double = { + require(x >= -1 && x <= 1, "Input must be in range [-1, 1]") val a = 0.147
79-127
: Enhance test coverage with additional scenarios.The test suite would benefit from additional test cases:
- Edge cases (e.g., very small/large variances)
- Error scenarios (e.g., negative variance)
- Documentation for threshold choices (0.05, 0.15)
Consider adding these test cases:
test("Edge case - very small variance") { val drifts = compareDistributions(meanA = 100.0, varianceA = 0.001, meanB = 100.0, varianceB = 0.001) // Add assertions } test("Should fail with negative variance") { assertThrows[IllegalArgumentException] { compareDistributions(meanA = 100.0, varianceA = -1.0, meanB = 100.0, varianceB = 1.0) } }cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1)
50-53
: Add test coverage for timestamp edge cases.The timestamp handling involves several edge cases that should be tested:
- Events exactly at
startTsMillis
(should be excluded)- Events at
queryTime
(should be included)- Milliseconds to microseconds conversion accuracy
Would you like me to help create test cases for these scenarios? I can generate a test suite that covers these edge cases.
hub/app/store/DynamoDBMonitoringStore.scala (1)
Line range hint
93-143
: Enhance error handling and simplify configuration loading logic.The current implementation has several areas for improvement:
- Silent failures with empty
LoadedConfs()
- Repetitive configuration construction
- Lack of structured error reporting
Consider refactoring to:
- Create an enumeration for configuration types:
sealed trait ConfigType object ConfigType { case object Join extends ConfigType case object GroupBy extends ConfigType case object StagingQuery extends ConfigType case object Model extends ConfigType }
- Simplify the pattern matching:
def parseConfigType(key: String): Option[ConfigType] = { if (key.contains("joins/")) Some(ConfigType.Join) else if (key.contains("group_bys/")) Some(ConfigType.GroupBy) else if (key.contains("staging_queries/")) Some(ConfigType.StagingQuery) else if (key.contains("models/")) Some(ConfigType.Model) else None } private def makeLoadedConfs(response: ListResponse): LoadedConfs = { response.values .map { seqAB => val seqKVStrings = seqAB.map(kv => (new String(kv.keyBytes, StandardCharsets.UTF_8), new String(kv.valueBytes, StandardCharsets.UTF_8))) val (successes, failures) = seqKVStrings.map { case (key, value) => parseConfigType(key).map { configType => try { configType match { case ConfigType.Join => Right(LoadedConfs(joins = Seq(getConf[api.Join](value)))) case ConfigType.GroupBy => Right(LoadedConfs(groupBys = Seq(getConf[api.GroupBy](value)))) case ConfigType.StagingQuery => Right(LoadedConfs(stagingQueries = Seq(getConf[api.StagingQuery](value)))) case ConfigType.Model => Right(LoadedConfs(models = Seq(getConf[api.Model](value)))) } } catch { case e: Exception => Left(s"Failed to parse $configType: ${e.getMessage}") } }.getOrElse(Left(s"Unknown configuration type for key: $key")) }.partition(_.isRight) failures.foreach { case Left(error) => logger.error(error) // Add metrics here to track parsing failures } successes.collect { case Right(conf) => conf } .foldLeft(LoadedConfs()) { (acc, curr) => LoadedConfs( joins = acc.joins ++ curr.joins, groupBys = acc.groupBys ++ curr.groupBys, stagingQueries = acc.stagingQueries ++ curr.stagingQueries, models = acc.models ++ curr.models ) } } .getOrElse(LoadedConfs()) }This refactoring:
- Provides better error tracking and reporting
- Simplifies the configuration loading logic
- Makes it easier to add new configuration types
- Separates concerns between parsing and accumulation
Consider adding metrics to track parsing failures and configuration loading performance:
Metrics.counter("config_parse_failures_total").increment() Metrics.timer("config_loading_duration").record(duration)spark/src/test/scala/ai/chronon/spark/test/InMemoryKvStore.scala (1)
Line range hint
45-120
: Add test coverage for drift metrics scenarios.Since this PR implements drift metrics functionality, please ensure to add test cases that cover:
- Timestamp-based version filtering with various time ranges
- Edge cases in drift calculations
- Concurrent access patterns with drift metrics
This aligns with the PR objectives of adding unit tests.
Would you like help in generating comprehensive test cases for drift metrics scenarios?
spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (2)
Line range hint
142-156
: Improve robustness and documentation of the new streaming method.Several improvements are recommended:
- Replace the hardcoded sleep duration with a configurable parameter
- Add comprehensive documentation explaining:
- The method's purpose and behavior
- Why async handling is needed
- The differences from
putStreaming
- Consider using a more robust approach than
Thread.sleep
+ /** + * Handles streaming data processing using JoinSourceRunner. + * This method is intended to replace putStreaming (see TODO below). + * + * @param originalGroupByConf The original GroupBy configuration to process + * @param ds The dataset identifier + * @param namespace The namespace for the operation + * @param kvStoreFunc Function to generate KVStore instances + * @param debug Enable debug mode + * @param asyncWaitMs Time to wait for async operations (default: 5000) + */ def putStreamingNew(originalGroupByConf: api.GroupBy, ds: String, namespace: String, kvStoreFunc: () => KVStore, - debug: Boolean)(implicit session: SparkSession): Unit = { + debug: Boolean, + asyncWaitMs: Long = 5000)(implicit session: SparkSession): Unit = { implicit val mockApi = new MockApi(kvStoreFunc, namespace) val groupByConf = originalGroupByConf.deepCopy() val source = groupByConf.streamingSource.get mutateTopicWithDs(source, ds) val groupByStreaming = new JoinSourceRunner(groupByConf, Map.empty, debug = debug, lagMillis = 0) val query = groupByStreaming.chainedStreamingQuery.trigger(Trigger.Once()).start() - // drain work scheduled as futures over the executioncontext - Thread.sleep(5000) - // there is async stuff under the hood of chained streaming query + // Wait for async operations in the chained streaming query to complete + Thread.sleep(asyncWaitMs) query.awaitTermination() }Would you like me to create a GitHub issue to track the deprecation of
putStreaming
?
Line range hint
153-171
: Document the conditional streaming logic and address the TODO.The integration of
putStreamingNew
looks correct, but there are two areas for improvement:
- The conditional logic between
putStreamingNew
andputStreaming
should be documented- The TODO about
dropDsOnWrite
needs to be addressedConsider adding a comment explaining the streaming method selection:
if (groupByConf.inferredAccuracy == Accuracy.TEMPORAL && groupByConf.streamingSource.isDefined) { val streamingSource = groupByConf.streamingSource.get inMemoryKvStore.create(groupByConf.streamingDataset) + // Use putStreamingNew for join sources, fallback to putStreaming for other cases if (streamingSource.isSetJoinSource) {
Would you like me to help investigate and document the
dropDsOnWrite
quirk mentioned in the TODO?online/src/main/scala/ai/chronon/online/MetadataStore.scala (1)
Line range hint
231-244
: Consider enhancing the create method with props supportBased on the learnings from PR #47, the
KVStore
trait supports acreate
method with props, but this implementation doesn't utilize it. Consider whether supporting props would be beneficial for future extensibility.-def create(dataset: String): Unit = { +def create(dataset: String, props: Map[String, Any] = Map.empty): Unit = { try { logger.info(s"Creating dataset: $dataset") - kvStore.create(dataset) + kvStore.create(dataset, props) logger.info(s"Successfully created dataset: $dataset") } catch { case e: Exception =>spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala (2)
Line range hint
186-189
: Enhance drift metrics test coverageThe drift metrics test section needs improvement:
- It lacks assertions to verify the expected behavior
- Only logs the response without validating the content
- Missing test cases for edge cases and error scenarios
Consider adding comprehensive test cases:
// Example structure for enhanced drift metrics testing val driftRequest = StatsRequest(joinConf.metaData.nameToFilePath + "/drift", None, None) val fetchedDriftSeries = fetchStatsSeries(driftRequest, mockApi) // Add assertions assert(fetchedDriftSeries.values.isDefined, "Drift metrics response should contain values") val driftValues = fetchedDriftSeries.values.get // Verify specific metrics assert(driftValues.contains("drift_score"), "Response should contain drift score") assert(driftValues("drift_score").asInstanceOf[Double] >= 0.0, "Drift score should be non-negative") // Test edge cases val emptyDriftRequest = StatsRequest(joinConf.metaData.nameToFilePath + "/drift", Some(0L), Some(0L)) val emptyDriftSeries = fetchStatsSeries(emptyDriftRequest, mockApi) assert(emptyDriftSeries.values.isEmpty, "Empty time range should return no drift metrics")
Line range hint
67-189
: Improve test maintainability and readabilityThe test method
testFetchStats
is quite long (120+ lines) and handles multiple concerns. Consider breaking it down into smaller, focused test methods:
- Setup methods for data generation
- Separate test methods for each type of metric
- Helper methods for common assertions
Example structure:
class FetchStatsTest extends TestCase { // Setup methods private def setupTestData(): (String, String) = { // Extract data setup logic here (itemQueriesTable, viewsTable) } private def setupJoinConfiguration(itemQueriesTable: String, viewsTable: String): Builders.Join = { // Extract join configuration logic here } // Individual test methods def testBasicStats(): Unit = { val (itemQueriesTable, viewsTable) = setupTestData() val joinConf = setupJoinConfiguration(itemQueriesTable, viewsTable) // Test basic stats fetching } def testLogStats(): Unit = { // Test log stats fetching } def testConsistencyMetrics(): Unit = { // Test consistency metrics } def testDriftMetrics(): Unit = { // Comprehensive drift metrics testing } }api/src/main/scala/ai/chronon/api/Builders.scala (1)
270-271
: Consider grouping related parametersThe
historicalBackfill
parameter appears to be related to other historical/backfill parameters. Consider moving it next to related parameters for better organization and readability.Consider reordering parameters to group related ones together, for example moving
historicalBackfill
nearbackfillStartDate
in theGroupBy.apply
method.online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala (3)
13-168
: Consider enhancing test coverage with edge casesWhile the current test coverage is comprehensive, consider adding these scenarios:
- Extreme values in percentiles (e.g., Double.MAX_VALUE, Double.MIN_VALUE)
- Performance test with larger datasets
- Documentation for complex test scenarios (e.g., the histogram merging logic)
170-315
: Consider adding property-based tests for drift metricsThe drift tests are thorough, but consider these enhancements:
- Add property-based tests using ScalaCheck to verify drift calculations
- Test boundary conditions (e.g., 100% drift, negative drifts if applicable)
- Include examples with known drift calculations to serve as documentation
Example property-based test:
import org.scalacheck.Prop.forAll import org.scalacheck.Gen "pivot_drift" should "maintain value integrity under any valid input" in { val driftGen = for { percentilDrift <- Gen.choose(0.0, 1.0) histogramDrift <- Gen.choose(0.0, 1.0) timestamp <- Gen.posNum[Long] } yield { val drift = new TileDrift() drift.setPercentilDrift(percentilDrift) drift.setHistogramDrift(histogramDrift) (drift, timestamp) } forAll(Gen.listOf(driftGen)) { drifts => val result = pivot(drifts.toArray) // Property: Input values should be preserved in output series drifts.sortBy(_._2).map(_._1.getPercentilDrift) == result.getPercentilDriftSeries.asScala.toList } }
1-315
: Consider improving test maintainabilityWhile the tests are well-structured, consider these improvements for better maintainability:
- Extract common test data setup into helper methods or fixtures
- Use more descriptive test names that follow the "given/when/then" pattern
- Add comments explaining the expected behavior for complex scenarios
Example helper method:
private def createDriftWithValues( percentilDrift: Double, histogramDrift: Double, countChange: Double ): TileDrift = { val drift = new TileDrift() drift.setPercentilDrift(percentilDrift) drift.setHistogramDrift(histogramDrift) drift.setCountChangePercent(countChange) drift }spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (2)
85-86
: Consider thread safety for mutable map operationsThe use of mutable state in
columnTileSummaries
could lead to race conditions if this code is executed in a parallel context. Consider:
- Using immutable data structures
- Adding synchronization if parallel execution is possible
114-141
: Consider simplifying Java interop codeWhile the Java interop code is functional, it could be simplified using Scala's built-in converters:
- row.getSeq[Double](index).map(lang.Double.valueOf).toJava + row.getSeq[Double](index).asJavaThe pattern matching is comprehensive and type-safe, handling all metric types appropriately.
api/thrift/api.thrift (4)
Line range hint
253-267
: Add documentation for the TileSummary struct and its fields.The struct would benefit from documentation explaining:
- The purpose and usage of each field
- The format and meaning of percentile values
- The relationship between container types and inner counts
Add documentation like this:
+/** + * Statistical summary of data within a time tile + */ struct TileSummary { + /** Percentile values for numeric data */ + /** Distribution of categorical values */ // ... add docs for remaining fields }
268-284
: Add formal documentation and consider type safety improvements.
- The inline comment about pivot operation should be formalized into proper documentation
- Consider adding constraints or validation for list lengths matching timestamp count
Add documentation like this:
+/** + * Time series of statistical summaries, created by pivoting (TileSummary, timestamp) tuples + * All lists in this struct must have the same length as the timestamps list + */ struct TileSummarySeries {
339-346
: Add documentation for the new DriftSpec fields.The new fields would benefit from more detailed documentation explaining:
- The relationship between tileSize and lookbackWindows
- The impact of different window sizes on drift detection
Add documentation like this:
+ /** Size of the time window for computing tile summaries */ + /** Windows sizes for comparing current tile with historical data */ + /** Default metric for computing drift between tiles */
380-382
: Add documentation for the new driftSpec field.Add documentation like this:
+ /** Configuration for computing and monitoring data drift */
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (3)
30-36
: Enhance schema validation error messages for better clarityThe current
require
statements provide basic error messages which may not give sufficient context when validation fails.Improve the error messages to include the actual data types encountered, aiding in quicker debugging.
Apply this diff:
require(missingColumns.isEmpty, s"Missing required columns: ${missingColumns.mkString(", ")}") -require(summaryDF.schema(keyIndex).dataType == types.BinaryType, "keyBytes must be BinaryType") -require(summaryDF.schema(valueIndex).dataType == types.BinaryType, "valueBytes must be BinaryType") -require(summaryDF.schema(timestampIndex).dataType == types.LongType, "timestamp must be LongType") +require( + summaryDF.schema(keyIndex).dataType == types.BinaryType, + s"Column 'keyBytes' must be BinaryType but found ${summaryDF.schema(keyIndex).dataType}" +) +require( + summaryDF.schema(valueIndex).dataType == types.BinaryType, + s"Column 'valueBytes' must be BinaryType but found ${summaryDF.schema(valueIndex).dataType}" +) +require( + summaryDF.schema(timestampIndex).dataType == types.LongType, + s"Column 'timestamp' must be LongType but found ${summaryDF.schema(timestampIndex).dataType}" +)
16-22
: Consider initializingKVStore
outside of the partition to avoid repeated object creationCurrently,
kvStore
is initialized within each partition, which could lead to multiple unnecessary initializations.If
KVStore
is thread-safe and can be shared across partitions, consider initializing it once outside theforeachPartition
to improve efficiency.Apply this diff:
def run(): Unit = { // Validate schema // ... schema validation code ... + val kvStore: KVStore = api.genKvStore + summaryDF.rdd.foreachPartition((rows: Iterator[Row]) => { - val kvStore: KVStore = api.genKvStore def toPutRequest(row: Row): PutRequest = { // ... toPutRequest implementation ... } // ... rest of the code ... }) }
13-14
: Consolidate imports for better code organizationThe imports for
scala.util.Failure
andscala.util.Success
can be combined into a single statement for brevity.Apply this diff:
-import scala.util.Failure -import scala.util.Success +import scala.util.{Failure, Success}online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala (4)
124-125
: Replace magic numbers with named constants for clarityThe small value
1e-5
used to adjust zero probabilities in thepsiTerm
method can be defined as a named constant to improve code readability and maintainability.Consider introducing a named constant:
private val EPSILON = 1e-5And update the code:
val aFixed = if (a == 0.0) 1e-5 else a val bFixed = if (b == 0.0) 1e-5 else b +val aFixed = if (a == 0.0) EPSILON else a +val bFixed = if (b == 0.0) EPSILON else bThis makes it easier to adjust the epsilon value in the future and documents its purpose.
95-95
: Optimize thenormalize
method by normalizing in placeThe TODO comment suggests optimizing the
normalize
method by performing normalization in place if it's a performance hotspot. Implementing this can reduce memory allocations and improve performance.Here's how you might modify the method to normalize in place:
private def normalize(arr: Array[Double]): Array[Double] = { val sum = arr.sum + if (sum == 0.0) { + // Handle zero-sum case by returning an array of zeros + return Array.fill(arr.length)(0.0) + } - val result = Array.ofDim[Double](arr.length) var i = 0 while (i < arr.length) { - result.update(i, arr(i) / sum) + arr.update(i, arr(i) / sum) i += 1 } - result + arr }This modifies the input array directly, which can improve performance by avoiding the creation of a new array.
130-135
: Ensure exhaustive pattern matching intermFunc
methodThe
termFunc
method matches onDriftMetric
, but does not provide a default case. IfDriftMetric
can have new values in the future, consider adding a default case to handle unexpected values.Add a default case to prevent potential
MatchError
exceptions:d match { case DriftMetric.PSI => psiTerm case DriftMetric.HELLINGER => hellingerTerm case DriftMetric.JENSEN_SHANNON => jsdTerm + case _ => throw new IllegalArgumentException(s"Unsupported DriftMetric: $d") }
Alternatively, if
DriftMetric
is a sealed trait or enum and all cases are covered, you can ignore this suggestion.
146-151
: Ensure exhaustive pattern matching inthresholds
methodSimilar to
termFunc
, thethresholds
method matches onDriftMetric
without a default case. Ensure that all possible values are handled or add a default case.Consider adding a default case:
d match { case DriftMetric.JENSEN_SHANNON => Thresholds(0.05, 0.15) case DriftMetric.HELLINGER => Thresholds(0.05, 0.15) case DriftMetric.PSI => Thresholds(0.1, 0.2) + case _ => throw new IllegalArgumentException(s"Unsupported DriftMetric: $d") }
This prevents potential runtime exceptions if new
DriftMetric
values are introduced.online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1)
121-122
: Optimize data fetching to prevent over-fetching ingetDriftSeries
The TODO comment suggests potential over-fetching when
lookBack
is significantly larger than the range betweenstartMs
andendMs
. This can lead to unnecessary resource consumption and decreased performance.Would you like assistance in implementing an optimization to adjust the fetched data range based on the
lookBack
window and the actual time range (startMs
toendMs
)? This could involve calculating a more precise fetch window or merging summaries efficiently.online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (2)
77-79
: Consider avoiding custom operators for better readabilityThe custom pipe operator
|>
defined inPipeOps
may reduce code readability, especially for developers unfamiliar with this pattern. Using standard method chaining or existing operators can improve maintainability and team comprehension.Remove the custom operator and adjust the chained calls:
- implicit class PipeOps[A](val value: A) extends AnyVal { - def |>[B](f: A => B): B = f(value) - }And update usages, for example:
- .setPercentiles(iterator(_.getPercentiles).toArray |> transpose) + .setPercentiles(transpose(iterator(_.getPercentiles).toArray))
16-18
: Update or remove outdated TODO commentThe TODO comment on line 18 suggests that
TileDrift => TileDriftSeries
implementation is pending. Sincepivot(driftsWithTimestamps: Array[(TileDrift, Long)])
is now implemented, consider updating or removing this comment to reflect the current state of the code.Apply this diff to remove the outdated comment:
- // TODO: TileDrift => TileDriftSeries
online/src/main/scala/ai/chronon/online/stats/Display.scala (3)
32-36
: Use a predefined color palette for consistent chart series colorsThe current implementation uses randomly generated colors for each series, which might result in similar or hard-to-distinguish colors. Using a predefined color palette ensures each series has a distinct and recognizable color.
Apply this diff to use a predefined color palette:
+ val colors = Seq("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b") - val seriesConfig = chart.seriesList.map(s => s"""{ + val seriesConfig = chart.seriesList.zipWithIndex.map { case (s, i) => s"""{ | label: "${s.name}", - | stroke: "rgb(${scala.util.Random.nextInt(255)}, ${scala.util.Random.nextInt(255)}, ${scala.util.Random.nextInt(255)})" + | stroke: "${colors(i % colors.length)}" | - |}""".stripMargin).mkString(",\n") + |}""".stripMargin }.mkString(",\n")
174-175: Remove commented-out code to improve readability. The commented-out line // page.render is unnecessary and can be removed to keep the codebase clean and maintainable.

Apply this diff to remove the commented code:

- // page.render
182-183: Bind the server to localhost for enhanced security. Currently, the server listens on all network interfaces (0.0.0.0). Since this tool is intended for debugging purposes, consider binding the server to localhost (127.0.0.1) to prevent external access.

Apply this diff to bind the server to localhost:

  override def host: String =
-   "0.0.0.0"
+   "127.0.0.1"
  override def port: Int = portVal

api/src/main/scala/ai/chronon/api/Extensions.scala (1)
163-169: Simplify the driftTileSize method using flatMap. The driftTileSize method can be simplified by utilizing flatMap to avoid nested Options and make the code more concise.

Here's the suggested refactor:

-def driftTileSize: Option[Window] = {
-  Option(metaData.getDriftSpec) match {
-    case Some(driftSpec) =>
-      Option(driftSpec.getTileSize).orElse(Some(Constants.DefaultDriftTileSize))
-    case None => None
-  }
+def driftTileSize: Option[Window] =
+  Option(metaData.getDriftSpec)
+    .flatMap(driftSpec => Option(driftSpec.getTileSize))
+    .orElse(Some(Constants.DefaultDriftTileSize))

spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (2)
119-124: Prefer Scala collections over Java collections for consistency. Consider using Scala's collections and converters instead of Java's ArrayList to create the windows list. This promotes idiomatic Scala code and enhances readability.

Apply this diff to use Scala's List and convert it to a Java list:

+import scala.collection.JavaConverters._
 val driftSpec = new DriftSpec()
 driftSpec.setTileSize(new Window(30, TimeUnit.MINUTES))
 driftSpec.setDriftMetric(DriftMetric.JENSEN_SHANNON)
-val windows = new java.util.ArrayList[Window]()
-windows.add(new Window(30, TimeUnit.MINUTES))
+val windows = List(new Window(30, TimeUnit.MINUTES))
 driftSpec.setLookbackWindows(windows.asJava)
294-294: Avoid code duplication by reusing generateNonOverlappingWindows. The anomalyWindows variable is generated in both generateTimeseriesWithAnomalies and generateFraudSampleData. Consider refactoring to avoid duplication and maintain consistency.

You might extract the anomaly window generation into a shared method or pass anomalyWindows as a parameter to reuse the logic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (43)
- .github/workflows/test_scala_and_python.yaml (2 hunks)
- api/src/main/scala/ai/chronon/api/Builders.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/Constants.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/Extensions.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (2 hunks)
- api/thrift/api.thrift (4 hunks)
- build.sbt (1 hunks)
- cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2 hunks)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1 hunks)
- hub/app/controllers/TimeSeriesController.scala (4 hunks)
- hub/app/model/Model.scala (1 hunks)
- hub/app/store/DynamoDBMonitoringStore.scala (1 hunks)
- hub/conf/routes (1 hunks)
- online/src/main/scala/ai/chronon/online/Api.scala (3 hunks)
- online/src/main/scala/ai/chronon/online/Fetcher.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/FetcherBase.scala (2 hunks)
- online/src/main/scala/ai/chronon/online/MetadataStore.scala (3 hunks)
- online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/Display.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala (0 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (4 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/ChainingFetcherTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/FetcherTest.scala (5 hunks)
- spark/src/test/scala/ai/chronon/spark/test/InMemoryKvStore.scala (3 hunks)
- spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/SchemaEvolutionTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/StreamingTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/DerivationTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (3 hunks)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (5 hunks)
💤 Files with no reviewable changes (1)
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala
✅ Files skipped from review due to trivial changes (1)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala
🚧 Files skipped from review as they are similar to previous changes (6)
- api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala
- online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
- online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala
- spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
🧰 Additional context used
📓 Learnings (3)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (6)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Errors and logging are handled at a granular level in `handleDynamoDbOperation` by passing the metrics context and updating relevant metric values based on the invoking operation.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Leveraging DynamoDB batch writes for performance optimization in `multiPut` will be addressed in a future review.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
online/src/main/scala/ai/chronon/online/MetadataStore.scala (1)
Learnt from: chewy-zlai
PR: zipline-ai/chronon#47
File: online/src/main/scala/ai/chronon/online/MetadataStore.scala:232-0
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In the `KVStore` trait located at `online/src/main/scala/ai/chronon/online/KVStore.scala`, there are two `create` methods: `def create(dataset: String): Unit` and `def create(dataset: String, props: Map[String, Any]): Unit`. The version with `props` ignores the `props` parameter, and the simpler version without `props` is appropriate when `props` are not needed.
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (3)
Learnt from: chewy-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:37-40
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Avoid using `Await.result` in production code; prefer handling `Future`s asynchronously when possible to prevent blocking.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Spark applications, when defining serializable classes, passing an implicit `ExecutionContext` parameter can cause serialization issues. In such cases, it's acceptable to use `scala.concurrent.ExecutionContext.Implicits.global`.
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:19-47
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Scala, the `grouped` method on collections returns an iterator, allowing for efficient batch processing without accumulating all records in memory.
🪛 actionlint
.github/workflows/test_scala_and_python.yaml
52-52: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (48)
hub/app/model/Model.scala (2)
12-14
: Review the tile processing pipeline design
The proposed tile processing pipeline introduces several critical components:
- Fetching TileSummaries from tileKey (1:n relationship)
- Comparing TileSummaries to generate TileDrift
- Integration with TimeSeriesController using column-to-tileDrift mapping
Consider the following architectural recommendations:
- Ensure the TileSummaries fetching is optimized for the 1:n relationship
- Consider caching strategies for frequently accessed tiles
- Design the TileDrift comparison to be parallelizable (a sketch follows below)
- Plan for error handling in the TimeSeriesController integration
This aligns with the existing metric types (Drift, Skew) and algorithms (PSI, KL) but may require additional drift algorithms based on the comparison requirements.
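For the parallelization point above, a minimal sketch of a column-parallel comparison step; compare stands in for whatever TileSummary-to-TileDrift function the pipeline ends up using, and all names here are illustrative rather than existing APIs:

import scala.concurrent.{ExecutionContext, Future}

// compares old vs. new summaries column by column, each comparison running independently
def driftByColumn[S, D](oldSummaries: Map[String, S],
                        newSummaries: Map[String, S],
                        compare: (S, S) => D)(implicit ec: ExecutionContext): Future[Map[String, D]] = {
  val perColumn = newSummaries.toSeq.collect {
    case (column, current) if oldSummaries.contains(column) =>
      Future(column -> compare(oldSummaries(column), current))
  }
  Future.sequence(perColumn).map(_.toMap)
}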
8-11
: Verify the metadata upload and column listing functionality
The proposed workflow suggests significant architectural changes:
- Mapping join configurations to group-by columns
- New methods for fetching join configurations
- Generation of tile keys from columns and time parameters
These changes will require careful integration with existing data structures. Please ensure:
- The GroupBy and Join case classes can support the proposed metadata mapping
- The column listing functionality aligns with the existing joinFeatures and features fields
hub/conf/routes (2)
6-8
: Add parameter documentation and validation.
The new model timeseries endpoint lacks documentation about:
- Valid ranges for startTs and endTs
- Supported values for offset and algorithm
- Response format and possible error cases
13-16
: Clarify granularity parameter and ensure consistency with join endpoints.
The feature endpoint introduces a granularity parameter not present in join endpoints:
- Its purpose and valid values should be documented
- Consider whether this parameter should also apply to join endpoints for consistency
api/src/main/scala/ai/chronon/api/Constants.scala (2)
66-68
: LGTM: Well-structured drift-related constants
The new constants follow consistent naming conventions and are logically grouped together. This provides a good foundation for the drift metrics implementation.
40-40
: Verify the impact of metadata key rename
The constant has been renamed from ChrononMetadataKey
to MetadataDataset
with a different value. This is a breaking change that needs careful verification.
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (2)
9-9
: LGTM: Import additions align with new functionality
The new imports for DriftStore
and InMemoryKvStore
are appropriate for the test modifications.
Also applies to: 17-17
77-77
: Verify test coverage with reduced sample size
The sample size has been reduced from 1M to 100K rows. While this improves test execution time, ensure it still provides adequate coverage for edge cases and anomaly detection scenarios.
Also applies to: 80-80
spark/src/test/scala/ai/chronon/spark/test/StreamingTest.scala (2)
113-113
: LGTM! Usage is consistent with import change.
The constant usage in inMemoryKvStore.create()
is correctly updated to match the import change.
23-23
: LGTM! Verify consistent usage across codebase.
The import change from ChrononMetadataKey
to MetadataDataset
improves naming clarity.
Let's verify consistent usage across the codebase:
✅ Verification successful
Verified consistent usage of MetadataDataset across the codebase. All references to ChrononMetadataKey have been successfully replaced with MetadataDataset.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to the old constant name
# and verify consistent usage of the new constant.
# Check for any remaining references to old constant
echo "Checking for remaining ChrononMetadataKey references:"
rg "ChrononMetadataKey"
# Check usage pattern of new constant
echo "Verifying MetadataDataset usage:"
rg "MetadataDataset"
Length of output: 3429
.github/workflows/test_scala_and_python.yaml (1)
67-67
: Verify memory requirements for drift metrics tests
The significant increase in memory allocation (from 8G to 24G) suggests these tests have substantial memory requirements. While this aligns with the drift metrics implementation, we should verify this is the minimum required memory to prevent resource waste.
online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala (1)
1-11
: LGTM! Test class setup follows best practices.
The test class is properly structured with appropriate imports and test framework extensions.
spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala (2)
18-18
: LGTM: Import statement updated for better clarity
The change from ChrononMetadataKey
to MetadataDataset
provides a more descriptive name that better reflects its purpose.
97-97
: LGTM: Constant usage updated consistently
The usage of MetadataDataset
is consistent with the import change. The test's behavior remains unchanged.
Let's verify this refactoring is consistent across the codebase:
✅ Verification successful
Verification Successful: Consistent Usage of MetadataDataset Confirmed
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the consistent usage of MetadataDataset across the codebase
# and ensure no lingering references to the old ChrononMetadataKey
# Check for any remaining references to the old constant
echo "Checking for any remaining references to ChrononMetadataKey..."
rg "ChrononMetadataKey"
# Verify the usage of new constant
echo "Verifying usage of MetadataDataset..."
rg "MetadataDataset"
Length of output: 3457
spark/src/test/scala/ai/chronon/spark/test/InMemoryKvStore.scala (2)
45-45
: LGTM! Explicit type annotation improves clarity.
The addition of explicit type annotation for the encoder enhances type safety and code readability.
120-120
: LGTM! Improved thread safety with computeIfAbsent.
The change from put
to computeIfAbsent
makes the dataset creation atomic and prevents potential race conditions in concurrent scenarios.
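To make the atomicity point concrete, here is a tiny standalone illustration of the pattern (not the project's actual InMemoryKvStore code; the map types are assumptions):

import java.util.concurrent.ConcurrentHashMap

// two threads calling create("x") concurrently still end up with a single inner map,
// because computeIfAbsent runs the factory at most once per key
val database = new ConcurrentHashMap[String, ConcurrentHashMap[String, Array[Byte]]]()

def create(dataset: String): Unit =
  database.computeIfAbsent(dataset, _ => new ConcurrentHashMap[String, Array[Byte]]())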
spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (1)
160-160
: LGTM! Verify constant usage consistency.
The rename from ChrononMetadataKey
to MetadataDataset
aligns with the standardization effort.
✅ Verification successful
Verified! All instances of ChrononMetadataKey
have been successfully updated to MetadataDataset
.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify that all occurrences of ChrononMetadataKey have been updated
rg "ChrononMetadataKey" --type scala
Length of output: 36
online/src/main/scala/ai/chronon/online/MetadataStore.scala (3)
19-19
: LGTM: Import change aligns with the metadata dataset refactoring
The addition of MetadataDataset
import is consistent with the broader changes to standardize metadata handling across the codebase.
204-204
: LGTM: Dataset parameter change is consistent
The default value change from ChrononMetadataKey
to MetadataDataset
in the put
method aligns with the constructor change.
44-44
: Verify the impact of changing the default dataset value
The constructor's default dataset value has changed from ChrononMetadataKey
to MetadataDataset
. While this aligns with the refactoring, we should verify that all existing usages are updated.
✅ Verification successful
Change is Verified: Default Dataset Value Updated Successfully
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find all MetadataStore instantiations to verify they handle the dataset parameter change
# Search for MetadataStore instantiations
echo "Searching for MetadataStore instantiations..."
rg -l "new MetadataStore" | while read -r file; do
echo "=== $file ==="
rg -A 2 "new MetadataStore" "$file"
done
# Search for potential references to the old ChrononMetadataKey
echo -e "\nSearching for potential references to ChrononMetadataKey..."
rg "ChrononMetadataKey"
Length of output: 4382
spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala (1)
23-23
: LGTM: Constant renaming improves clarity
The renaming from ChrononMetadataKey
to MetadataDataset
is a good change that makes the constant's purpose clearer and more descriptive.
Also applies to: 142-142
build.sbt (2)
126-127
: Verify architectural alignment of new web dependencies
The addition of cask
and scalatags
suggests implementing web-based visualization for drift metrics. However, this project already uses Play Framework in the hub
module. Consider:
- Whether this creates unnecessary framework duplication
- If these visualizations should be integrated into the existing Play-based frontend instead
Let's verify the usage of these dependencies:
#!/bin/bash
# Search for cask/scalatags usage to understand the implementation approach
rg -l "import cask\."
rg -l "import scalatags\."
# Check if there are any existing web endpoints in the online module
rg -l "extends cask.Routes"
125-127
: Verify dependency versions for security vulnerabilities
Let's ensure the new dependencies are using secure versions.
✅ Verification successful
Dependencies are secure
No known security vulnerabilities found for the specified dependencies.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for security advisories and latest versions
gh api graphql -f query='
{
securityVulnerabilities(first: 5, ecosystem: MAVEN, package: "com.lihaoyi:cask_2.12") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}'
# Check Maven Central for latest versions
curl -s "https://search.maven.org/solrsearch/select?q=g:%22com.lihaoyi%22+AND+a:%22cask_2.12%22&rows=1&wt=json" | jq '.response.docs[0].latestVersion'
curl -s "https://search.maven.org/solrsearch/select?q=g:%22com.lihaoyi%22+AND+a:%22scalatags_2.12%22&rows=1&wt=json" | jq '.response.docs[0].latestVersion'
Length of output: 684
Script:
#!/bin/bash
# Check for security advisories for scalatags
gh api graphql -f query='
{
securityVulnerabilities(first: 5, ecosystem: MAVEN, package: "com.lihaoyi:scalatags_2.12") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}'
Length of output: 361
online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala (1)
1-11
: LGTM! Well-structured test file setup
The file follows Scala testing best practices with appropriate imports and test framework setup.
spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (3)
6-6
: LGTM: Import change aligns with TileSummary refactoring
The import statement has been updated to reflect the renamed type, maintaining consistency with the broader refactoring effort.
Line range hint 93-101
: LGTM: Type-safe summary population logic
The summary population logic maintains type safety while adapting to the new TileSummary
type. The iterator-based transformation is efficient and maintains referential transparency.
76-76
: Verify all TileRow usages for breaking change
The change from TileSummaries
to TileSummary
in the case class is a breaking change that affects serialization. This needs careful verification of:
- All code that creates or consumes TileRow instances
- Data migration strategy for existing data
✅ Verification successful
TileRow usage verified with no serialization impact found
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for TileRow usage patterns
echo "Searching for TileRow usage patterns..."
rg -l "TileRow" | while read -r file; do
echo "=== $file ==="
rg -A 5 "TileRow" "$file"
done
# Search for potential serialization code
echo "Checking for serialization related code..."
rg -l "TileRow.*(?:serialize|deserialize|fromJson|toJson)"
Length of output: 2262
online/src/main/scala/ai/chronon/online/Api.scala (2)
264-264
: Verify the metadata constant update across the codebase
The change from Constants.ChrononMetadataKey
to Constants.MetadataDataset
appears to be part of a broader standardization of metadata handling. Let's ensure this update is consistent across the codebase.
Also applies to: 275-275
49-49
:
Update comment to reflect parameter rename from afterTsMillis
to startTsMillis
The comment above the case class still references afterTsMillis
but the parameter has been renamed to startTsMillis
. This change affects the semantics of timestamp handling in scan requests.
Let's verify the impact of this breaking change:
spark/src/test/scala/ai/chronon/spark/test/ChainingFetcherTest.scala (2)
21-21
: LGTM: Import statement follows conventions
The import of MetadataDataset
from Constants
is correctly placed and follows Scala's import organization conventions.
247-247
: Verify metadata dataset creation across test suite
The change from ChrononMetadataKey
to MetadataDataset
appears to be part of a broader refactoring effort. While this change is correct, we should ensure consistent usage across all test cases.
✅ Verification successful
Metadata dataset creation across the test suite is consistent
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify consistent usage of MetadataDataset across test files
# Look for any remaining instances of ChrononMetadataKey that might have been missed
# Search for any remaining usage of ChrononMetadataKey
rg "ChrononMetadataKey" --type scala
# Search for consistent usage of MetadataDataset
rg "MetadataDataset" --type scala
Length of output: 3263
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1)
130-130
: Verify the impact of switching from afterTsMillis to startTsMillis
The change in partitioning logic from afterTsMillis
to startTsMillis
could affect query behavior. Ensure all callers have been updated to use startTsMillis
.
✅ Verification successful
Partitioning logic correctly updated to use startTsMillis
All active references to afterTsMillis
have been successfully replaced with startTsMillis
. Remaining instances are limited to comments and do not impact functionality.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for any remaining references to afterTsMillis that might need updating
# Search for afterTsMillis usage in Scala files
echo "Searching for afterTsMillis usage in Scala files..."
rg -t scala "afterTsMillis" -A 5
# Search for KVStore.GetRequest construction to verify parameter usage
echo "Searching for GetRequest construction patterns..."
ast-grep --pattern 'GetRequest($$$)' -l scala
Length of output: 4345
spark/src/test/scala/ai/chronon/spark/test/SchemaEvolutionTest.scala (2)
322-322
: LGTM: MetadataDataset usage in KV store initialization
The change from ChrononMetadataKey
to MetadataDataset
in the in-memory key-value store initialization is consistent with the refactoring pattern.
#!/bin/bash
# Description: Verify consistent initialization pattern across test files
# Expected: All test files should use MetadataDataset for KV store initialization
# Search for KV store initialization patterns in test files
rg "inMemoryKvStore\.create\(" --type scala -g "*Test.scala"
19-19
: Verify consistent usage of MetadataDataset import
The import change from ChrononMetadataKey
to MetadataDataset
appears to be part of a broader refactoring effort.
✅ Verification successful
Consistent usage of MetadataDataset confirmed
No remaining instances of ChrononMetadataKey
found. All usages appropriately updated to MetadataDataset
.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify consistent usage of MetadataDataset across the codebase
# Expected: All occurrences of ChrononMetadataKey should be replaced with MetadataDataset
# Check for any remaining usage of ChrononMetadataKey
rg "ChrononMetadataKey"
# Verify the new MetadataDataset usage pattern
rg "Constants\.MetadataDataset" -A 2
Length of output: 5962
api/thrift/api.thrift (1)
286-306
: Fix typo and standardize field naming as previously suggested.
The issues identified in previous reviews remain:
- Typo: percentilDrift should be percentileDrift
- Inconsistent naming: some fields use 'Change' while others use 'Drift'
Apply this diff to fix the issues:
struct TileDrift {
- 1: optional double percentilDrift
+ 1: optional double percentileDrift
- 3: optional double countChangePercent
- 4: optional double nullRatioChangePercent
+ 3: optional double countDrift
+ 4: optional double nullRatioDrift
- 5: optional double innerCountChangePercent
- 6: optional double innerNullCountChangePercent
+ 5: optional double innerCountDrift
+ 6: optional double innerNullCountDrift
}
spark/src/test/scala/ai/chronon/spark/test/bootstrap/DerivationTest.scala (1)
433-433
: LGTM! Verify constant renaming across the codebase.
The constant renaming from ChrononMetadataKey
to MetadataDataset
is consistent with the changes mentioned in the summary.
Let's verify that this constant renaming has been consistently applied across the codebase:
✅ Verification successful
LGTM! Constant renaming successfully verified across the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the old constant name is not used anywhere in the codebase
# and the new constant name is used consistently.
echo "Checking for any remaining usage of the old constant name..."
rg "ChrononMetadataKey"
echo "Verifying usage of the new constant name..."
rg "MetadataDataset"
Length of output: 3465
online/src/main/scala/ai/chronon/online/FetcherBase.scala (3)
Line range hint 57-61
: LGTM: Constructor parameter update looks good.
The change from ChrononMetadataKey
to MetadataDataset
as the default value for metaDataSet
parameter is consistent with the changes mentioned in the summary.
24-24
: LGTM: Import statement correctly added.
The import for MetadataDataset
from the Constants object is properly placed and aligns with the constructor parameter change.
Line range hint 57-61
: Verify test coverage for metadata handling.
Since this class is critical for data fetching and processing, please ensure that existing tests cover the metadata parameter change. This will help verify that the switch from ChrononMetadataKey
to MetadataDataset
doesn't affect data retrieval operations.
✅ Verification successful
Test Coverage Verified for Metadata Handling
Existing tests in FetcherBaseTest.scala
cover the changes to the metaDataSet
parameter, ensuring data retrieval operations are unaffected.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for test files covering FetcherBase metadata handling
# Look for test files related to FetcherBase
fd -e scala "FetcherBase.*Test.scala|Test.*FetcherBase.scala"
# Search for metadata-related test cases
rg -l "metaDataSet|ChrononMetadataKey|MetadataDataset" $(fd -e scala ".*Test.scala")
Length of output: 5872
spark/src/test/scala/ai/chronon/spark/test/FetcherTest.scala (3)
22-22
: LGTM: Import change aligns with metadata constant renaming.
The import of MetadataDataset
from Constants
is correctly placed and consistent with the refactoring mentioned in the summary.
97-98
: LGTM: Test data setup correctly uses new metadata constant.
The test data generation properly uses the new MetadataDataset
constant for creating test datasets.
625-625
: LGTM: Consistent usage of new metadata constant across test cases.
The metadata store creation consistently uses the new MetadataDataset
constant in both test scenarios, maintaining the same functionality while adhering to the updated naming convention.
Also applies to: 755-755
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (2)
41-44
: Ensure proper handling of null
values in toPutRequest
The toPutRequest
function substitutes null
values with empty arrays for keyBytes
and valueBytes
. While this prevents NullPointerException
, it may introduce issues if empty arrays are not valid inputs for the KVStore.
Verify whether keyBytes
and valueBytes
can be empty arrays. If not, consider modifying the code to handle null
values appropriately, possibly by filtering out such records or throwing an exception to prevent invalid data from being processed.
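If empty arrays turn out to be invalid inputs, one option is to make the conversion partial and drop (or count) such rows before batching. A minimal sketch, assuming the key and value bytes have already been pulled out of the row; the helper name and the skip-instead-of-throw choice are assumptions, not the current behavior:

// returns None for rows that cannot form a valid put request, so callers can flatMap
// over it and skip or instrument invalid records explicitly
def validPutBytes(keyBytes: Array[Byte], valueBytes: Array[Byte]): Option[(Array[Byte], Array[Byte])] =
  for {
    k <- Option(keyBytes) if k.nonEmpty // an empty key could never be read back
    v <- Option(valueBytes)
  } yield (k, v)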
6-22
: 🛠️ Refactor suggestion
Avoid using scala.concurrent.ExecutionContext.Implicits.global
in Spark jobs
Using the global execution context can lead to unpredictable behavior in Spark applications due to potential thread pool exhaustion or conflicts with Spark's own task scheduling.
Define a dedicated ExecutionContext
that's suitable for the workload and integrated with Spark's threading model.
Apply this diff to define a custom execution context:
-import scala.concurrent.ExecutionContext.Implicits.global
+import org.apache.spark.SparkContext
+import org.apache.spark.util.ThreadUtils
...
class SummaryUploader(summaryDF: DataFrame,
api: Api,
statsTableName: String = Constants.DriftStatsTable,
putsPerRequest: Int = 100,
requestsPerSecondPerExecutor: Int = 1000
)(implicit tu: TableUtils)
extends Serializable {
+ // Define an ExecutionContext using Spark's thread pool
+ implicit val executionContext = ExecutionContext.fromExecutorService(
+ SparkContext.getOrCreate().getExecutorMemoryStatus.keys.size match {
+ case 0 => ThreadUtils.newDaemonCachedThreadPool("SummaryUploader", 4)
+ case n => ThreadUtils.newDaemonCachedThreadPool("SummaryUploader", n * 4)
+ }
+ )
def run(): Unit = {
Ensure that you shut down the custom execution context when it's no longer needed to prevent resource leaks.
⛔ Skipped due to learnings
Learnt from: chewy-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:37-40
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Avoid using `Await.result` in production code; prefer handling `Future`s asynchronously when possible to prevent blocking.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Spark applications, when defining serializable classes, passing an implicit `ExecutionContext` parameter can cause serialization issues. In such cases, it's acceptable to use `scala.concurrent.ExecutionContext.Implicits.global`.
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:19-47
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Scala, the `grouped` method on collections returns an iterator, allowing for efficient batch processing without accumulating all records in memory.
online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala (1)
108-109
: Consider handling zero probabilities in klDivergenceTerm
In the klDivergenceTerm
method, when a > 0
and b
is zero, the result should theoretically be infinite. Currently, the method returns zero in this case. Consider whether this behavior aligns with the intended statistical interpretation and whether an adjustment is necessary.
If adjusting is needed, you might handle the case where b == 0.0
separately.
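If the zero-denominator case should register as drift rather than disappear, a common workaround is epsilon smoothing of the denominator. A rough sketch (the eps constant and function name are assumptions, not part of the existing DriftMetrics code):

// epsilon-smoothed KL term: a > 0 with b == 0 now contributes a large finite penalty
// instead of being dropped
def klTermSmoothed(a: Double, b: Double, eps: Double = 1e-9): Double =
  if (a <= 0.0) 0.0
  else a * math.log(a / math.max(b, eps))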
api/src/main/scala/ai/chronon/api/Extensions.scala (1)
915-921: Prevent NullPointerException by checking join.derivations in renamedColumns. Accessing join.derivations without a null check can lead to a NullPointerException if it's null. Ensure you safely handle this possibility.

Adjust the code to include a null check:
-Option(join.derivations)
- .map {
- _.toScala.renameOnlyDerivations.map(_.expression).toSet
- }
- .getOrElse(Set.empty)
+Option(join.derivations)
+ .map { derivations =>
+ derivations.toScala.renameOnlyDerivations.map(_.expression).toSet
+ }
+ .getOrElse(Set.empty)
Likely invalid or redundant comment.
spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (2)
7-8
: LGTM: Necessary imports added
The imports for DriftMetric
and DriftSpec
are appropriately added for the drift analysis enhancements.
134-138
: LGTM: driftSpec
integrated into metaData
The driftSpec
is correctly added to the metaData
in the join operation, enabling drift analysis configurations.
val DriftStatsTable: String = "drift_statistics"

val DefaultDriftTileSize: Window = new Window(30, TimeUnit.MINUTES)
Add missing Window import and document the tile size choice
Two issues to address:
- The Window class import is missing
- The 30-minute default tile size needs documentation explaining the rationale
Add the following import at the top of the file:
+ import ai.chronon.core.Window
Consider adding a comment explaining why 30 minutes was chosen as the default tile size for drift calculations.
Committable suggestion skipped: line range outside the PR's diff.
val driftStore = new DriftStore(api.genKvStore, namespace)
val tileKeys = driftStore.tileKeysForJoin(join)
println(tileKeys)
🛠️ Refactor suggestion
Add assertions for DriftStore results
The test prints tile keys but doesn't validate them. Consider adding assertions to verify the expected behavior.
val driftStore = new DriftStore(api.genKvStore, namespace)
val tileKeys = driftStore.tileKeysForJoin(join)
-println(tileKeys)
+// Add assertions
+assert(tileKeys.nonEmpty, "Tile keys should not be empty")
+assert(tileKeys.forall(_.contains(namespace)), s"All tile keys should contain namespace: $namespace")
📝 Committable suggestion
val driftStore = new DriftStore(api.genKvStore, namespace)
val tileKeys = driftStore.tileKeysForJoin(join)
// Add assertions
assert(tileKeys.nonEmpty, "Tile keys should not be empty")
assert(tileKeys.forall(_.contains(namespace)), s"All tile keys should contain namespace: $namespace")
val kvStoreFunc: () => KVStore = () => {
  // cannot reuse the variable - or serialization error
  val result = InMemoryKvStore.build("drift_test", () => null)
  result
}

val api = new MockApi(kvStoreFunc, namespace)
val kvStore = api.genKvStore
kvStore.create(Constants.MetadataDataset)
kvStore.create(Constants.DriftStatsTable)
🛠️ Refactor suggestion
Add cleanup for KVStore resources
The test creates KVStore tables but doesn't clean them up after the test. Consider adding cleanup in a finally
block or using a test fixture.
val kvStoreFunc: () => KVStore = () => {
val result = InMemoryKvStore.build("drift_test", () => null)
+ // Register cleanup hook
+ sys.addShutdownHook {
+ try {
+ result.drop(Constants.MetadataDataset)
+ result.drop(Constants.DriftStatsTable)
+ } catch {
+ case _: Exception => // Ignore cleanup errors
+ }
+ }
result
}
Committable suggestion skipped: line range outside the PR's diff.
@@ -49,7 +49,7 @@ jobs:
          path: api/py/htmlcov

   other_spark_tests:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-8_cores-32_gb
Critical: Invalid runner label will break CI
The runner label ubuntu-8_cores-32_gb
is not a standard GitHub-hosted runner. This will cause the CI workflow to fail.
For GitHub-hosted runners with more resources, use:
- ubuntu-latest-8-cores for 8 core runner
- ubuntu-latest-16-cores for 16 core runner
Apply this diff to use a valid GitHub-hosted runner:
- runs-on: ubuntu-8_cores-32_gb
+ runs-on: ubuntu-latest-8-cores
If you're using self-hosted runners:
- Ensure the runner label is properly configured
- Add an actionlint.yaml to specify custom runner labels
📝 Committable suggestion
runs-on: ubuntu-latest-8-cores
🧰 Tools
🪛 actionlint
52-52: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
Map(
  DriftMetric.JENSEN_SHANNON -> calculateDrift(DriftMetric.JENSEN_SHANNON),
  DriftMetric.PSI -> calculateDrift(DriftMetric.PSI),
  DriftMetric.HELLINGER -> calculateDrift(DriftMetric.HELLINGER)
)
Add tests for PSI metric.
The PSI (Population Stability Index) metric is calculated but not tested in any test cases.
Add assertions for PSI in test cases similar to JSD and Hellinger.
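A possible shape for the missing assertions, mirroring how the JSD and Hellinger results are checked elsewhere; the 0.1 bound is an assumed low-drift threshold for illustration, not the configured Thresholds value:

// assumes `drifts` is the Map built above via calculateDrift(...)
val psi = drifts(DriftMetric.PSI)
assert(psi >= 0.0, "PSI should be non-negative")
assert(psi < 0.1, s"PSI for near-identical distributions should be small, got $psi")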
def derivedColumns: Array[String] =
  Option(join.getDerivations) match {
    case Some(derivations) =>
      derivations.toScala.flatMap { derivation =>
        derivation.getName match {
          case "*" => None
          case _ => Some(derivation.getName)
        }
      }
        .flatMap(_.toSet))
        .getOrElse(Seq.empty)
      }.toArray
    case None => Array.empty
  }
Handle potential null names in derivedColumns. In the derivedColumns method, derivation.getName could be null, leading to a NullPointerException. Ensure that null names are appropriately handled.

Modify the code to filter out null names:
-derivations.toScala.flatMap { derivation =>
- derivation.getName match {
- case "*" => None
- case _ => Some(derivation.getName)
- }
-}.toArray
+derivations.toScala.flatMap { derivation =>
+ Option(derivation.getName) match {
+ case Some("*") => None
+ case Some(name) => Some(name)
+ case None => None
+ }
+}.toArray
📝 Committable suggestion
def derivedColumns: Array[String] =
  Option(join.getDerivations) match {
    case Some(derivations) =>
      derivations.toScala.flatMap { derivation =>
        Option(derivation.getName) match {
          case Some("*") => None
          case Some(name) => Some(name)
          case None => None
        }
      }.toArray
    case None => Array.empty
  }
def outputValueColumnsMap: Map[String, Array[String]] = {
  val preDeriveCols = (joinPartColumns ++ externalPartColumns)
  val preDerivedWithoutRenamed = preDeriveCols.mapValues(_.filterNot(renamedColumns.contains))
  val derivedColumns: Array[String] = Option(join.derivations) match {
    case Some(derivations) => derivations.toScala.map { _.getName }.filter(_ == "*").toArray
    case None => Array.empty
  }
  preDerivedWithoutRenamed ++ Map("derivations" -> derivedColumns)
}
🛠️ Refactor suggestion
Avoid using mapValues to prevent lazy evaluation issues. The mapValues method in Scala returns a view, which can lead to unexpected behavior if the underlying map changes. It's safer to use map to create a new map instead.

Replace mapValues with map:
-val preDerivedWithoutRenamed = preDeriveCols.mapValues(_.filterNot(renamedColumns.contains))
+val preDerivedWithoutRenamed = preDeriveCols.map { case (k, v) =>
+ k -> v.filterNot(renamedColumns.contains)
+}
📝 Committable suggestion
def outputValueColumnsMap: Map[String, Array[String]] = {
  val preDeriveCols = (joinPartColumns ++ externalPartColumns)
  val preDerivedWithoutRenamed = preDeriveCols.map { case (k, v) =>
    k -> v.filterNot(renamedColumns.contains)
  }
  val derivedColumns: Array[String] = Option(join.derivations) match {
    case Some(derivations) => derivations.toScala.map { _.getName }.filter(_ == "*").toArray
    case None => Array.empty
  }
  preDerivedWithoutRenamed ++ Map("derivations" -> derivedColumns)
}
val joinPartKeys = join.joinParts.toScala.flatMap(_.groupBy.keyColumns.toScala).toSet
val externalKeys = join.onlineExternalParts.toScala.flatMap(_.source.keyNames).toSet
val bootstrapKeys = join.bootstrapParts.toScala.flatMap(_.keyColumns.toScala).toSet
(joinPartKeys ++ externalKeys ++ bootstrapKeys).toArray
Ensure safe access to collections in keyColumns to prevent NullPointerException. Accessing join.onlineExternalParts or join.bootstrapParts without null checks can lead to NullPointerException if they are null. Safely handle these potential null values.
Modify the code to include null checks:
-val externalKeys = join.onlineExternalParts.toScala.flatMap(_.source.keyNames).toSet
-val bootstrapKeys = join.bootstrapParts.toScala.flatMap(_.keyColumns.toScala).toSet
+val externalKeys = Option(join.onlineExternalParts)
+ .map(_.toScala.flatMap(_.source.keyNames).toSet)
+ .getOrElse(Set.empty)
+val bootstrapKeys = Option(join.bootstrapParts)
+ .map(_.toScala.flatMap(_.keyColumns.toScala).toSet)
+ .getOrElse(Set.empty)
📝 Committable suggestion
val joinPartKeys = join.joinParts.toScala.flatMap(_.groupBy.keyColumns.toScala).toSet
val externalKeys = Option(join.onlineExternalParts)
  .map(_.toScala.flatMap(_.source.keyNames).toSet)
  .getOrElse(Set.empty)
val bootstrapKeys = Option(join.bootstrapParts)
  .map(_.toScala.flatMap(_.keyColumns.toScala).toSet)
  .getOrElse(Set.empty)
(joinPartKeys ++ externalKeys ++ bootstrapKeys).toArray
def externalPartColumns: Map[String, Array[String]] =
  Option(join.onlineExternalParts) match {
    case Some(parts) =>
      parts.toScala.map { part =>
        val keys = part.source.getKeySchema.params.toScala.map(_.name)
        val values = part.source.getValueSchema.params.toScala.map(_.name)
        part.fullName -> (keys ++ values).toArray
      }.toMap
    case None => Map.empty
  }
Ensure safe handling of potential null values in externalPartColumns. When accessing part.source.getKeySchema and part.source.getValueSchema, there is a possibility of encountering null values, which could lead to a NullPointerException. It's important to safely handle these potential nulls.

Consider modifying the code to safely handle null values:
-parts.toScala.map { part =>
- val keys = part.source.getKeySchema.params.toScala.map(_.name)
- val values = part.source.getValueSchema.params.toScala.map(_.name)
- part.fullName -> (keys ++ values).toArray
+parts.toScala.map { part =>
+ val keys = Option(part.source.getKeySchema)
+ .map(_.params.toScala.map(_.name))
+ .getOrElse(Seq.empty)
+ val values = Option(part.source.getValueSchema)
+ .map(_.params.toScala.map(_.name))
+ .getOrElse(Seq.empty)
+ part.fullName -> (keys ++ values).toArray
📝 Committable suggestion
def externalPartColumns: Map[String, Array[String]] =
  Option(join.onlineExternalParts) match {
    case Some(parts) =>
      parts.toScala.map { part =>
        val keys = Option(part.source.getKeySchema)
          .map(_.params.toScala.map(_.name))
          .getOrElse(Seq.empty)
        val values = Option(part.source.getValueSchema)
          .map(_.params.toScala.map(_.name))
          .getOrElse(Seq.empty)
        part.fullName -> (keys ++ values).toArray
      }.toMap
    case None => Map.empty
  }
val maxGap = totalDays - windowLengths.sum
val gapDays = RandomUtils.between(2, maxGap)
Handle potential negative maxGap to prevent exceptions. There's a risk that maxGap could be negative if totalDays is less than the sum of windowLengths. This would cause RandomUtils.between(2, maxGap) to throw an exception.

Consider adding a check to ensure maxGap is sufficient:
val windowLengths = List.fill(numWindows)(RandomUtils.between(3, 8))
val maxGap = totalDays - windowLengths.sum
+if (maxGap < 2) {
+ throw new IllegalArgumentException("Insufficient total days to generate non-overlapping windows with the given window lengths.")
+}
val gapDays = RandomUtils.between(2, maxGap)
Committable suggestion skipped: line range outside the PR's diff.
// note: the interval convention is left inclusive, right exclusive
// please see the test case to understand the algorithm better
// at a high level we are converting equally percentile points into histograms / distributions
@nikhil-zlai - dropped some comments based on todos etc. Feel free to bounce over
@@ -267,7 +267,8 @@ object Builders {
       samplePercent: Double = 100,
       consistencySamplePercent: Double = 5,
       tableProperties: Map[String, String] = Map.empty,
-      historicalBackill: Boolean = true
+      historicalBackill: Boolean = true,
nit: s/backill/backfill
done
@@ -37,7 +37,7 @@ object Constants {
   val ChrononDynamicTable = "chronon_dynamic_table"
   val ChrononOOCTable: String = "chronon_ooc_table"
   val ChrononLogTable: String = "chronon_log_table"
-  val ChrononMetadataKey = "ZIPLINE_METADATA"
+  val MetadataDataset = "chronon_metadata"
make uppercase for consistency?
done
@@ -62,5 +62,9 @@ object Constants {
   val LabelViewPropertyFeatureTable: String = "feature_table"
   val LabelViewPropertyKeyLabelTable: String = "label_table"
   val ChrononRunDs: String = "CHRONON_RUN_DS"

+  val TiledSummaryDataset: String = "tiled_summaries"
uppercase?
done
import scalatags.Text.all._
import scalatags.Text.tags2.title

// generates html / js code to serve a tabbed board on the network port
revert?
done
var result = 0.0

// debug only, remove before merging
val deltas = Array.ofDim[Double](aNormalized.length)
remove?
done
val delta = func(ai, bi)

// debug only remove before merging
deltas.update(i, delta)
here too - remove?
done - good catch!
_ match {
  case Success(responseContext) => Some(responseContext)
  // TODO instrument failures
  case Failure(exception) => exception.printStackTrace(); None
should we address in current PR? I can also pick up in my follow up
follow up?
works for me
getJoinConf(join).map { joinConf =>
  // TODO-explore: we might be over fetching if lookBack is much larger than end - start
  // if we can merge tile summary infos, we can call this twice with different ranges and merge
  getSummaries(joinConf.join, Some(startMs - lookBack.millis), Some(endMs)).map { tileSummaryInfos =>
should we also subtract lookback.millis from endMs? IIUC we want the full start to endMs time range shifted by lookback right?
So let's say we want drift between last week and today: we need summaries ranging from two weeks ago to today.
I was thinking we'd need two potentially disjointed timeframes. For example, if the user is looking at say the month of 2024-12-01 to 2025-01-01 and they ask for an offset of 3m. Don't we need one range of: 2024-09-01 to 2024-10-01 and the other range to be 2024-12-01 to 2025-01-01?
I think what you're getting at also works - it sounds like a lookback window to compute drift over, though the semantics / UX might be different. A case where the disjoint approach might showcase things better: say you want to see how payment features drifted over BFCM week last year vs. this year - you could use a 1- or 2-week time range with a 1-year offset, so we only compare the last 2 weeks of Nov last year with the last 2 weeks of Nov this year. That might make issues pop more explicitly than drift over the whole of the last year ending in Nov of this year?
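To make the two framings concrete, a rough sketch (illustrative only - these names are not the actual fetch API):

object FetchRangesSketch {
  // returns (lookback framing, disjoint framing) as lists of (startMs, endMs) fetch ranges
  def fetchRanges(startMs: Long, endMs: Long, offsetMs: Long): (Seq[(Long, Long)], Seq[(Long, Long)]) = {
    // lookback framing: one fetch covering [startMs - offsetMs, endMs]
    val lookback = Seq((startMs - offsetMs, endMs))
    // disjoint framing: the requested window plus the same-length window shifted back by the offset
    val disjoint = Seq((startMs - offsetMs, endMs - offsetMs), (startMs, endMs))
    (lookback, disjoint)
  }
}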
4cc9851 to 2eaced3
Actionable comments posted: 11
🧹 Outside diff range and nitpick comments (28)
hub/conf/routes (4)
6-7: Consider making the algorithm parameter optional and add documentation.
The new model timeseries endpoint looks good, but consider:
- Making algorithm an optional parameter like in the join/feature endpoints
- Adding route comments to document the purpose of each parameter

-# model prediction & model drift
-GET /api/v1/model/:id/timeseries controllers.TimeSeriesController.fetchModel(id: String, startTs: Long, endTs: Long, offset: String, algorithm: String)
+# Fetch model prediction & drift metrics
+# Parameters:
+#   id: Model identifier
+#   startTs: Start timestamp for the time window
+#   endTs: End timestamp for the time window
+#   offset: Time offset for the window
+#   algorithm: (Optional) Drift detection algorithm
+GET /api/v1/model/:id/timeseries controllers.TimeSeriesController.fetchModel(id: String, startTs: Long, endTs: Long, offset: String, algorithm: Option[String])
9-11: Enhance the endpoint documentation.
The join timeseries endpoint is well-structured, but the documentation could be more comprehensive.

-# all timeseries of a given join id
-# when metricType == "drift" - will return time series list of drift values
+# Fetch timeseries data for a specific join
+# Parameters:
+#   name: Join identifier
+#   startTs: Start timestamp for the time window
+#   endTs: End timestamp for the time window
+#   metricType: Type of metric to fetch (e.g., "drift" returns time series of drift values)
+#   metrics: Specific metrics to include
+#   offset: (Optional) Time offset for the window
+#   algorithm: (Optional) Drift detection algorithm
13-15: Clarify the relationship between joins and features.
The feature timeseries endpoint looks good, but:
- The comment "join -> seq(feature)" needs elaboration
- The purpose of the granularity parameter should be documented

-# join -> seq(feature)
-# when metricType == "drift" - will return time series list of drift values
+# Fetch timeseries data for a specific feature
+# Note: Features are components of joins, where one join contains multiple features
+# Parameters:
+#   name: Feature identifier
+#   startTs: Start timestamp for the time window
+#   endTs: End timestamp for the time window
+#   metricType: Type of metric to fetch (e.g., "drift" returns time series of drift values)
+#   metrics: Specific metrics to include
+#   granularity: Time granularity for the data points
+#   offset: (Optional) Time offset for the window
+#   algorithm: (Optional) Drift detection algorithm
18-26: Review the architectural approach for fine-grained endpoints.
The TODO section suggests breaking down the feature endpoint into more specific ones, but there are some concerns:
- All commented endpoints currently map to the same controller method with identical parameters
- Each specialized endpoint (percentiles, values, drift, etc.) likely needs different parameters and return types
Consider:
- Creating dedicated controller methods for each endpoint type
- Defining specific parameter sets appropriate for each operation
- Creating a tracking issue for this architectural change
Would you like me to help create a GitHub issue to track this architectural task?
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (1)
39-45: Consider consolidating schema validation
The schema validation could be more concise and maintainable.
Consider this approach:
private case class ColumnSpec(name: String, expectedType: DataType)

private val requiredColumns = Seq(
  ColumnSpec("keyBytes", BinaryType),
  ColumnSpec("valueBytes", BinaryType),
  ColumnSpec("timestamp", LongType)
)

requiredColumns.foreach { spec =>
  val idx = summaryDF.schema.fieldIndex(spec.name)
  require(summaryDF.schema(idx).dataType == spec.expectedType, s"${spec.name} must be ${spec.expectedType}")
}
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (4)
Line range hint 34-41: Add cleanup for SparkSession and database resources
Consider adding cleanup in an afterAll hook to ensure proper resource cleanup after tests complete.
class DriftTest extends AnyFlatSpec with Matchers {
+  override def afterAll(): Unit = {
+    try {
+      tableUtils.dropDatabase(namespace, cascade = true)
+      spark.stop()
+    } finally {
+      super.afterAll()
+    }
+  }
Line range hint 43-68: Add test assertions and extract constants
The test lacks assertions to verify the behavior and uses hardcoded values.
+ private val DefaultRowCount = 100000 + private val DefaultPartitions = 10 + "TestDataFrameGenerator" should "create a DataFrame with various column types including nulls" in { try { - val df = generateDataFrame(100000, 10, namespace) + val df = generateDataFrame(DefaultRowCount, DefaultPartitions, namespace) // Check if DataFrame is created - df.isEmpty shouldBe false + df.count() shouldBe DefaultRowCount df.show(10, truncate = false) val summarizer = new Summarizer("drift_test_basic") val (result, summaryExprs) = summarizer.computeSummaryDf(df) + result should not be null + summaryExprs should not be empty result.show()
Line range hint 70-81: Consider using a proper logging framework
Instead of using println with color formatting, consider using a proper logging framework like Logback or Log4j for better test output control and consistency across environments.
+ private val logger = org.slf4j.LoggerFactory.getLogger(getClass)
+
  def showTable(name: String)(implicit tableUtils: TableUtils): Unit = {
-   println(s"Showing table $name".yellow)
+   logger.info(s"Showing table: $name")
    val df = tableUtils.loadTable(name)
    val maxColNameLength = df.schema.fieldNames.map(_.length).max
    def pad(s: String): String = s.padTo(maxColNameLength, ' ')
    df.schema.fields.foreach { f =>
-     println(s" ${pad(f.name)} : ${f.dataType.typeName}".yellow)
+     logger.info(s" ${pad(f.name)} : ${f.dataType.typeName}")
    }
85-89: Extract test configuration to constants
Hardcoded values should be extracted to constants for better maintainability and configurability.
+ private object TestConfig {
+   val SampleSize = 100000
+   val StartDate = "2023-01-01"
+   val EndDate = "2023-01-30"
+ }
+
  "end_to_end" should "fetch prepare anomalous data, summarize, upload and fetch without failures" in {
    val prepareData = PrepareData(namespace)
    val join = prepareData.generateAnomalousFraudJoin
-   val df = prepareData.generateFraudSampleData(100000, "2023-01-01", "2023-01-30", join.metaData.loggedTable)
+   val df = prepareData.generateFraudSampleData(
+     TestConfig.SampleSize,
+     TestConfig.StartDate,
+     TestConfig.EndDate,
+     join.metaData.loggedTable
+   )
online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (2)
16-19: Remove outdated TODO comment
The TODO comment about TileDrift => TileDriftSeries can be removed since the implementation is already present in this file.
141-156: Improve readability of long method chains
The long method chains could be more readable with proper formatting. Consider breaking them into multiple lines with proper indentation.
Example formatting:
- .setNullRatioChangePercentSeries(doubleIterator(_.isSetNullRatioChangePercent, _.getNullRatioChangePercent) |> collectDoubles)
+ .setNullRatioChangePercentSeries(
+   doubleIterator(
+     _.isSetNullRatioChangePercent,
+     _.getNullRatioChangePercent
+   ) |> collectDoubles
+ )
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (5)
35-35: Enhance error message for better debugging.
Consider providing more context in the error message to help diagnose configuration issues.
-require(tileSize.nonEmpty, s"Drift tile size not set for join $joinName")
+require(tileSize.nonEmpty, s"Drift tile size configuration missing for join '$joinName'. Please ensure drift_tile_size is set in the join metadata.")
61-65: Add validation for empty summaries array.
The toDriftSeries method should validate the input array to handle edge cases gracefully.
def toDriftSeries(driftMetric: DriftMetric, lookBack: Window, startMs: Long): TileDriftSeries = {
+ require(summaries.nonEmpty, s"Cannot create drift series for empty summaries array with key: ${key}")
  val driftsArray = TileDriftCalculator.toTileDrifts(summaries, driftMetric, startMs, lookBack)
  val result = PivotUtils.pivot(driftsArray)
  result.setKey(key)
}
114-114: Address TODO for failure instrumentation.
The TODO comment indicates missing failure metrics. Consider adding metrics to track:
- Failure rates
- Error types
- Latency distribution
Would you like me to help implement the failure instrumentation using a metrics framework?
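For illustration, a minimal sketch of what that instrumentation could look like (only slf4j logging is assumed to exist; the commented-out counter is a hypothetical metrics hook):

object FetchInstrumentation {
  import org.slf4j.LoggerFactory
  import scala.util.{Failure, Success, Try}

  private val logger = LoggerFactory.getLogger(getClass)

  def toOption[T](result: Try[T]): Option[T] = result match {
    case Success(value) => Some(value)
    case Failure(exception) =>
      // structured log instead of printStackTrace, so failures show up in log aggregation
      logger.error("drift summary fetch failed", exception)
      // failureCounter.increment()  // hypothetical metrics hook
      None
  }
}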
138-140: Consider optimizing data fetching strategy.
The TODO comment raises a valid concern about over-fetching. Consider implementing a strategy to:
- Split the fetch into two ranges
- Merge the results efficiently
- Cache frequently accessed ranges
Would you like help designing an optimized fetching strategy?
165-166: Document the compact protocol bug.
The comment warns about a bug but lacks details. Please document:
- The specific issue with compact protocol
- Any related bug tickets or references
- Potential future migration path
online/src/main/scala/ai/chronon/online/stats/Display.scala (2)
12-25: Consider implementing a consistent color palette
The case class hierarchy is well-designed, but the random color generation in chart rendering (used later) could lead to readability issues or poor color contrast. Consider implementing a predefined color palette for better visualization consistency.
Example implementation:
private val ColorPalette = Array(
  "rgb(66, 133, 244)", // Google Blue
  "rgb(219, 68, 55)",  // Google Red
  "rgb(244, 180, 0)",  // Google Yellow
  "rgb(15, 157, 88)",  // Google Green
  // Add more colors as needed
)
// Then use ColorPalette(index % ColorPalette.length) instead of random colors
1-205: Consider adding error handling and monitoring capabilities
As this is part of the drift metrics implementation, consider these architectural improvements:
- Add error handling for failed chart rendering
- Implement monitoring endpoints for server health
- Consider making the visualization dynamic for real-time drift metrics updates
Would you like me to provide a detailed implementation plan for these improvements?
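As a starting point, a minimal sketch of a health endpoint using cask (route name and port here are illustrative, not part of the current Display.scala):

object HealthServer extends cask.MainRoutes {
  override def port: Int = 8081

  // simple liveness probe for the board server
  @cask.get("/health")
  def health(): String = "ok"

  initialize()
}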
spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (2)
Line range hint
142-156
: Improve robustness of async handling inputStreamingNew
The current implementation has several concerns:
- The hardcoded sleep duration of 5000ms may be insufficient for larger datasets
- No error handling for async operations
- Vague documentation about the async behavior
Consider:
- Making the sleep duration configurable
- Adding proper error handling
- Improving documentation about the async behavior
def putStreamingNew(originalGroupByConf: api.GroupBy, ds: String, namespace: String, kvStoreFunc: () => KVStore, - debug: Boolean)(implicit session: SparkSession): Unit = { + debug: Boolean, + asyncWaitMs: Long = 5000)(implicit session: SparkSession): Unit = { implicit val mockApi = new MockApi(kvStoreFunc, namespace) val groupByConf = originalGroupByConf.deepCopy() val source = groupByConf.streamingSource.get mutateTopicWithDs(source, ds) val groupByStreaming = new JoinSourceRunner(groupByConf, Map.empty, debug = debug, lagMillis = 0) val query = groupByStreaming.chainedStreamingQuery.trigger(Trigger.Once()).start() - // drain work scheduled as futures over the executioncontext - Thread.sleep(5000) - // there is async stuff under the hood of chained streaming query - query.awaitTermination() + try { + // Wait for async operations in the chained streaming query to complete + // These operations include future completions in the execution context + Thread.sleep(asyncWaitMs) + query.awaitTermination() + } catch { + case e: Exception => + query.stop() + throw new RuntimeException(s"Failed to process streaming query: ${e.getMessage}", e) + } }Also, consider adding more context to the TODO comment about deprecating
putStreaming
:-// TODO - deprecate putStreaming +// TODO: Deprecate putStreaming in favor of putStreamingNew which provides better support for join sources +// Timeline: Remove in next major version
Line range hint
1-187
: Enhance code documentation and consistencyThe test utilities would benefit from:
- ScalaDoc comments for public methods explaining their purpose and usage in tests
- Consistent error handling patterns across streaming operations
- Clear documentation of test setup requirements
Consider adding ScalaDoc comments for key methods, for example:
/** Processes streaming data with join source support. * * @param originalGroupByConf The group by configuration to process * @param ds The dataset identifier * @param namespace The namespace for the mock API * @param kvStoreFunc Function to create KVStore instances * @param debug Enable debug logging * @param session Implicit SparkSession */ def putStreamingNew(...)Also consider:
- Creating a common error handling trait/interface for streaming operations
- Adding integration tests for the different streaming scenarios
- Documenting the test data requirements and setup in the README
spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (2)
111-137
: Consider extracting common conversion patternsThe metric population logic has repeated patterns for converting Scala collections to Java collections. Consider extracting these conversions into helper methods for better maintainability.
+ private object Conversions { + def toJavaDoubles(seq: Seq[Double]): java.util.List[java.lang.Double] = + seq.map(java.lang.Double.valueOf).toJava + + def toJavaIntegers(seq: Seq[Int]): java.util.List[java.lang.Integer] = + seq.map(java.lang.Integer.valueOf).toJava + } case MetricName.percentiles => (row: Row, summaries: TileSummary) => - summaries.setPercentiles(row.getSeq[Double](index).map(lang.Double.valueOf).toJava) + summaries.setPercentiles(Conversions.toJavaDoubles(row.getSeq[Double](index))) case MetricName.lengthPercentiles => (row: Row, summaries: TileSummary) => - summaries.setLengthPercentiles(row.getSeq[Int](index).map(lang.Integer.valueOf).toJava) + summaries.setLengthPercentiles(Conversions.toJavaIntegers(row.getSeq[Int](index)))
Line range hint
173-174
: Consider addressing TODOs for map key cardinalityThere are TODO comments about handling map key cardinality and heavy hitters. Since this PR is about drift metrics, would you like me to create GitHub issues to track these improvements?
spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (1)
Line range hint
1-324
: Consider enhancing test coverage and documentationAs this is a WIP PR implementing drift metrics functionality:
The code handles critical time series data processing and serialization. Consider adding:
- Unit tests for edge cases in time window calculations
- Integration tests for the entire drift metrics pipeline
- Performance benchmarks for large datasets
Documentation improvements needed:
- Add scaladoc comments explaining the drift metrics algorithm
- Document the expected format of input/output data
- Include examples of typical usage patterns
Would you like me to help generate:
- Unit test templates for the
Summarizer
andSummaryPacker
classes?- Documentation templates with usage examples?
spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (2)
132-138
: Consider enhancing the DriftSpec configurationThe current configuration has some potential improvements:
- Consider using Scala collections instead of Java's ArrayList
- The single 30-minute lookback window might be limiting for drift detection
- The hard-coded values would benefit from named constants or configuration parameters
- val driftSpec = new DriftSpec() - driftSpec.setTileSize(new Window(30, TimeUnit.MINUTES)) - driftSpec.setDriftMetric(DriftMetric.JENSEN_SHANNON) - val windows = new java.util.ArrayList[Window]() - windows.add(new Window(30, TimeUnit.MINUTES)) - driftSpec.setLookbackWindows(windows) + val DefaultTileSize = 30 + val DefaultLookbackWindows = Seq( + new Window(30, TimeUnit.MINUTES), + new Window(1, TimeUnit.HOURS), + new Window(4, TimeUnit.HOURS) + ) + + val driftSpec = new DriftSpec() + .setTileSize(new Window(DefaultTileSize, TimeUnit.MINUTES)) + .setDriftMetric(DriftMetric.JENSEN_SHANNON) + .setLookbackWindows(DefaultLookbackWindows.asJava)
299-311
: Enhance schema validation error messageWhile the validation is thorough, the error message could be more descriptive.
- require(matching == expected, s"Schema mismatch: expected $expected, got $actual. Missing: $missing, Extra: $extra") + require( + matching == expected, + s"""Schema validation failed: + |Expected columns: ${expected.toSeq.sorted.mkString(", ")} + |Actual columns: ${actual.toSeq.sorted.mkString(", ")} + |Missing columns: ${missing.toSeq.sorted.mkString(", ")} + |Extra columns: ${extra.toSeq.sorted.mkString(", ")} + |""".stripMargin + )api/thrift/api.thrift (3)
Line range hint
253-273
: Add comprehensive documentation for TileSummary and TileSeriesKey structs.The structures would benefit from detailed documentation explaining:
- The purpose and use cases of each field
- The relationship between these structures
- The meaning of "inner" counts vs regular counts
- The significance of percentiles and histograms in the context of drift detection
+/** + * Statistical summary of a data tile, containing various metrics used for drift detection. + * A tile represents a snapshot of data within a specific time window. + */ struct TileSummary { + // Percentile distribution of numeric values + // Frequency distribution of categorical values + // Total number of records + // Number of null values + // Metrics specific to container types (lists, maps, etc.) + // Metrics specific to string values } +/** + * Unique identifier for a tile series, combining column, slice, and grouping information + */ struct TileSeriesKey {
275-292
: Consider restructuring TileSummarySeries for better maintainability.The current structure duplicates the fields from TileSummary. Consider using a list of TileSummary objects instead, which would:
- Reduce code duplication
- Make it easier to maintain (changes to TileSummary automatically reflect here)
- Improve code readability
// array of tuples of (TileSummary, timestamp) ==(pivot)==> TileSummarySeries struct TileSummarySeries { - 1: optional list<list<double>> percentiles - 2: optional map<string, list<i64>> histogram - 3: optional list<i64> count - 4: optional list<i64> nullCount - 5: optional list<i64> innerCount - 6: optional list<i64> innerNullCount - 7: optional list<list<i32>> lengthPercentiles - 8: optional list<list<i32>> stringLengthPercentiles + 1: optional list<TileSummary> summaries }
344-356
: Enhance documentation for DriftSpec configuration.The DriftSpec structure would benefit from more detailed documentation explaining:
- The relationship between
tileSize
andlookbackWindows
- The impact of different drift metrics and their thresholds
- Best practices for setting cardinality hints
struct DriftSpec { + /** + * Additional dimensions for data summarization beyond column & slice. + * Currently limited to one slice. + */ + /** + * Custom expressions for monitoring specific scenarios: + * - Conditional metrics: e.g., "IF(merchant = 'shopify', txn_amount, NULL)" + * - Derived ratios: e.g., "likes_over_dislikes = IF(dislikes > likes, 1, 0)" + */ + /** + * Override automatic cardinality detection for specific columns. + * Useful when the training data doesn't reflect the production cardinality. + */ + /** + * Size of the time window for computing tile summaries. + */ + /** + * Windows for comparing current tile summary with historical summaries. + * Multiple windows allow detection of both short-term and long-term drift. + */ + /** + * Default drift metric with industry-standard thresholds: + * - Moderate drift: 0.05 - 0.1 + * - Severe drift: > 0.1 + */ }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (47)
- .github/workflows/test_scala_and_python.yaml (2 hunks)
- api/src/main/scala/ai/chronon/api/Builders.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/Constants.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/Extensions.scala (2 hunks)
- api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (2 hunks)
- api/thrift/api.thrift (4 hunks)
- build.sbt (1 hunks)
- cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2 hunks)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1 hunks)
- hub/app/controllers/TimeSeriesController.scala (4 hunks)
- hub/app/model/Model.scala (1 hunks)
- hub/app/store/DynamoDBMonitoringStore.scala (1 hunks)
- hub/conf/routes (1 hunks)
- online/src/main/scala/ai/chronon/online/Api.scala (3 hunks)
- online/src/main/scala/ai/chronon/online/Fetcher.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/FetcherBase.scala (2 hunks)
- online/src/main/scala/ai/chronon/online/MetadataStore.scala (3 hunks)
- online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/Display.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala (1 hunks)
- online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/Driver.scala (0 hunks)
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/SummaryJob.scala (0 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala (0 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (4 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/ChainingFetcherTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/FetcherTest.scala (5 hunks)
- spark/src/test/scala/ai/chronon/spark/test/InMemoryKvStore.scala (3 hunks)
- spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala (0 hunks)
- spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/SchemaEvolutionTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/StreamingTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/DerivationTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (3 hunks)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (11 hunks)
💤 Files with no reviewable changes (4)
- spark/src/main/scala/ai/chronon/spark/Driver.scala
- spark/src/main/scala/ai/chronon/spark/stats/SummaryJob.scala
- spark/src/main/scala/ai/chronon/spark/stats/drift/DistanceMetrics.scala
- spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala
🚧 Files skipped from review as they are similar to previous changes (28)
- api/src/main/scala/ai/chronon/api/Builders.scala
- api/src/main/scala/ai/chronon/api/Constants.scala
- api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala
- build.sbt
- cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala
- hub/app/model/Model.scala
- hub/app/store/DynamoDBMonitoringStore.scala
- online/src/main/scala/ai/chronon/online/Api.scala
- online/src/main/scala/ai/chronon/online/Fetcher.scala
- online/src/main/scala/ai/chronon/online/FetcherBase.scala
- online/src/main/scala/ai/chronon/online/MetadataStore.scala
- online/src/main/scala/ai/chronon/online/stats/AssignIntervals.scala
- online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala
- online/src/test/scala/ai/chronon/online/test/stats/AssignIntervalsTest.scala
- online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala
- online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala
- spark/src/main/scala/ai/chronon/spark/SparkSessionBuilder.scala
- spark/src/test/scala/ai/chronon/spark/test/ChainingFetcherTest.scala
- spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala
- spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala
- spark/src/test/scala/ai/chronon/spark/test/FetcherTest.scala
- spark/src/test/scala/ai/chronon/spark/test/InMemoryKvStore.scala
- spark/src/test/scala/ai/chronon/spark/test/SchemaEvolutionTest.scala
- spark/src/test/scala/ai/chronon/spark/test/StreamingTest.scala
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/DerivationTest.scala
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala
🧰 Additional context used
📓 Learnings (2)
spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (1)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#17
File: spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala:76-102
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In performance-critical sections of the codebase, prefer mutable data structures over immutable ones for better performance.
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (3)
Learnt from: chewy-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:37-40
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Avoid using `Await.result` in production code; prefer handling `Future`s asynchronously when possible to prevent blocking.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Spark applications, when defining serializable classes, passing an implicit `ExecutionContext` parameter can cause serialization issues. In such cases, it's acceptable to use `scala.concurrent.ExecutionContext.Implicits.global`.
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:19-47
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In Scala, the `grouped` method on collections returns an iterator, allowing for efficient batch processing without accumulating all records in memory.
🪛 actionlint
.github/workflows/test_scala_and_python.yaml
52-52: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (34)
hub/conf/routes (1)
Line range hint 1-26: Changes align well with PR objectives.
The removal of slice-specific endpoints (as mentioned in the previous review) and the introduction of new timeseries endpoints align well with the PR's drift metrics focus. Since this is a WIP PR, consider:
- Adding integration tests for the new endpoints
- Updating API documentation to reflect these changes
spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala (2)
16-21: Parameter requestsPerSecondPerExecutor is defined but not used
The rate limiting functionality is still marked as TODO and not implemented, despite having the parameter in place.
21-22: Consider serialization implications of implicit TableUtils
The class extends Serializable but takes an implicit TableUtils parameter. This could lead to serialization issues if TableUtils or its dependencies are not serializable.
.github/workflows/test_scala_and_python.yaml (2)
52-52:
Critical: Invalid runner label will break CI
The runner label ubuntu-8_cores-32_gb is not a standard GitHub-hosted runner. This will cause the CI workflow to fail.
For GitHub-hosted runners with more resources, use:
- ubuntu-latest-8-cores for an 8 core runner
- ubuntu-latest-16-cores for a 16 core runner
Apply this diff to use a valid GitHub-hosted runner:
- runs-on: ubuntu-8_cores-32_gb
+ runs-on: ubuntu-latest-8-cores
If you're using self-hosted runners:
- Ensure the runner label is properly configured
- Add an actionlint.yaml to specify custom runner labels
Likely invalid or redundant comment.
🧰 Tools
🪛 actionlint
52-52: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
67-67: Verify the significant memory increase
The SBT max heap size has been increased from 8G to 24G, which is a significant 3x increase. While this might be necessary for the new drift metrics calculations:
- Please provide justification for this large increase
- Consider if this is the minimum required memory
- Document any observed OOM issues that necessitated this change
✅ Verification successful
Based on the gathered information, I can now provide a final response:
Memory increase appears justified due to data-intensive operations
The codebase shows extensive use of memory-intensive Spark operations across multiple components:
- Large-scale data processing with DataFrames/Datasets in drift metrics calculation (
spark/src/main/scala/ai/chronon/spark/stats/drift/
) - Multiple caching and persistence operations across various components
- Complex join operations and analytics in test suites
The change history shows this was increased from 8G to 24G, and given the extensive data processing evident in the codebase, particularly around drift metrics and analytics, this memory increase appears necessary to handle the workload reliably.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for memory-related issues in recent test runs
# Look for OOM errors in recent workflow runs
gh run list --workflow=test_scala_and_python.yaml --limit 10 --json conclusion,url | \
jq -r '.[] | select(.conclusion=="failure") | .url' | \
while read -r url; do
gh run view "$url" --log | grep -i "out of memory\|java.lang.OutOfMemoryError"
done
# Check test files for memory-intensive operations
rg -l "drift.*metrics" --type scala
Length of output: 334
Script:
#!/bin/bash
# Let's try a different approach to understand memory usage patterns
# Check git history for memory-related changes in the workflow file
git log -p -- .github/workflows/test_scala_and_python.yaml | grep -B5 -A5 "SBT_OPTS.*-Xmx"
# Look for large data processing or memory-intensive operations
rg -l "DataFrame|Dataset|collect|cache|persist" --type scala
# Search for test configurations and memory settings
fd "\.conf$|\.properties$|\.sbt$" --exec grep -l "memory|heap|Xmx|Xms" {} \;
# Look for potential memory-intensive test files
fd "Test\.scala$|Spec\.scala$" --exec grep -l "test\|benchmark\|performance" {} \;
Length of output: 16930
online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1)
90-91: Consider performance optimization for large datasets.
The current implementation creates a map of all summaries, which could be memory-intensive for large datasets. Consider implementing the single-pass solution mentioned in the TODO comment.
Let's check if there are any large datasets being processed:
online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (4)
1-15
: LGTM! Clean import organization
Imports are well-structured with clear aliasing for Java types and proper Scala collection conversions.
77-79: LGTM! Clean pipe operator implementation
The PipeOps implementation provides a clean way to chain operations, improving code readability.
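For context, a generic sketch of the pipe-operator pattern being praised here (the actual PivotUtils implementation may differ in naming and placement):

object PipeOpsSketch {
  // feeds the value on the left into the function on the right: a |> f == f(a)
  implicit class PipeOps[A](private val self: A) extends AnyVal {
    def |>[B](f: A => B): B = f(self)
  }
}

// usage: import PipeOpsSketch._ ; Seq(1.0, 2.0, 3.0) |> (_.sum)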
81-89
: LGTM! Robust input handling and sorting
The implementation properly handles empty input and ensures consistent ordering through timestamp sorting.
122-130
: LGTM! Consistent implementation pattern
The implementation follows the same robust pattern as the TileSummary pivot, ensuring consistent behavior across different types.
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (3)
1-27
: LGTM: Class structure and imports are well-organized.
The class definition is clean with appropriate inheritance and constructor parameters.
148-159
: LGTM: Clean implementation of summary series retrieval.
The method is well-structured and maintains a consistent pattern with getDriftSeries.
140-140: Verify startMs calculation with lookback window.
The current implementation subtracts lookback.millis from startMs. Please verify if this provides the correct time range for drift calculation.
✅ Verification successful
These scripts will help us understand:
- How lookback windows are generally handled
- Time window calculations patterns
- The broader context of startMs usage
- The DriftStore class implementation details
Based on the search results, I can now provide a conclusive response about the lookback window calculation verification:
The lookback window calculation is correctly implemented.
The implementation is verified to be correct based on the following evidence:
- In TileDriftCalculator.scala, we can see the drift calculation logic that confirms this pattern:
  - It uses lookBackMs = lookBackWindow.millis
  - For each timestamp ts, it calculates previousTs = ts - lookBackMs
  - This matches with the DriftStore's approach of fetching data from startMs - lookBack.millis
- The implementation ensures that:
  - The store fetches enough historical data by starting from startMs - lookBack.millis
  - The calculator then uses this data to compute drifts by comparing each point with its corresponding lookback point
  - The final drift series starts from the requested startMs (as seen in the filter condition ts >= startMs in TileDriftCalculator)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for other usages of lookback window calculations
rg -l "lookback.millis" | xargs rg "startMs.*lookback"
Length of output: 54
Script:
#!/bin/bash
# Let's try a broader search for lookback window calculations
rg -A 5 "lookBack"
Length of output: 6329
Script:
#!/bin/bash
# Search for startMs usage patterns
rg -A 5 "startMs.*-"
Length of output: 1418
Script:
#!/bin/bash
# Search for time window calculations in the codebase
ast-grep --pattern 'val $_ = $_ - $_'
Length of output: 1816
Script:
#!/bin/bash
# Let's check the context of the DriftStore class
rg -A 10 "class DriftStore"
Length of output: 1279
online/src/main/scala/ai/chronon/online/stats/Display.scala (1)
1-6
: LGTM: Imports are appropriate
The imports are correctly chosen for web server implementation and HTML generation using cask and scalatags.
spark/src/test/scala/ai/chronon/spark/test/OnlineUtils.scala (1)
159-159: Verify the impact of metadata dataset changes
The constant change from ChrononMetadataKey to MetadataDataset might affect other parts of the system.
Let's verify the usage of these constants:
✅ Verification successful
Constant change from ChrononMetadataKey to MetadataDataset is consistent across the codebase
The search results show:
- No references to the old constant ChrononMetadataKey exist in the codebase
- The new constant MetadataDataset is consistently used across all relevant files
- The constant is properly defined in api/src/main/scala/ai/chronon/api/Constants.scala as "chronon_metadata"
- All KVStore creation patterns using this constant are aligned
Script:
#!/bin/bash
# Search for any remaining references to the old constant
rg "ChrononMetadataKey"
# Search for all usages of the new constant to ensure consistency
rg "MetadataDataset"
# Search for any KVStore creation patterns to ensure alignment
ast-grep --pattern 'inMemoryKvStore.create($CONST)'
Length of output: 4108
spark/src/main/scala/ai/chronon/spark/stats/drift/Expressions.scala (2)
Line range hint 74-98: LGTM! Well-implemented mutable collection usage
The implementation correctly uses mutable collections for better performance, as suggested by previous learnings. The getOrElseUpdate pattern is an idiomatic Scala approach for lazy initialization.
73-73: Verify the impact of TileRow's breaking change
The change from TileSummaries to TileSummary in the case class is a breaking change that could affect downstream code.
spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (2)
5-5
: LGTM: Import consolidation looks good
The wildcard import from ai.chronon.api._
simplifies the import section while maintaining access to required entities.
258-258: LGTM: Serializer type simplification
The removal of explicit type declaration for serializer is safe here as Scala's type inference will correctly determine the type from the imported compactSerializer.
Let's verify the imported serializer's usage across the codebase:
✅ Verification successful
Let me analyze the results and generate additional verification scripts to ensure we have a complete understanding.
Type inference for serializer is correct and safe
The verification confirms that:
- The
compactSerializer
is correctly imported fromDriftStore
DriftStore.compactSerializer
returns aSerializableSerializer
type- The serializer is used in a consistent way for serializing tile rows in the RDD transformation
- There's even a documented reason for using
TBinaryProtocol.Factory()
instead of compact protocol
The removal of explicit type declaration is indeed safe as Scala will infer the correct SerializableSerializer
type from DriftStore.compactSerializer
.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of compactSerializer from DriftStore
# Expected: All serializer instantiations should use DriftStore.compactSerializer
# Search for any remaining SerializableSerializer declarations
rg "SerializableSerializer"
# Search for compactSerializer usage
rg "compactSerializer"
# Search for any custom serializer implementations
ast-grep --pattern 'class $name extends Serializer { $$$ }'
Length of output: 1353
Script:
#!/bin/bash
# Let's verify the actual usage context of serializer in Summarizer.scala
rg -B 5 -A 5 "val serializer = compactSerializer" spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala
# Check if there are any other serializer-related imports in Summarizer
rg "import.*[Ss]erializer" spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala
# Verify the type signature of compactSerializer in DriftStore
rg -B 2 -A 2 "def compactSerializer" online/src/main/scala/ai/chronon/online/stats/DriftStore.scala
Length of output: 1043
spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala (4)
7-10
: LGTM: Clean imports and well-structured field declarations
The new imports and field declarations are well-organized and follow Scala naming conventions.
Also applies to: 43-46
142-151
: LGTM: Well-structured join configuration
The join configuration is clean and properly integrates the drift metrics functionality through metadata.
Line range hint 241-293
: LGTM: Clean schema implementation with clear categorization
The schema implementation is well-structured with:
- Reusable field prefixing through implicit class
- Clear categorization of fields with helpful comments
- Consistent naming conventions
176-177
: Previous concern about negative maxGap is still valid
The potential for negative maxGap leading to runtime exceptions hasn't been addressed yet.
api/thrift/api.thrift (1)
295-314
:
Fix typos and standardize naming pattern.
Issues found:
- Typo in field name:
percentilDrift
should bepercentileDrift
- Inconsistent naming pattern: some fields use 'Change' while others use 'Drift'
Apply this diff to fix the issues:
struct TileDrift {
- 1: optional double percentilDrift
+ 1: optional double percentileDrift
- 3: optional double countChangePercent
- 4: optional double nullRatioChangePercent
+ 3: optional double countDrift
+ 4: optional double nullRatioDrift
- 5: optional double innerCountChangePercent
- 6: optional double innerNullCountChangePercent
+ 5: optional double innerCountDrift
+ 6: optional double innerNullCountDrift
}
api/src/main/scala/ai/chronon/api/Extensions.scala (7)
161-169
: LGTM! Well-structured null handling.
The implementation uses idiomatic Scala patterns for safe null handling and provides a sensible default through Constants.
891-901
: Reiterating past review comment: Handle potential null values.
902-912
: Reiterating past review comment: Handle potential null names.
914-921
: LGTM! Clean functional implementation.
The method follows Scala best practices with proper null handling and functional composition.
922-937
: LGTM! Well-structured column name generation.
The implementation properly handles null cases and follows a clear pattern for generating column names.
939-947
: Reiterating past review comment: Avoid mapValues.
949-953
: Reiterating past review comment: Ensure safe collection access.
hub/app/controllers/TimeSeriesController.scala (3)
146-146: Previous comment still valid: Inconsistent error messages for drift algorithms
The error message here has been updated, but similar error messages elsewhere in the file remain outdated (e.g., line 178). This inconsistency could confuse users about available drift algorithm options. Please update all similar error messages throughout the file to consistently show "Expect JSD, PSI, or Hellinger".
237-240: Previous comment still valid: Improve JSON structure documentation
The comments describing the JSON structure are not properly formatted and lack context. Consider replacing them with properly formatted and detailed documentation to enhance code readability and maintainability.
283-290: Previous comment still valid: Improve error handling in parseAlgorithm method
The current implementation has several issues:
- Using throw in a parse method is not idiomatic Scala.
- Case matching could be simplified.
- Error message is inconsistent with other messages in the file.
try {
  api.genKvStore.create(datasetName)
} catch {
  // swallows all exceptions right now
  // TODO: swallow only already existing exception - move this into the kvstore.createIfNotExists
  case e: Exception => e.printStackTrace()
}
Improve exception handling in dataset creation
The current implementation swallows all exceptions, which could hide critical issues. The TODO comment suggests this needs to be refined.
Consider this approach:
- try {
- api.genKvStore.create(datasetName)
- } catch {
- // swallows all exceptions right now
- // TODO: swallow only already existing exception - move this into the kvstore.createIfNotExists
- case e: Exception => e.printStackTrace()
- }
+ try {
+ api.genKvStore.create(datasetName)
+ } catch {
+ case e: KVStoreException if e.getMessage.contains("already exists") => // Dataset already exists, ignore
+ case e: Exception =>
+ throw new RuntimeException(s"Failed to create dataset $datasetName: ${e.getMessage}", e)
+ }
Committable suggestion skipped: line range outside the PR's diff.
summaryDF.rdd.foreachPartition((rows: Iterator[Row]) => { | ||
val kvStore: KVStore = api.genKvStore | ||
|
||
val putRequests = new scala.collection.mutable.ArrayBuffer[PutRequest] | ||
for (row <- rows) { | ||
putRequests += PutRequest( | ||
Option(row.getAs[Array[Byte]]("keyBytes")).getOrElse(Array.empty[Byte]), | ||
Option(row.getAs[Array[Byte]]("valueBytes")).getOrElse(Array.empty[Byte]), | ||
statsTableName, | ||
Option(row.getAs[Long]("timestamp")) | ||
) | ||
def toPutRequest(row: Row): PutRequest = { | ||
if (row.isNullAt(keyIndex)) return null | ||
val keyBytes = row.getAs[Array[Byte]](keyIndex) | ||
val valueBytes = if (row.isNullAt(valueIndex)) Array.empty[Byte] else row.getAs[Array[Byte]](valueIndex) | ||
val timestamp = | ||
if (timestampIndex < 0 || row.isNullAt(timestampIndex)) None else Some(row.getAs[Long](timestampIndex)) | ||
PutRequest(keyBytes, valueBytes, datasetName, timestamp) | ||
} | ||
|
||
val futureResults = putRequests.grouped(putsPerRequest).map { batch => | ||
kvStore | ||
.multiPut(batch.toList) | ||
.map { result => | ||
if (!result.forall(identity)) { | ||
throw new RuntimeException(s"Failed to put ${result.count(!_)} records") | ||
} | ||
} | ||
.recover { | ||
case e => | ||
throw new RuntimeException(s"Failed to put batch: ${e.getMessage}", e) | ||
} | ||
} | ||
// TODO implement rate limiting | ||
val putResponses: Iterator[Future[Seq[Boolean]]] = rows | ||
.grouped(putsPerRequest) | ||
.map { _.map(toPutRequest).filter(_ != null).toArray } | ||
.map { requests => kvStore.multiPut(requests) } | ||
|
||
val aggregatedFuture = Future.sequence(futureResults.toSeq) | ||
val aggregatedFuture = Future.sequence(putResponses.toSeq).map(_.flatten) | ||
aggregatedFuture.onComplete { | ||
case scala.util.Success(_) => // All operations completed successfully | ||
case scala.util.Failure(e: IllegalArgumentException) => | ||
throw new IllegalArgumentException(s"Invalid request data: ${e.getMessage}", e) | ||
case scala.util.Failure(e: java.io.IOException) => | ||
throw new RuntimeException(s"KVStore I/O error: ${e.getMessage}", e) | ||
case scala.util.Failure(e) => | ||
case Success(_) => // All operations completed successfully | ||
case Failure(e) => | ||
throw new RuntimeException(s"Failed to upload summary statistics: ${e.getMessage}", e) | ||
} | ||
}) |
💡 Codebase verification
Async operations are safe, but rate limiting should be implemented
The review comment's concerns about async operations and rate limiting can be addressed based on the codebase analysis:
-
The async operations within
foreachPartition
are actually safe in the current implementation. TheFuture.sequence
followed byonComplete
ensures proper handling of failures, and the task won't complete until all futures are resolved. -
Rate limiting should indeed be implemented, following the pattern used in
DynamoDBKVStoreImpl
which demonstrates a proper implementation usingRateLimiter
with separate limiters for read and write operations. -
The error handling in the current implementation is sufficient for the use case, as it captures and propagates all failures with proper context.
Suggested implementation for rate limiting:
private val rateLimiter = RateLimiter.create(requestsPerSecondPerExecutor)
// In the existing code, before kvStore.multiPut:
rateLimiter.acquire(1)
🔗 Analysis chain
Ensure completion of async operations and implement rate limiting
The current implementation has several concerns:
- Async operations within
foreachPartition
might not complete before the task ends - Rate limiting is not implemented despite having the parameter
- Error handling could be more specific
Consider this implementation:
import scala.concurrent.duration._
import com.google.common.util.concurrent.RateLimiter
summaryDF.rdd.foreachPartition((rows: Iterator[Row]) => {
val kvStore: KVStore = api.genKvStore
val rateLimiter = RateLimiter.create(requestsPerSecondPerExecutor)
def toPutRequest(row: Row): PutRequest = {
// ... existing toPutRequest implementation ...
}
try {
val putResponses = rows
.grouped(putsPerRequest)
.map { batch =>
// Rate limit the requests
rateLimiter.acquire(1)
val requests = batch.map(toPutRequest).filter(_ != null).toArray
val future = kvStore.multiPut(requests)
// Wait for each batch to complete with timeout
Await.result(future, 30.seconds)
}
.toSeq
// Ensure all batches completed successfully
if (putResponses.exists(_.exists(!_))) {
throw new RuntimeException("Some put requests failed")
}
} catch {
case e: TimeoutException =>
throw new RuntimeException(s"Request timed out: ${e.getMessage}", e)
case e: IllegalArgumentException =>
throw new IllegalArgumentException(s"Invalid request data: ${e.getMessage}", e)
case e: Exception =>
throw new RuntimeException(s"Failed to upload summary statistics: ${e.getMessage}", e)
}
})
This implementation:
- Uses rate limiting to control request rate
- Ensures each batch completes before processing the next
- Provides more specific error handling
- Sets a timeout for each batch to prevent indefinite waiting
Note: While we generally avoid Await.result
, it's necessary here to ensure completion within foreachPartition
.
Run this script to check for potential rate limiting implementations in similar classes:
Let me check the rate limiting implementation in the DynamoDB store to better understand the existing patterns.
Let me check one more thing to understand how the async operations are handled in the DynamoDB implementation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for rate limiting implementations in similar classes
rg -l "RateLimiter" --type scala
ast-grep --pattern 'class $_ extends $_ { $$$
val rateLimiter = RateLimiter.create($_)
$$$
}'
Length of output: 211
Script:
#!/bin/bash
# Check the rate limiting implementation in DynamoDBKVStoreImpl
rg -A 10 "RateLimiter" cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala
# Also check if there are any other rate limiting related configurations
rg -i "rate.*limit|throttl" --type scala
Length of output: 5167
Script:
#!/bin/bash
# Check how futures and async operations are handled in DynamoDBKVStoreImpl
rg -B 5 -A 5 "Future.sequence" cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala
# Also check if there are any await patterns used
rg "Await" --type scala
Length of output: 7960
if (a.isSetCount && b.isSetCount && b.getCount != 0.0) {
  // preserving sign, since direction of change is usually relevant
  val change = (a.getCount - b.getCount).toDouble / b.getCount
  result.setCountChangePercent(change * 100)
}
🛠️ Refactor suggestion
Enhance division by zero protection and error handling.
The code checks for zero in some denominators but not all. Also, consider adding logging for skipped calculations.
if (a.isSetCount && b.isSetCount && b.getCount != 0.0) {
val change = (a.getCount - b.getCount).toDouble / b.getCount
result.setCountChangePercent(change * 100)
+} else if (a.isSetCount && b.isSetCount) {
+ log.warn(s"Skipping count change calculation due to zero count in before summary")
}
if (a.isSetNullCount && b.isSetNullCount && b.getNullCount != 0) {
+ if (a.getCount == 0.0 || b.getCount == 0.0) {
+ log.warn("Skipping null ratio calculation due to zero count")
+ } else {
val aNullRatio = a.getNullCount.toDouble / a.getCount.toDouble
val bNullRatio = b.getNullCount.toDouble / b.getCount.toDouble
val change = (aNullRatio - bNullRatio) / bNullRatio
result.setNullRatioChangePercent(change * 100)
+ }
}
Also applies to: 45-51, 53-57, 59-65
@inline
private def toArray(l: java.util.List[java.lang.Double]): Array[Double] = {
  l.iterator().toScala.map(_.toDouble).toArray
}

@inline
private def toArray(l: java.util.List[java.lang.Integer]): Array[Int] = {
  l.iterator().toScala.map(_.toInt).toArray
}
🛠️ Refactor suggestion
Add null checks to array conversion methods.
The toArray
methods should handle null input gracefully to prevent NullPointerException
.
@inline
private def toArray(l: java.util.List[java.lang.Double]): Array[Double] = {
+ if (l == null) return Array.empty[Double]
l.iterator().toScala.map(_.toDouble).toArray
}
@inline
private def toArray(l: java.util.List[java.lang.Integer]): Array[Int] = {
+ if (l == null) return Array.empty[Int]
l.iterator().toScala.map(_.toInt).toArray
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@inline
private def toArray(l: java.util.List[java.lang.Double]): Array[Double] = {
  l.iterator().toScala.map(_.toDouble).toArray
}
@inline
private def toArray(l: java.util.List[java.lang.Integer]): Array[Int] = {
  l.iterator().toScala.map(_.toInt).toArray
}
@inline
private def toArray(l: java.util.List[java.lang.Double]): Array[Double] = {
  if (l == null) return Array.empty[Double]
  l.iterator().toScala.map(_.toDouble).toArray
}
@inline
private def toArray(l: java.util.List[java.lang.Integer]): Array[Int] = {
  if (l == null) return Array.empty[Int]
  l.iterator().toScala.map(_.toInt).toArray
}
}

// a is after, b is before
def between(a: TileSummary, b: TileSummary, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = {
🛠️ Refactor suggestion
Add comprehensive input validation.
The method should validate input parameters at the start.
def between(a: TileSummary, b: TileSummary, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = {
+ require(a != null, "After TileSummary (a) cannot be null")
+ require(b != null, "Before TileSummary (b) cannot be null")
+ require(metric != null, "DriftMetric cannot be null")
driftSeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")}

println("Drift series fetched successfully".green)
🛠️ Refactor suggestion
Add assertions for drift calculation results
The test prints drift series results but doesn't validate them. Add assertions to verify the expected behavior.
- driftSeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")}
+ driftSeries.foreach { s =>
+ logger.info(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")
+ // Add assertions
+ s.getKey.getColumn should not be null
+ s.getPercentilDriftSeries should not be empty
+ // Verify drift values are within expected range
+ s.getPercentilDriftSeries.toScala.foreach { drift =>
+ drift should be >= 0.0
+ drift should be <= 1.0
+ }
+ }
- println("Drift series fetched successfully".green)
+ logger.info("Drift series fetched successfully")
val startMs = PartitionSpec.daily.epochMillis("2023-01-01")
val endMs = PartitionSpec.daily.epochMillis("2023-01-29")
val summariesFuture = driftStore.getSummaries(join, Some(startMs), Some(endMs))
val summaries = Await.result(summariesFuture, Duration.create(10, TimeUnit.SECONDS))
println(summaries)

// fetch drift series
val driftSeriesFuture = driftStore.getDriftSeries(
join.metaData.nameToFilePath,
DriftMetric.JENSEN_SHANNON,
lookBack = new Window(7, chronon.api.TimeUnit.DAYS),
startMs,
endMs
)
val driftSeries = Await.result(driftSeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))
Improve async operation handling
The current implementation has several issues with async operations:
- Hardcoded timeout duration
- No error handling for failed futures
- Missing validation of results
+ private val DefaultTimeout = Duration(10, TimeUnit.SECONDS)
+
- val startMs = PartitionSpec.daily.epochMillis("2023-01-01")
- val endMs = PartitionSpec.daily.epochMillis("2023-01-29")
+ val startMs = PartitionSpec.daily.epochMillis(TestConfig.StartDate)
+ val endMs = PartitionSpec.daily.epochMillis(TestConfig.EndDate)
val summariesFuture = driftStore.getSummaries(join, Some(startMs), Some(endMs))
- val summaries = Await.result(summariesFuture, Duration.create(10, TimeUnit.SECONDS))
+ val summaries = try {
+ Await.result(summariesFuture, DefaultTimeout)
+ } catch {
+ case e: Exception =>
+ fail(s"Failed to fetch summaries: ${e.getMessage}")
+ }
+ summaries should not be null
+ summaries should not be empty
val driftSeriesFuture = driftStore.getDriftSeries(
join.metaData.nameToFilePath,
DriftMetric.JENSEN_SHANNON,
lookBack = new Window(7, chronon.api.TimeUnit.DAYS),
startMs,
endMs
)
- val driftSeries = Await.result(driftSeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))
+ val driftSeries = try {
+ Await.result(driftSeriesFuture.get, DefaultTimeout)
+ } catch {
+ case e: Exception =>
+ fail(s"Failed to fetch drift series: ${e.getMessage}")
+ }
+ driftSeries should not be empty
private def generateChartJs(chart: Chart, chartId: String): String = {
val data = chart.seriesList.map(_.series)
val xData = chart.x.map(_.toString)
chart.seriesList.map(_.name)

val seriesConfig = chart.seriesList.map(s => s"""{
| label: "${s.name}",
| stroke: "rgb(${scala.util.Random.nextInt(255)}, ${scala.util.Random.nextInt(255)}, ${scala.util.Random.nextInt(255)})"
|
|}""".stripMargin).mkString(",\n")

val thresholdLines = (chart.moderateThreshold.map(t => s"""
|{
| label: "Moderate Threshold",
| value: $t,
| stroke: "#ff9800",
| style: [2, 2]
|}""".stripMargin) ++
chart.severeThreshold.map(t => s"""
|{
| label: "Severe Threshold",
| value: $t,
| stroke: "#f44336",
| style: [2, 2]
|}""".stripMargin)).mkString(",")

s"""
|new uPlot({
| title: "${chart.name}",
| id: "$chartId",
| class: "chart",
| width: 800,
| height: 400,
| scales: {
| x: {
| time: false,
| }
| },
| series: [
| {},
| $seriesConfig
| ],
| axes: [
| {},
| {
| label: "Value",
| grid: true,
| }
| ],
| plugins: [
| {
| hooks: {
| draw: u => {
| ${if (thresholdLines.nonEmpty)
s"""const lines = [$thresholdLines];
| for (const line of lines) {
| const scale = u.scales.y;
| const y = scale.getPos(line.value);
|
| u.ctx.save();
| u.ctx.strokeStyle = line.stroke;
| u.ctx.setLineDash(line.style);
|
| u.ctx.beginPath();
| u.ctx.moveTo(u.bbox.left, y);
| u.ctx.lineTo(u.bbox.left + u.bbox.width, y);
| u.ctx.stroke();
|
| u.ctx.restore();
| }""".stripMargin
else ""}
| }
| }
| }
| ]
|}, [${xData.mkString("\"", "\",\"", "\"")}, ${data
.map(_.mkString(","))
.mkString("[", "],[", "]")}], document.getElementById("$chartId"));
|""".stripMargin
}
Add input validation and escape special characters
The chart generation logic has potential issues:
- Missing validation for array lengths between x-axis and series data
- Names used in JS string interpolation aren't escaped, potentially leading to XSS vulnerabilities
Add input validation and escaping:
private def generateChartJs(chart: Chart, chartId: String): String = {
// Validate data
require(chart.x.length == chart.seriesList.head.series.length,
"X-axis length must match series length")
chart.seriesList.foreach(s =>
require(s.series.length == chart.x.length,
s"Series ${s.name} length must match x-axis length"))
// Escape special characters
def escapeJs(s: String): String =
s.replace("\"", "\\\"").replace("'", "\\'")
val escapedChartName = escapeJs(chart.name)
val escapedSeriesNames = chart.seriesList.map(s => escapeJs(s.name))
// Rest of the implementation...
}
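If escaping is the route taken, a slightly more defensive variant is sketched below. It is only an illustration (escapeForInlineScript is not part of the PR); it additionally handles backslashes and the "</" sequence, which the quote-only replacement above would let through.
// Hypothetical helper, not the PR's code: escapes values before they are interpolated
// into the inline <script> body generated by generateChartJs.
def escapeForInlineScript(s: String): String =
  s.replace("\\", "\\\\")
    .replace("\"", "\\\"")
    .replace("'", "\\'")
    .replace("</", "<\\/") // prevents premature termination of the surrounding <script> tag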
def serve(board: Board, portVal: Int = 9032): Unit = {

object Server extends cask.MainRoutes {
@get("/")
def index() = {
val page = html(
head(
title(board.name),
script(src := "https://unpkg.com/[email protected]/dist/uPlot.iife.min.js"),
link(rel := "stylesheet", href := "https://unpkg.com/[email protected]/dist/uPlot.min.css"),
tag("style")("""
|body { font-family: Arial, sans-serif; margin: 20px; }
|.tab { display: none; }
|.tab.active { display: block; }
|.tab-button { padding: 10px 20px; margin-right: 5px; cursor: pointer; }
|.tab-button.active { background-color: #ddd; }
|.section { margin: 20px 0; }
|.chart { margin: 20px 0; }
""".stripMargin)
),
body(
h1(board.name),
div(cls := "tabs")(
board.tabList.map(tab =>
button(
cls := "tab-button",
onclick := s"showTab('${tab.name}')",
tab.name
))
),
board.tabList.map(tab =>
div(cls := "tab", id := tab.name)(
tab.sectionList.map(section =>
div(cls := "section")(
h2(section.name),
section.charts.map(chart =>
div(cls := "chart")(
div(id := s"${tab.name}-${section.name}-${chart.name}".replaceAll("\\s+", "-"))
))
))
)),
script(raw("""
|function showTab(tabName) {
| document.querySelectorAll('.tab').forEach(tab => {
| tab.style.display = tab.id === tabName ? 'block' : 'none';
| });
| document.querySelectorAll('.tab-button').forEach(button => {
| button.classList.toggle('active', button.textContent === tabName);
| });
|}
|
|// Show first tab by default
|document.querySelector('.tab-button').click();
""".stripMargin)),
script(
raw(
board.tabList
.flatMap(tab =>
tab.sectionList.flatMap(section =>
section.charts.map(chart =>
generateChartJs(chart, s"${tab.name}-${section.name}-${chart.name}".replaceAll("\\s+", "-")))))
.mkString("\n")
))
)
)

// page.render

cask.Response(
page.render,
headers = Seq("Content-Type" -> "text/html")
)
}

override def host: String = "0.0.0.0"
override def port: Int = portVal

initialize()
}

Server.main(Array())
}
Add security headers and restrict binding address
The server implementation has security concerns:
- Binding to 0.0.0.0 exposes the server on all network interfaces
- Missing security headers like Content-Security-Policy
Consider these security improvements:
def serve(board: Board, portVal: Int = 9032, bindHost: String = "127.0.0.1"): Unit = {
object Server extends cask.MainRoutes {
override def host: String = bindHost // renamed from `host` so the override doesn't shadow the parameter and call itself recursively
@get("/")
def index() = {
cask.Response(
page.render,
headers = Seq(
"Content-Type" -> "text/html",
"Content-Security-Policy" -> "default-src 'self' unpkg.com; script-src 'self' 'unsafe-inline' unpkg.com",
"X-Content-Type-Options" -> "nosniff",
"X-Frame-Options" -> "DENY"
)
)
}
// ...
}
}
api/thrift/api.thrift
Outdated
// PivotUtils.pivot(Array[(Long, TileDrift)]) = TileDriftSeries
// used in front end after this is computed
struct TileDriftSeries {
1: optional list<double> percentilDriftSeries
2: optional list<double> histogramDriftSeries
3: optional list<double> countChangePercentSeries
4: optional list<double> nullRatioChangePercentSeries

5: optional list<double> innerCountChangePercentSeries
6: optional list<double> innerNullCountChangePercentSeries
7: optional list<double> lengthPercentilesDriftSeries
8: optional list<double> stringLengthPercentilesDriftSeries

200: optional list<i64> timestamps

300: optional TileSeriesKey key
}
🛠️ Refactor suggestion
Restructure TileDriftSeries and fix naming inconsistencies.
- Consider using a list of TileDrift objects for better maintainability
- Fix the naming inconsistencies to match the changes in TileDrift
-// PivotUtils.pivot(Array[(Long, TileDrift)]) = TileDriftSeries
-// used in front end after this is computed
+/**
+ * Time series of drift metrics, created by pivoting (TileDrift, timestamp) tuples
+ * using PivotUtils. Used by the frontend for visualization.
+ */
struct TileDriftSeries {
- 1: optional list<double> percentilDriftSeries
- 2: optional list<double> histogramDriftSeries
- 3: optional list<double> countChangePercentSeries
- 4: optional list<double> nullRatioChangePercentSeries
- 5: optional list<double> innerCountChangePercentSeries
- 6: optional list<double> innerNullCountChangePercentSeries
- 7: optional list<double> lengthPercentilesDriftSeries
- 8: optional list<double> stringLengthPercentilesDriftSeries
+ 1: optional list<TileDrift> drifts
}
Committable suggestion skipped: line range outside the PR's diff.
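To make the pivot relationship in the struct comment concrete, here is a minimal sketch with plain case classes. SimpleDrift and SimpleDriftSeries are stand-ins invented for the example, not the Thrift-generated TileDrift/TileDriftSeries types, and this is not the actual PivotUtils code.
// Illustration only: an array of (timestamp, drift) rows becomes one columnar series per field.
case class SimpleDrift(percentileDrift: Double, histogramDrift: Double)
case class SimpleDriftSeries(timestamps: Array[Long],
                             percentileDriftSeries: Array[Double],
                             histogramDriftSeries: Array[Double])

def pivot(rows: Array[(Long, SimpleDrift)]): SimpleDriftSeries = {
  val sorted = rows.sortBy(_._1) // series are keyed by ascending timestamp
  SimpleDriftSeries(
    timestamps = sorted.map(_._1),
    percentileDriftSeries = sorted.map(_._2.percentileDrift),
    histogramDriftSeries = sorted.map(_._2.histogramDrift)
  )
}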
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (10)
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (1)
Line range hint
35-44
: Replace println with proper test loggingThe
showTable
helper method usesprintln
for debugging output. Consider using ScalaTest's built-in logging or a proper logging framework for better test output management.def showTable(name: String)(implicit tableUtils: TableUtils): Unit = { - println(s"Showing table $name".yellow) + info(s"Showing table $name") val df = tableUtils.loadTable(name) val maxColNameLength = df.schema.fieldNames.map(_.length).max def pad(s: String): String = s.padTo(maxColNameLength, ' ') df.schema.fields.foreach { f => - println(s" ${pad(f.name)} : ${f.dataType.typeName}".yellow) + info(s" ${pad(f.name)} : ${f.dataType.typeName}") } df.show(10, truncate = false) }spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (2)
5-5
: Consider using explicit imports instead of wildcard importWhile
import ai.chronon.api._
consolidates multiple imports, explicit imports make dependencies clearer and help prevent naming conflicts.-import ai.chronon.api._ +import ai.chronon.api.{ColorString, Extensions, /* other specific imports */}
Line range hint
143-143
: Consider implementing persistence for cardinality calculationsThe TODO comment indicates missing persistence for cardinality calculations. This could lead to unnecessary recalculations and performance impact, especially with large datasets.
Consider:
- Implementing a caching mechanism using the kvstore
- Adding cache invalidation strategy based on data changes
- Including versioning for cached cardinality maps
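A minimal sketch of what such a cache could look like, assuming a generic key-value interface. SimpleStore and CardinalityCache below are stand-ins invented for the example and are not Chronon's KVStore API.
import scala.collection.concurrent.TrieMap

trait SimpleStore {
  def get(key: String): Option[Array[Byte]]
  def put(key: String, value: Array[Byte]): Unit
}

class InMemorySimpleStore extends SimpleStore {
  private val data = TrieMap.empty[String, Array[Byte]]
  override def get(key: String): Option[Array[Byte]] = data.get(key)
  override def put(key: String, value: Array[Byte]): Unit = data.update(key, value)
}

class CardinalityCache(store: SimpleStore, version: String) {
  // The key embeds a version so that a schema or logic change invalidates old entries.
  private def key(table: String, column: String): String = s"cardinality/$version/$table/$column"

  def getOrCompute(table: String, column: String)(compute: => Long): Long =
    store.get(key(table, column)) match {
      case Some(bytes) => new String(bytes, "UTF-8").toLong
      case None =>
        val value = compute // the expensive scan happens only on a cache miss
        store.put(key(table, column), value.toString.getBytes("UTF-8"))
        value
    }
}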
api/thrift/api.thrift (3)
253-256
: Enhance documentation for the TileSummary struct.The comments are too brief and don't fully explain the purpose and usage of this important struct. Consider adding comprehensive documentation that explains:
- The relationship between categorical and numerical types
- The meaning of inner values and container types
- Examples of when to use different fields
319-335
: Consider restructuring TileDriftSeries for better maintainability.The current implementation duplicates fields from TileDrift, which could lead to maintenance issues. Additionally, the pivoting process needs better documentation.
Consider:
- Using a list of TileDrift objects instead of individual series
- Adding comprehensive documentation about the pivoting process
- Including examples of how the pivoting affects the data structure
struct TileDriftSeries { - 1: optional list<double> percentilDriftSeries - 2: optional list<double> histogramDriftSeries - 3: optional list<double> countChangePercentSeries - 4: optional list<double> nullRatioChangePercentSeries - 5: optional list<double> innerCountChangePercentSeries - 6: optional list<double> innerNullCountChangePercentSeries - 7: optional list<double> lengthPercentilesDriftSeries - 8: optional list<double> stringLengthPercentilesDriftSeries + 1: optional list<TileDrift> drifts }
354-359
: Move threshold documentation closer to implementation.The threshold documentation currently exists in the DriftMetric enum comment but would be more useful here where the drift metric is actually used.
Add documentation about thresholds:
// the current tile summary will be compared with older summaries using the metric // if the drift is more than the threshold, we will raise an alert + // Thresholds: + // - JENSEN_SHANNON: moderate (0.05-0.1), severe (>0.1) + // - HELLINGER: moderate (0.1-0.25), severe (>0.25) + // - PSI: moderate (0.1-0.2), severe (>0.2) // default drift metric to usespark/src/main/scala/ai/chronon/spark/Driver.scala (4)
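For intuition about what these thresholds measure, here is a small, self-contained Jensen-Shannon computation. It assumes natural-log KL divergence and is not the repository's DriftMetrics implementation, which may normalize differently.
object JsdSketch {
  // KL(p || q), skipping zero-probability buckets
  private def klDivergence(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).collect { case (pi, qi) if pi > 0 && qi > 0 => pi * math.log(pi / qi) }.sum

  def jensenShannon(p: Array[Double], q: Array[Double]): Double = {
    val m = p.zip(q).map { case (pi, qi) => (pi + qi) / 2 }
    0.5 * klDivergence(p, m) + 0.5 * klDivergence(q, m)
  }

  def main(args: Array[String]): Unit = {
    val before = Array(0.25, 0.25, 0.25, 0.25)
    val after  = Array(0.30, 0.30, 0.20, 0.20)
    val drift = jensenShannon(after, before)
    // roughly 0.005 for this pair, well under the 0.05 moderate threshold quoted above
    println(f"JSD drift: $drift%.4f")
  }
}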
868-871
: Enhance logging message with additional context.The current log message could be more informative by including the properties being set.
- logger.info(s"Creating table '${Constants.TiledSummaryDataset}'") + logger.info(s"Creating table '${Constants.TiledSummaryDataset}' with properties: $props")
882-883
: Track TODO item for configuration management.The TODO comment indicates a need to refactor the configuration handling to be more automated.
Would you like me to create a GitHub issue to track this technical debt item?
911-913
: Enhance error logging with complete context.The error message could be more helpful by including all relevant parameters.
- logger.error(s"Failed to summarize and upload data for conf: $confPath", e) + logger.error(s"Failed to summarize and upload data. Conf: $confPath, Parquet: ${args.parquetPath()}", e)
900-901
: Reduce code duplication by extracting confPath.The
confPath
variable is used in multiple places and could be extracted to avoid repetition.- val confPath = args.confPath() - val summarizer = new Summarizer(confPath, timeColumn = args.timeColumn.toOption) + val confPath = args.confPath() + val summarizer = new Summarizer(confPath, timeColumn = args.timeColumn.toOption) try { val df = sparkSession.read.parquet(args.parquetPath()) val (result, summaryExprs) = summarizer.computeSummaryDf(df) - val packer = new SummaryPacker(confPath, summaryExprs, summarizer.tileSize, summarizer.sliceColumns) + val packer = new SummaryPacker(summarizer.confPath, summaryExprs, summarizer.tileSize, summarizer.sliceColumns)Also applies to: 905-905
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (9)
api/src/main/scala/ai/chronon/api/Builders.scala
(3 hunks)api/src/main/scala/ai/chronon/api/Constants.scala
(2 hunks)api/thrift/api.thrift
(4 hunks)build.sbt
(1 hunks)online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/Driver.scala
(4 hunks)spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala
(6 hunks)spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala
(1 hunks)spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- api/src/main/scala/ai/chronon/api/Constants.scala
- build.sbt
- online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala
- spark/src/test/scala/ai/chronon/spark/test/JoinTest.scala
🔇 Additional comments (8)
api/src/main/scala/ai/chronon/api/Builders.scala (3)
270-271
: Past review comment is still applicable
The previous review comment about using Option[DriftSpec]
instead of null is still valid and hasn't been addressed.
287-287
: LGTM!
The historicalBackfill setter is correctly implemented and follows the established pattern.
296-297
: Verify drift metrics integration and add tests
Since this PR is focused on drift metrics and is marked as WIP:
- Please ensure proper integration with other drift-related components
- Add unit tests to verify the drift functionality
Let's check for related drift metric tests and implementations:
✅ Verification successful
Let's gather more information about the test coverage and implementation details:
Based on the comprehensive test and implementation evidence gathered, I can now generate the final response:
Drift functionality is well-tested and integrated
The codebase shows robust drift metric implementation with:
- Comprehensive unit tests covering different drift scenarios (low, moderate, severe) in
DriftMetricsTest.scala
- End-to-end integration tests in
DriftTest.scala
- Well-implemented drift metrics (Jensen-Shannon, PSI, Hellinger) with proper thresholds
- Complete integration with the DriftStore and other drift-related components
The null check in the builder is consistent with the codebase patterns and the DriftSpec
Thrift definition.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for drift-related tests and implementations
# Check for drift-related test files
echo "Searching for drift-related test files..."
fd -e scala -e java test.*Drift
# Check for drift metric implementations
echo "Searching for drift metric implementations..."
rg -l "class.*Drift|trait.*Drift|object.*Drift"
# Check for DriftSpec usage
echo "Checking DriftSpec usage..."
rg "DriftSpec"
Length of output: 1312
Script:
#!/bin/bash
# Check content of drift-related test files
echo "=== Content of DriftTest.scala ==="
cat "spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala"
echo -e "\n=== Content of DriftMetricsTest.scala ==="
cat "online/src/test/scala/ai/chronon/online/test/stats/DriftMetricsTest.scala"
# Check DriftSpec structure
echo -e "\n=== DriftSpec structure from Thrift ==="
rg -A 10 "struct DriftSpec" api/thrift/api.thrift
# Check implementation details
echo -e "\n=== DriftMetrics implementation ==="
cat "online/src/main/scala/ai/chronon/online/stats/DriftMetrics.scala"
Length of output: 13484
spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (3)
302-302
: Verify nameToFilePath implementation
The change uses metadata.nameToFilePath
for configuration path conversion. Ensure this method properly handles path conversion and exists in the MetaData class.
✅ Verification successful
Based on the results, I can now generate the final response:
nameToFilePath implementation is properly defined and consistently used
The implementation is verified in api/src/main/scala/ai/chronon/api/Extensions.scala
:
def nameToFilePath: String = metaData.name.replaceFirst("\\.", "/")
The method correctly converts dot-notation names to file paths by replacing the first dot with a forward slash. This implementation is used consistently across the codebase for path conversion in various components including:
- Spark jobs (LogFlattenerJob, ConsistencyJob)
- Test suites (SchemaEvolutionTest, FetcherTest)
- Online components (MetadataStore, Fetcher)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check the implementation of nameToFilePath
rg "def\s+nameToFilePath|val\s+nameToFilePath" --type scala
# Check usage patterns
rg "\.nameToFilePath" --type scala -A 2
Length of output: 6241
30-30
: Parameter rename improves clarity but verify usage
The rename from name
to confPath
better reflects the parameter's purpose as a configuration path.
✅ Verification successful
Parameter rename is safe and all usages are consistent
The verification shows that all instances of Summarizer
instantiation are already using the parameter as a path:
- In
Driver.scala
:new Summarizer(confPath, ...)
- In
Summarizer.scala
:new Summarizer(metadata.nameToFilePath, ...)
- No instances found using the old parameter name 'name'
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any remaining references to the old parameter name
rg "new Summarizer\([^)]*name\s*=" --type scala
Length of output: 49
Script:
#!/bin/bash
# Search for Summarizer instantiations to check parameter usage
rg "new Summarizer\(" -A 2 --type scala
# Also search for references to Summarizer class to understand usage
rg "Summarizer" --type scala
Length of output: 1719
229-229
: Verify thread safety of shared serializer
While the changes maintain consistency with the Summarizer
class, using a shared compactSerializer
requires careful consideration of thread safety in a Spark environment.
Also applies to: 246-246, 256-256
✅ Verification successful
Let me check the TSerializer implementation to understand its thread safety guarantees.
Based on the search results, I can now generate the final response since I have enough information about the serializer implementation:
Shared serializer is thread-safe by design
The compactSerializer
is implemented in DriftStore
as a factory method that creates a new instance for each call:
def compactSerializer: SerializableSerializer = new SerializableSerializer(new TBinaryProtocol.Factory())
This ensures thread safety as each thread gets its own serializer instance. The change from local definition to imported serializer is safe to use in Spark's distributed environment.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check the implementation of compactSerializer for thread safety
rg -A 5 "compactSerializer\s*=|def\s+compactSerializer" --type scala
Length of output: 1161
Script:
#!/bin/bash
# Check if TSerializer is instantiated per request or shared
rg -B 3 -A 3 "TSerializer|compactSerializer" --type scala
# Check if there are any comments or documentation about thread safety
rg -B 3 -A 3 "thread.*safe|concurrent" --type scala
Length of output: 46104
api/thrift/api.thrift (1)
302-302
: Previous feedback remains unaddressed.
spark/src/main/scala/ai/chronon/spark/Driver.scala (1)
Line range hint 949-953
: LGTM: Command registration changes are consistent.
The command registration and handling changes correctly reflect the object renaming from CreateStatsTable to CreateSummaryDataset.
Also applies to: 990-991
// mock api impl for online fetching and uploading
val kvStoreFunc: () => KVStore = () => {
// cannot reuse the variable - or serialization error
val result = InMemoryKvStore.build("drift_test", () => null)
result
}
🛠️ Refactor suggestion
Add resource cleanup for KVStore
The KVStore instance should be properly cleaned up after the test to prevent resource leaks.
val kvStoreFunc: () => KVStore = () => {
val result = InMemoryKvStore.build("drift_test", () => null)
+ // Register cleanup hook
+ sys.addShutdownHook {
+ try {
+ result.drop(Constants.MetadataDataset)
+ result.drop(Constants.TiledSummaryDataset)
+ } catch {
+ case e: Exception =>
+ info(s"Failed to cleanup KVStore: ${e.getMessage}")
+ }
+ }
result
}
Committable suggestion skipped: line range outside the PR's diff.
// fetch keys
val tileKeys = driftStore.tileKeysForJoin(join)
println(tileKeys)
🛠️ Refactor suggestion
Add assertions for DriftStore tile keys
The test prints tile keys but doesn't validate them.
// fetch keys
val tileKeys = driftStore.tileKeysForJoin(join)
- println(tileKeys)
+ tileKeys should not be empty
+ // Verify key format
+ tileKeys.foreach { key =>
+ key should include(namespace)
+ key should include(join.metaData.nameToFilePath)
+ }
+ info(s"Fetched ${tileKeys.size} tile keys")
val startMs = PartitionSpec.daily.epochMillis("2023-01-01")
val endMs = PartitionSpec.daily.epochMillis("2023-01-29")
val summariesFuture = driftStore.getSummaries(join, Some(startMs), Some(endMs))
val summaries = Await.result(summariesFuture, Duration.create(10, TimeUnit.SECONDS))
println(summaries)
🛠️ Refactor suggestion
Improve async operation handling
The code uses hardcoded timeout duration and lacks proper error handling for async operations.
+ private val DefaultTimeout = Duration(10, TimeUnit.SECONDS)
+
val startMs = PartitionSpec.daily.epochMillis("2023-01-01")
val endMs = PartitionSpec.daily.epochMillis("2023-01-29")
val summariesFuture = driftStore.getSummaries(join, Some(startMs), Some(endMs))
- val summaries = Await.result(summariesFuture, Duration.create(10, TimeUnit.SECONDS))
+ val summaries = try {
+ Await.result(summariesFuture, DefaultTimeout)
+ } catch {
+ case e: Exception =>
+ fail(s"Failed to fetch summaries: ${e.getMessage}")
+ }
- println(summaries)
+ summaries should not be empty
+ info(s"Fetched ${summaries.size} summaries")
// fetch drift series
val driftSeriesFuture = driftStore.getDriftSeries(
join.metaData.nameToFilePath,
DriftMetric.JENSEN_SHANNON,
lookBack = new Window(7, chronon.api.TimeUnit.DAYS),
startMs,
endMs
)
val driftSeries = Await.result(driftSeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))

val txnsTable = s"$namespace.txns"
spark.sql(s"DROP TABLE IF EXISTS $txnsTable")
driftSeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")}

DataFrameGen.events(spark, dollarTransactions, numRows, partitions = partitions).save(txnsTable)
TableUtils(spark).loadTable(txnsTable)
println("Drift series fetched successfully".green)
🛠️ Refactor suggestion
Add assertions for drift calculation results
The test prints drift series results but doesn't validate them. Add proper assertions to verify the expected behavior.
val driftSeriesFuture = driftStore.getDriftSeries(
join.metaData.nameToFilePath,
DriftMetric.JENSEN_SHANNON,
lookBack = new Window(7, chronon.api.TimeUnit.DAYS),
startMs,
endMs
)
- val driftSeries = Await.result(driftSeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))
+ val driftSeries = try {
+ Await.result(driftSeriesFuture.get, DefaultTimeout)
+ } catch {
+ case e: Exception =>
+ fail(s"Failed to fetch drift series: ${e.getMessage}")
+ }
- driftSeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")}
+ driftSeries should not be empty
+ driftSeries.foreach { s =>
+ info(s"${s.getKey.getColumn}: ${s.getPercentilDriftSeries.toScala}")
+ // Verify drift values
+ s.getKey.getColumn should not be null
+ s.getPercentilDriftSeries should not be empty
+ // Verify drift values are within expected range
+ s.getPercentilDriftSeries.toScala.foreach { drift =>
+ drift should be >= 0.0
+ drift should be <= 1.0
+ }
+ }
- println("Drift series fetched successfully".green)
+ info("Drift series fetched successfully")
@nikhil-zlai largely looks good. I flagged a couple of code changes that I'm not sure are intended, as they seem to be yanking out some parts of the upstream base (and not directly linked to drift).
A question - are your offset update changes part of this PR? I still saw the old behavior (start-offset, end), so I'm confirming whether I either missed it elsewhere or it needs to come as a follow-up.
_ match {
case Success(responseContext) => Some(responseContext)
// TODO instrument failures
case Failure(exception) => exception.printStackTrace(); None
works for me
@@ -491,45 +490,6 @@ object Driver {
}
}

object DailyStats {
revert? We shouldn't need to drop these, right?
this is the half-finished version from Cristian - might as well kill it?
yeah I guess, it would be nice to have this in a follow-up since this PR is super large as is, but now that it's here we can go with this.
@@ -65,126 +47,7 @@ class FetchStatsTest extends TestCase {
val namespace = "fetch_stats"
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
private val today = tableUtils.partitionSpec.at(System.currentTimeMillis())
private val yesterday = tableUtils.partitionSpec.before(today)

def testFetchStats(): Unit = {
revert?
same as above
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (14)
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (2)
3-26
: Consider organizing imports for better readabilityConsider organizing imports into logical groups:
- Scala/Java standard libraries
- External dependencies
- Internal chronon packages
package ai.chronon.spark.test.stats.drift -import ai.chronon -import ai.chronon.api.ColorPrinter.ColorString -import ai.chronon.api.Constants -import ai.chronon.api.DriftMetric -import ai.chronon.api.Extensions.MetadataOps -import ai.chronon.api.PartitionSpec -import ai.chronon.api.Window -import ai.chronon.online.KVStore -import ai.chronon.online.stats.DriftStore -import ai.chronon.spark.SparkSessionBuilder -import ai.chronon.spark.TableUtils -import ai.chronon.spark.stats.drift.Summarizer -import ai.chronon.spark.stats.drift.SummaryUploader -import ai.chronon.spark.test.InMemoryKvStore -import ai.chronon.spark.test.MockApi -import org.apache.spark.sql.SparkSession -import org.scalatest.flatspec.AnyFlatSpec -import org.scalatest.matchers.should.Matchers -import java.util.concurrent.TimeUnit -import scala.concurrent.Await -import scala.concurrent.duration.Duration -import scala.util.ScalaJavaConversions.ListOps +// Scala/Java standard libraries +import java.util.concurrent.TimeUnit +import scala.concurrent.{Await, duration} +import scala.util.ScalaJavaConversions.ListOps + +// External dependencies +import org.apache.spark.sql.SparkSession +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +// Internal chronon packages +import ai.chronon +import ai.chronon.api.{ + ColorPrinter, Constants, DriftMetric, + Extensions, PartitionSpec, Window +} +import ai.chronon.api.ColorPrinter.ColorString +import ai.chronon.api.Extensions.MetadataOps +import ai.chronon.online.{KVStore, stats} +import ai.chronon.online.stats.DriftStore +import ai.chronon.spark.{SparkSessionBuilder, TableUtils} +import ai.chronon.spark.stats.drift.{Summarizer, SummaryUploader} +import ai.chronon.spark.test.{InMemoryKvStore, MockApi}
96-102
: Review drift analysis time window configurationThe lookback window is hardcoded to 7 days. Consider:
- Making this configurable via test parameters
- Verifying if 7 days is sufficient for detecting drift patterns
- Adding test cases with different window sizes to validate boundary conditions
+ private val DefaultLookbackDays = 7 + val driftSeriesFuture = driftStore.getDriftSeries( join.metaData.nameToFilePath, DriftMetric.JENSEN_SHANNON, - lookBack = new Window(7, chronon.api.TimeUnit.DAYS), + lookBack = new Window(DefaultLookbackDays, chronon.api.TimeUnit.DAYS), startMs, endMs )online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (3)
26-82
: Consider breaking down the large method into smaller, focused methods.The
between
method has grown quite large with multiple similar calculation blocks. Consider extracting each drift calculation into a separate private method to improve readability and maintainability.Example refactoring:
+private def calculatePercentileDrift(a: TileSummary, b: TileSummary, metric: DriftMetric, result: TileDrift): Unit = { + if (a.isSetPercentiles && b.isSetPercentiles) { + val drift = percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric) + result.setPercentileDrift(drift) + } +} +private def calculateHistogramDrift(a: TileSummary, b: TileSummary, metric: DriftMetric, result: TileDrift): Unit = { + if (a.isSetHistogram && b.isSetHistogram) { + val drift = histogramDistance(a.getHistogram, b.getHistogram, metric) + result.setHistogramDrift(drift) + } +} def between(a: TileSummary, b: TileSummary, metric: DriftMetric = DriftMetric.JENSEN_SHANNON): TileDrift = { val result = new TileDrift() - if (a.isSetPercentiles && b.isSetPercentiles) { - val drift = percentileDistance(toArray(a.getPercentiles), toArray(b.getPercentiles), metric) - result.setPercentileDrift(drift) - } - if (a.isSetHistogram && b.isSetHistogram) { - val drift = histogramDistance(a.getHistogram, b.getHistogram, metric) - result.setHistogramDrift(drift) - } + calculatePercentileDrift(a, b, metric, result) + calculateHistogramDrift(a, b, metric, result) // ... similar refactoring for other drift calculations result }
90-91
: Track optimization TODO with an issue.The TODO comment indicates a potential performance optimization. This should be tracked properly.
Would you like me to create a GitHub issue to track this optimization task? The issue would include:
- Current implementation details
- Performance implications
- Proposed optimization approach using a single pass
86-104
: Add detailed method documentation.The method would benefit from more comprehensive documentation explaining:
- The purpose and expected behavior
- Parameter requirements and constraints
- Return value format and meaning
- Edge cases handling
Example documentation:
/** * Calculates drift metrics between pairs of tile summaries based on a time window. * * @param summariesWithTimestamps Array of (TileSummary, timestamp) pairs to analyze * @param metric The drift metric to use for calculations * @param startMs The starting timestamp for drift calculations * @param lookBackWindow The time window to look back for previous summaries * @return Array of (TileDrift, timestamp) pairs containing drift metrics * * Note: For each summary with timestamp >= startMs, the method finds a previous * summary using lookBackWindow and calculates drift between them. */online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (3)
16-19
: Remove outdated TODO commentThe TODO comment about implementing
TileDrift => TileDriftSeries
is outdated since this functionality is already implemented in this file.-// TODO: TileDrift => TileDriftSeries
77-79
: Consider adding documentation for the pipe operatorThe pipe operator (
|>
) is a powerful feature, but its purpose might not be immediately clear to other developers. Consider adding ScalaDoc to explain its usage and benefits.+ /** + * Enables function chaining using the |> operator. + * Example: value |> function1 |> function2 + * is equivalent to: function2(function1(value)) + */ implicit class PipeOps[A](val value: A) extends AnyVal {
81-112
: Consider performance optimization for large datasetsThe current implementation creates multiple intermediate arrays and performs multiple iterations. For large datasets, this could impact performance. Consider:
- Using a single pass to collect all required data
- Preallocating collections with known sizes
- Using parallel collections for large datasets
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (5)
35-35
: Enhance error message for better debuggingThe error message could be more descriptive by including the expected format or requirements for the drift tile size.
- require(tileSize.nonEmpty, s"Drift tile size not set for join $joinName") + require(tileSize.nonEmpty, s"Drift tile size not set for join '$joinName'. Please ensure drift tile size is configured in join metadata.")
111-117
: Enhance error handling with context informationWhen handling failures, include relevant context information such as the tile key and group name to aid in debugging.
case Failure(exception) => - exception.printStackTrace(); None + val errorContext = s"Failed to process response for tile key: ${requestContext.tileKey}, group: ${requestContext.groupName}" + logger.error(errorContext, exception) + None
163-178
: Document method behavior and address TODOThe method would benefit from ScalaDoc documentation explaining its behavior and parameters. Additionally, the TODO about over-fetching should be tracked.
Would you like me to:
- Generate ScalaDoc documentation for this method?
- Create a GitHub issue to track the over-fetching optimization?
180-191
: Maintain documentation consistencyFor consistency with getDriftSeries, this method should also have ScalaDoc documentation explaining its behavior and parameters.
197-199
: Enhance protocol warning documentationThe warning about the compact protocol could be more specific about the nature of the bug and why binary protocol is preferred.
- // crazy bug in compact protocol - do not change to compact + // WARNING: Using binary protocol instead of compact protocol due to known serialization issues + // with complex nested structures in Thrift compact protocol. + // See: [Add link to issue/documentation if available]api/thrift/api.thrift (1)
253-256
: Enhance documentation for TileSummary struct.The current documentation provides a basic overview, but could be enhanced with:
- Examples of what constitutes a "slice"
- Explanation of how categorical vs continuous types are handled
- Description of container type handling
-// summary of distribution & coverage etc for a given (table, column, slice, tileWindow) -// for categorical types, distribution is histogram, otherwise percentiles -// we also handle container types by counting inner value distribution and inner value coverage +/** + * Summary of distribution and coverage metrics for a given (table, column, slice, tileWindow). + * + * @param percentiles For continuous types (numeric), stores distribution as percentiles + * @param histogram For categorical types, stores distribution as value frequencies + * @param count Total number of non-null values + * @param nullCount Total number of null values + * + * Container type handling: + * - innerCount: Total elements across all containers (e.g., total items in all arrays) + * - innerNullCount: Total null elements across all containers + * - lengthPercentiles: Distribution of container lengths + * + * String type handling: + * - stringLengthPercentiles: Distribution of string lengths for high cardinality fields + */
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
api/thrift/api.thrift
(4 hunks)online/src/main/scala/ai/chronon/online/stats/DriftStore.scala
(1 hunks)online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala
(1 hunks)online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
(1 hunks)online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala
(1 hunks)spark/src/test/scala/ai/chronon/spark/test/FeatureWithLabelJoinTest.scala
(1 hunks)spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala
(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- spark/src/test/scala/ai/chronon/spark/test/FeatureWithLabelJoinTest.scala
🚧 Files skipped from review as they are similar to previous changes (1)
- online/src/test/scala/ai/chronon/online/test/stats/PivotUtilsTest.scala
🔇 Additional comments (9)
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (1)
46-55
: Verify impact of reduced sample size
The sample size has been reduced from 1,000,000 to 100,000 records. While this improves test execution time, please verify that:
- The smaller dataset still provides adequate coverage for drift detection
- Edge cases are still represented in the reduced dataset
- Performance characteristics can still be evaluated effectively
✅ Verification successful
Let me check the content of these files to understand if there are any specific performance requirements or benchmarks that might be affected by the reduced dataset size.
Let me check the default sample size in PrepareData
and look for any drift detection thresholds or requirements that might be affected by the sample size.
Let me check if there are any test assertions or validations in the drift test that might depend on the sample size.
After analyzing the complete context, I can now provide the final response:
Sample size reduction is acceptable for this end-to-end test
The reduced sample size (100K) is sufficient for this test case because:
- This is an end-to-end test that verifies the workflow functionality (data preparation, summarization, upload, and fetch) rather than statistical accuracy
- The test focuses on validating the drift detection pipeline's operational aspects, including tile key generation, summary series, and drift series computation
- No specific statistical thresholds or performance benchmarks are tied to the sample size in this test
The date range change (2023-01-01 to 2023-01-30) is intentional and appropriate as it aligns with the drift series validation period used in the test assertions.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if there are any performance-related tests or benchmarks that might be affected
rg -l "performance|benchmark|stress" --type scala
Length of output: 433
Script:
#!/bin/bash
# Check the content of test files for performance-related assertions or requirements
rg -A 5 -B 5 "performance|benchmark|stress" spark/src/test/scala/ai/chronon/spark/test/FetcherTest.scala
rg -A 5 -B 5 "performance|benchmark|stress" aggregator/src/test/scala/ai/chronon/aggregator/test/SawtoothAggregatorTest.scala
# Also check if there are any other drift tests with specific dataset size requirements
rg -A 5 -B 5 "generateFraudSampleData" spark/src/test/scala/ai/chronon/spark/test/stats/drift/
Length of output: 3702
Script:
#!/bin/bash
# Check for drift detection thresholds or requirements
rg -A 10 "class PrepareData" spark/src/test/scala/ai/chronon/spark/test/stats/drift/PrepareData.scala
# Look for any drift-related configurations or constants
rg "val.*Threshold|val.*Sample|val.*Size" spark/src/main/scala/ai/chronon/spark/stats/drift/
Length of output: 898
Script:
#!/bin/bash
# Check the complete DriftTest file for assertions and validations
cat spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala
# Also check for any drift-related test utilities
rg -l "DriftTest" spark/src/test/scala/ai/chronon/spark/test/
Length of output: 4582
online/src/main/scala/ai/chronon/online/stats/PivotUtils.scala (1)
1-15
: LGTM! Well-organized imports
The package structure and imports are properly organized, with clear separation between API and Java types.
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (4)
1-27
: LGTM! Well-structured class definition with appropriate imports
The class is well-organized with clear inheritance from MetadataStore and proper parameter defaults.
57-71
: LGTM! Well-designed helper classes with clear responsibilities
The case classes and conversion methods are well-structured and maintain immutability.
131-143
: LGTM! Well-documented time range handling
The Range class implementation is clear, with excellent documentation explaining the different scenarios. The logic correctly handles both small and large ranges with respect to lookback periods.
156-160
: Consider handling empty sequence edge case
The sequence matching could fail if the Future.sequence returns an empty or differently sized sequence.
- Future.sequence(Seq(currentSummaries, oldSummaries)).map {
- case Seq(current, old) =>
- old ++ current
+ Future.sequence(Seq(currentSummaries, oldSummaries)).map {
+ case Seq(current, old) =>
+ old ++ current
+ case other =>
+ logger.warn(s"Unexpected response sequence size: ${other.size}")
+ Seq.empty
}
api/thrift/api.thrift (3)
297-318
: Previous review comment about naming inconsistencies is still valid.
319-335
: Previous review comment about restructuring is still valid.
393-395
: LGTM!
The addition of driftSpec to MetaData is clean and well-documented.
val summarySeriesFuture = driftStore.getSummarySeries(
join.metaData.nameToFilePath,
startMs,
endMs
)
val summarySeries = Await.result(summarySeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))
summarySeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentiles.toScala}")}
println("Summary series fetched successfully".green)
🛠️ Refactor suggestion
Add validation for summary series data
The test retrieves summary series but doesn't validate the data structure or content. Add assertions to verify:
- The structure of the summary series
- The completeness of the data
- The validity of percentile values
val summarySeries = Await.result(summarySeriesFuture.get, Duration.create(10, TimeUnit.SECONDS))
- summarySeries.foreach{s => println(s"${s.getKey.getColumn}: ${s.getPercentiles.toScala}")}
- println("Summary series fetched successfully".green)
+ summarySeries should not be empty
+ summarySeries.foreach { s =>
+ withClue(s"Summary series for column ${s.getKey.getColumn}") {
+ s.getKey.getColumn should not be null
+ val percentiles = s.getPercentiles.toScala
+ percentiles should not be empty
+ // Verify percentile values are within valid range [0,1]
+ percentiles.foreach { p =>
+ p should be >= 0.0
+ p should be <= 1.0
+ }
+ }
+ }
+ info("Summary series validated successfully")
will do in a later PR
@nikhil-zlai, understood. Would you like me to open a new GitHub issue to track this?
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
spark/src/main/scala/ai/chronon/spark/Driver.scala (2)
841-850: Consider adding more detailed logging.
While the implementation is correct, consider logging the properties being set for better observability:
  def run(args: Args): Unit = {
    logger.info(s"Creating table '${Constants.TiledSummaryDataset}'")
    val store = args.api.genKvStore
    val props = Map("is-time-sorted" -> "true")
+   logger.info(s"Setting properties: $props")
    store.create(Constants.TiledSummaryDataset, props)
  }
Line range hint 857-892: Enhance error handling with more context and cleanup.
While the error handling is good, it could be improved to provide more context and handle cleanup.
  try {
    val df = sparkSession.read.parquet(args.parquetPath())
    val (result, summaryExprs) = summarizer.computeSummaryDf(df)
    val packer = new SummaryPacker(confPath, summaryExprs, summarizer.tileSize, summarizer.sliceColumns)
    val (packed, _) = packer.packSummaryDf(result)
    val uploader = new SummaryUploader(packed, args.api)
    uploader.run()
  } catch {
    case e: Exception =>
-     logger.error(s"Failed to summarize and upload data for conf: $confPath", e)
+     logger.error(
+       s"""Failed to summarize and upload data:
+          |Conf: $confPath
+          |Parquet: ${args.parquetPath()}
+          |Time Column: ${args.timeColumn.getOrElse("default")}""".stripMargin, e)
+     // Consider cleanup of any temporary resources here
      throw e
  }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
online/src/main/java/ai/chronon/online/JavaFetcher.java (0 hunks)
online/src/main/scala/ai/chronon/online/Fetcher.scala (1 hunks)
spark/src/main/scala/ai/chronon/spark/Driver.scala (5 hunks)
spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala (0 hunks)
💤 Files with no reviewable changes (2)
- online/src/main/java/ai/chronon/online/JavaFetcher.java
- spark/src/test/scala/ai/chronon/spark/test/FetchStatsTest.scala
🔇 Additional comments (3)
spark/src/main/scala/ai/chronon/spark/Driver.scala (3)
642-679: LGTM! Well-structured error handling and result formatting.
The changes implement proper error handling, sorted output using TreeMap, and clear logging of results with timing information.
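As a side note for readers unfamiliar with the pattern, here is a tiny self-contained sketch of the TreeMap-based sorted output mentioned above; the map keys and values are made up for illustration and are not the actual Driver.scala results.

import java.util.TreeMap

object SortedReport {
  def main(args: Array[String]): Unit = {
    // Hypothetical timing results; a TreeMap keeps keys sorted, so log output is deterministic.
    val results = new TreeMap[String, Long]()
    results.put("summary_series_fetch_ms", 17L)
    results.put("drift_series_fetch_ms", 42L)

    val it = results.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      println(s"${e.getKey} took ${e.getValue}ms") // iterates in sorted key order
    }
  }
}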
927-930: LGTM! Consistent with the naming changes.
The argument class renaming maintains consistency with the object renaming from CreateStatsTable to CreateSummaryDataset.
968-969: LGTM! Consistent command handling.
The case matches are correctly updated to use the new names while maintaining the consistent pattern of command handling.
// e.g. if we have a range of 1 day and lookBack of 1 week, we will make two queries (8 days ago, 7 days ago) + (1 day ago, today)
// if the range is too large and the lookBack is too small, we will make single query
// e.g. if we have a range of 1 week and lookBack of 1 day, we will make single query - (8 days ago, today)
def lookBack(lookBackMs: Long): (Range, Option[Range]) = {
@piyush-zlai this logic does the lookback scanning
ah thanks for flagging!
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
docs/source/setup/Online_Integration.md (1 hunks)
🔇 Additional comments (1)
docs/source/setup/Online_Integration.md (1)
21-22: LGTM! Parameter rename improves clarity.
The rename from afterTsMillis to startTsMillis better reflects that this is the start of a timestamp range scan. The documentation clearly explains the parameter's purpose and implementation considerations.
// `startTsMillis` implies that this is a range scan of all values with `timestamp` >= to the specified one. This can be implemented efficiently, if `timestamp` can be a secondary key. Some databases have a native version id concept which also can map to timestamp.
case class GetRequest(keyBytes: Array[Byte], dataset: String, startTsMillis: Option[Long] = None)
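For illustration only, a sketch of how a caller might exercise this contract; the key, dataset name, and time window below are placeholders, and only the GetRequest shape comes from the snippet above (its definition is copied here so the sketch compiles standalone).

// Copied from the snippet above so this example is self-contained.
case class GetRequest(keyBytes: Array[Byte], dataset: String, startTsMillis: Option[Long] = None)

object RangeScanExample {
  def main(args: Array[String]): Unit = {
    val keyBytes = "example-entity-key".getBytes("UTF-8") // placeholder key
    val sevenDaysMs = 7L * 24 * 60 * 60 * 1000
    val request = GetRequest(
      keyBytes = keyBytes,
      dataset = "EXAMPLE_SUMMARY_DATASET",                // placeholder dataset name
      startTsMillis = Some(System.currentTimeMillis() - sevenDaysMs)
    )
    // The store is expected to return every value with timestamp >= startTsMillis,
    // ideally via a time-sorted secondary key rather than a full scan.
    println(request)
  }
}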
💡 Codebase verification
Update documentation and code comments to reflect the parameter name change from afterTsMillis to startTsMillis
The verification revealed inconsistencies in parameter naming across the codebase:
- online/src/main/scala/ai/chronon/online/Api.scala: Contains outdated comments referring to afterTsMillis
- cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala: Uses afterTsMillis in comments
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala: Contains outdated reference to afterTsMillis
🔗 Analysis chain
Verify documentation consistency across the codebase.
Since this is a breaking API change, ensure that all related documentation and examples are updated to use startTsMillis.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to afterTsMillis in documentation and examples
# Search for afterTsMillis in markdown files
echo "Checking markdown files for outdated parameter name..."
rg --type md "afterTsMillis"
# Search for afterTsMillis in code examples
echo "Checking code examples for outdated parameter name..."
rg --type scala "afterTsMillis"
Length of output: 891
Summary
Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
Summary by CodeRabbit
New Features
- Introduced a method for converting objects to a pretty-printed JSON string format.
- Added functionality for calculating drift metrics between `TileSummary` instances.
- Enhanced drift analysis capabilities with new metrics and structures.
- New endpoints for model prediction and model drift in the API.
- Introduced utility functions for transforming and aggregating data related to `TileSummary` and `TileDrift`.
- Enhanced metadata handling with new constants and improved dataset references.
- Added a method for processing percentiles and breakpoints to generate interval assignments.
Bug Fixes
- Improved error handling in various methods for better clarity and logging.
Refactor
- Renamed variables and methods for clarity and consistency.
- Updated method signatures to accommodate new features and improve usability.
- Consolidated import statements for better organization.
- Removed deprecated objects and methods to streamline functionality.
Tests
- Added comprehensive unit tests for drift metrics and pivot functionality.
- Enhanced test coverage for new and modified features.
- Removed outdated tests and added new tests for handling key mappings in joins.