
Swap our metrics provider from statsd to otel-metrics + instrument our Fetcher threadpool #657


Closed
wants to merge 655 commits

Conversation

piyush-zlai
Contributor

@piyush-zlai piyush-zlai commented Apr 16, 2025

Summary

PR to swap our metrics reporter from statsd to OpenTelemetry (otel) metrics. We need otel so we can capture metrics at Etsy without the Prometheus statsd-exporter sidecar, which they've occasionally seen issues with. Otel in general is a popular metrics ingestion interface with a number of supported backends (e.g. Prometheus / Datadog / Google Cloud / AWS CloudWatch). Wiring up otel also enables us to set up traces and spans in the repo in the future.
Broad changes:

  • Decouple the bulk of the metrics-reporting logic from Metrics.Context. The metrics reporter we use is pluggable: currently the only implementation is OpenTelemetry, but in principle we can support others in the future (a sketch follows this list).
  • The online module creates the appropriate otel SDK - either the OTLP HTTP provider or the Prometheus HTTP server. We need the HTTP provider to plug into Vert.x, as their Micrometer integration works with that; the Prometheus HTTP server is what Etsy is keen we use.
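
For orientation, the pluggable-reporter shape reduces to a small trait that Metrics.Context delegates to. This is a minimal sketch, not the code in this PR: the method names (count, gauge, distribution) mirror the review's code-graph summary further down, while the exact signatures are assumptions.

```
trait MetricsReporter {
  // Monotonic counter, e.g. kv_store.bigtable.cache.insert
  def count(metric: String, value: Long, tags: Map[String, String]): Unit

  // Point-in-time value, e.g. thread pool queue depth
  def gauge(metric: String, value: Double, tags: Map[String, String]): Unit

  // Distribution / histogram, e.g. kv_store.bigtable.multiGet.latency
  def distribution(metric: String, value: Long, tags: Map[String, String]): Unit
}
```

An otel-backed implementation would map these onto LongCounter / DoubleGauge / LongHistogram instruments (see OtelMetricsReporter.scala in the file list below); other backends can slot in behind the same trait.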

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Tested via a Docker container and a local instance of the OpenTelemetry collector. Start up the fetcher Docker service:

docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json  -p 9000:9000  -e "GCP_PROJECT_ID=canary-443022"  -e "GOOGLE_CLOUD_PROJECT=canary-443022"  -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance"  -e "EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318"  -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json  zipline-fetcher:latest

And then start the otel collector:

./otelcol --config otel-collector-config.yaml
...

We see:

2025-04-18T17:35:37.351-0400	info	ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> service.name: Str(ai.chronon)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.49.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope ai.chronon 3.7.0-M11
Metric #0
Descriptor:
     -> Name: kv_store.bigtable.cache.insert
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Value: 1
Metric #1
Descriptor:
     -> Name: kv_store.bigtable.multiGet.latency
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Count: 1
Sum: 229.000000
Min: 229.000000
Max: 229.000000
ExplicitBounds #0: 0.000000
...
Buckets #0, Count: 0
...

Summary by CodeRabbit

  • New Features

    • Introduced OpenTelemetry-based metrics reporting, replacing previous StatsD integration.
    • Added support for OTLP and Prometheus metrics exporters with configurable endpoints.
    • Enhanced thread pool executors with built-in metrics instrumentation and detailed task timing (see the sketch after this list).
  • Refactor

    • Updated metrics APIs to use structured tags and a pluggable reporter interface.
    • Improved per-dataset metrics scoping and context management.
  • Chores

    • Upgraded and expanded dependencies to include OpenTelemetry, Micrometer OTLP, and Prometheus libraries.
    • Updated build and environment configuration to support new metrics exporters.
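
As a rough illustration of the thread pool instrumentation mentioned above: the usual approach is to override ThreadPoolExecutor's beforeExecute/afterExecute hooks and push elapsed time through the metrics reporter. A minimal sketch, where the reportMillis hook and the metric name are placeholders rather than the names used in this PR:

```
import java.util.concurrent.{BlockingQueue, ThreadPoolExecutor, TimeUnit}

class InstrumentedThreadPool(core: Int,
                             max: Int,
                             queue: BlockingQueue[Runnable],
                             reportMillis: (String, Long) => Unit)
    extends ThreadPoolExecutor(core, max, 60L, TimeUnit.SECONDS, queue) {

  // Per-thread start time; safe because before/afterExecute run on the worker thread.
  private val startNs = new ThreadLocal[java.lang.Long]

  override def beforeExecute(t: Thread, r: Runnable): Unit = {
    super.beforeExecute(t, r)
    startNs.set(System.nanoTime())
  }

  override def afterExecute(r: Runnable, t: Throwable): Unit = {
    reportMillis("threadpool.task.latency", (System.nanoTime() - startNs.get()) / 1000000L)
    super.afterExecute(r, t)
  }
}
```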

tchow-zlai and others added 30 commits February 13, 2025 23:27
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Introduced a new query to retrieve purchase records with date range
filtering.
- Enhanced data retrieval by including additional contextual metadata
for improved insights.

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit


- **New Features**
- Introduced dedicated testing workflows covering multiple system
components to enhance overall reliability.
- Added new test suites for various components to enhance testing
granularity.
- **Refactor**
- Streamlined code organization with improved package structures and
consolidated imports across test modules.
- **Chores**
- Upgraded automated testing configurations with optimized resource
settings for improved performance and stability.


---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Chores**
- Adjusted the test execution timeout setting from a longer duration to
900 seconds to ensure tests complete more promptly.


Co-authored-by: Thomas Chow <[email protected]>
…nd not pushed to remote (#385)

## Summary
I've been seeing that it's difficult to track which changes went into the
artifacts we push to etsy and canary, especially when tracking performance
regressions for Spark jobs from one day to the next.

Adding a check to disallow pushes of any customer artifacts if the branch
is dirty. All changes need to at least be pushed to remote.

Also adding a metadata tag with the commit and branch.


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **New Features**
- Introduced consistency checks during the build and upload process to
verify that local changes are committed and branches are in sync.
- Enhanced artifact metadata now includes additional context about the
code state at the time of upload.

… top (#380)

## Summary
While trying to read the updated beacon top topic we hit issues because the
number of Avro fields exceeds Spark's whole-stage codegen default limit of
100. As a result the generated whole-stage codegen code is incorrect and we
either end up with segfaults (unit tests) or garbled events (prod Flink
jobs). This PR bumps the limit to let us read beacon top (374 fields) and
adds an assert in CatalystUtil's whole-stage codegen path to fail fast if
we ever hit an even higher field count than our bumped limit in the future.
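
For reference, the knob in question is presumably `spark.sql.codegen.maxFields` (default 100), which caps the number of fields whole-stage codegen will handle; a sketch of bumping it at session build time:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("wide-avro-example")
  .master("local[*]")
  // default is 100; beacon top has 374 fields
  .config("spark.sql.codegen.maxFields", "1000")
  .getOrCreate()
```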

## Checklist
- [X] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Enhanced data processing robustness with improved handling and early
error detection for large schemas.
  - Refined SQL query formatting for clearer logical conditions.

- **Tests**
  - Added a new validation for large schema deserialization.
  - Updated test definitions to improve structure and readability.
## Summary

- Make the thrift gen python executable, use `py_binary` to support
python generally

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **Chores**
- Enhanced the build process for a key automation tool by streamlining
its execution and command handling, leading to improved overall build
reliability and performance.
- Transitioned the export mechanism of a Python script to a defined
executable binary target for better integration within the build system.

Co-authored-by: Thomas Chow <[email protected]>
## Summary
- Release Notes:
https://spark.apache.org/releases/spark-release-3-5-4.html
- https://issues.apache.org/jira/browse/SPARK-49791 is a good one for
us.
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Chores**
- Upgraded underlying Apache Spark libraries to version 3.5.4,
delivering enhanced performance, stability, and compatibility. This
update improves processing efficiency and backend reliability, ensuring
smoother and more secure data operations. End-users may notice more
robust and responsive interactions as a result of these improvements,
further enhancing overall system performance.


Co-authored-by: Thomas Chow <[email protected]>
## Summary

- Even though I'm eager to get ahead here, let's not go too crazy and
accidentally shoot ourselves in the foot. Let's stay pinned to what our
clusters have (3.5.1) until those upgrade.



## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Chores**
- Updated core Spark libraries—impacting SQL, Hive, Streaming, and Avro
features—to version 3.5.1 to ensure enhanced stability and improved
integration across Spark-powered functionalities.

## Summary
Grant and I were chatting about the high number of hosts needed for the
beacon top Flink jobs (24). This is because the topic parallelism is 96 and
we squeeze 4 slots per TaskManager (so 96 / 4 = 24 hosts). Given that folks
often over-provision Kafka topics in terms of partitions, we're going with
a default of scaling parallelism down by 1/4. Will look into wiring up
Flink autoscaling as a follow-up so this isn't hardcoded.
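
Back-of-envelope version of that sizing (sketch; names assumed):

```
// Scale Flink job parallelism down from the topic's partition count.
def flinkParallelism(topicPartitions: Int, scaleFactor: Double = 0.25): Int =
  math.max(1, (topicPartitions * scaleFactor).toInt)

// 96 partitions -> parallelism 24; at 4 slots per TaskManager that's 6 hosts
// instead of the 24 needed at parallelism 96.
```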

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **Refactor**
- Optimized stream processing by refining the parallelism calculation.
The system now applies a scaling factor to better adjust the number of
active processing units, which may result in improved efficiency under
certain conditions.

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Added interactive transformation controls for chart zooming,
resetting, and scroll modes.
- Introduced a detailed join configuration view and enhanced node graph
visualization with tooltips and collapsible details.
- Implemented a new component for displaying configuration properties
related to joins in a tabular format.
- Added new dependencies to enhance functionality and updated existing
ones for better performance.
- Introduced a new method for retrieving edges in the node graph and
improved lineage data handling.
  - Added a new `StatusBar` component for job status representation.
- Introduced sample data structures for lineage and job tracking
responses.
- Implemented a new function for managing expandable row states in job
tracking tables.
- **UI Improvements**
  - Refined chart layouts with updated padding and legend styling.
  - Improved navigation headers and tabs for clearer page organization.
  - Enhanced visual feedback for missing drift or distribution data.
- Allowed interaction with previously disabled elements for improved
user experience.
- **Bug Fixes**
  - Improved error handling for data fetching and display logic.
- **Chores**
- Upgraded underlying dependencies and animations for smoother
performance.
- Removed obsolete files and functions related to job tracking data
management.

---------

Co-authored-by: Sean Lynch <[email protected]>
Co-authored-by: Ken Morton <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **Documentation**
- Clarified command instructions and added informational notes to set
expectations during initial builds.

- **New Features**
- Introduced new build options for modular construction of components,
including dedicated commands for hub and cloud modules.
  - Added an automated script to streamline the frontend build process.

- **Chores**
- Updated container setup and startup processes to utilize revised
deployment artifacts.

## Summary
- trim down tableutils
- add iceberg runtime dependency to cloud_gcp
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
  - Added a runtime dependency to enhance Spark processing.
  - Introduced a consolidated method for computing partition ranges.

- **Refactor**
- Streamlined import sections and simplified join analysis by removing
redundant permission checks.
  
- **Bug Fixes**
- Removed methods related to table permission checks, impacting access
control functionality.

- **Tests**
  - Removed an outdated test for table permission verification.
  
- **Chores**
- Updated the project’s dependency configuration to include the new
Spark runtime artifact.

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Changed the backend code to compute only 3 percentiles (p5, p50, p95)
to return to the frontend.
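
A sketch of what the percentile filtering might look like (illustrative only; the actual method names and the percentile grid are assumptions):

```
// Map requested percentile labels onto indices of the precomputed array.
val grid = (0 to 100 by 5).map(p => s"p$p") // assumed 5%-step grid

def filterPercentiles(values: Array[Double], requested: Seq[String]): Seq[Double] =
  requested.flatMap { label =>
    val idx = grid.indexOf(label)
    if (idx >= 0 && idx < values.length) Some(values(idx)) else None
  }

// filterPercentiles(summary, Seq("p5", "p50", "p95"))
```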

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Bug Fixes**
- Enhanced statistical data processing to consistently handle cases with
missing values by using a robust placeholder, ensuring clearer
downstream analytics.
- Adjusted the percentile chart configuration so that the 95th, 50th,
and 5th percentiles are accurately rendered, providing more reliable
insights for users.
- Relaxed the null ratio validation in summary data, allowing for a
broader acceptance of null values, which may affect drift metric
interpretations.

- **New Features**
- Introduced methods for converting percentile strings to index values
and filtering percentiles based on user-defined requests, improving data
handling and representation.
## Summary
Changes to support builds/tests with both the Scala 2.12 and 2.13 versions.
By default we build against 2.12; pass the "--config scala_2.13" option to
"bazel build/test" to override it.

ScalaFmt seems to be broken for 2.13 with the bazel rules_scala package. A
[fix](bazel-contrib/rules_scala#1631) has already landed, but no release
containing it is available yet, so ScalaFmt checks are temporarily disabled
for 2.13; we'll re-enable them once the fix is released.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit


- **New Features**
- Enabled flexible Scala version selection (2.12 and 2.13) for smoother
builds and enhanced compatibility.
- Introduced a default Scala version constant and a repository rule for
improved version management.
- Added support for additional Scala 2.13 dependencies in the build
configuration.

- **Refactor and Improvements**
- Streamlined build and dependency management for increased stability
and performance.
- Consolidated collection conversion utilities to boost reliability in
tests and runtime processing.
- Enhanced type safety and clarity in collection handling across various
modules.
- Improved handling of Scala collections and maps throughout the
codebase for better type consistency and safety.
- Updated method implementations to ensure explicit type conversions,
enhancing clarity and preventing runtime errors.
- Modified method signatures and internal logic to utilize `Seq` for
improved type clarity and consistency.
- Enhanced the `maven_artifact` function to accept an optional version
parameter for better dependency management.

…velte (#399)

## Summary

Updated all dependencies to latest (ignoring Tailwind and bits-ui for now)
so we don't lag too far behind. The LayerChart bump also fixes a BarChart
tooltip
[issue](https://github.com/techniq/layerchart/releases/tag/layerchart%400.99.2).

Ran `npm run check`, `npm run lint`, and `npm run test:unit` locally
(although CI will validate). Also kicked the app around locally to spot
any regressions.


![image](https://github.com/user-attachments/assets/2df26fff-e00c-4e3f-a1ea-a52a75588b2e)

![image](https://github.com/user-attachments/assets/16b01322-1f4e-4e15-a868-9b6227bf0d10)

![image](https://github.com/user-attachments/assets/7f51a50f-ca74-4a81-93b6-e8710b90e511)

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **Chores**
- Updated multiple dependencies and development tools to their latest
versions for improved performance and compatibility.
- **Refactor**
- Enhanced UI components by enforcing stricter type definitions and
consistency.
- Improved the presentation and configuration of the `Inspect` component
in the overview page.

These updates streamline the build process and ensure a more reliable,
consistent user interface without altering its overall functionality.

---------

Co-authored-by: Sean Lynch <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Chores**
- Updated automated testing triggers to run for changes on any branch,
ensuring broader test coverage for frontend updates.


Co-authored-by: Sean Lynch <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Introduced new methods for retrieving lineage information for group
by, model, and staging query configurations.
- Enhanced configuration detail pages with improved navigation and
interactive elements.
- Added a new asynchronous function for handling data fetching based on
configuration parameters.
- Implemented a tooltip feature in the DateRangeSelector component for
better user guidance.
- Updated the PageHeader and TabsTrigger components to enhance routing
and display context.
	- Added URLs for various logical types to improve navigation.
- **Refactor**
- Standardized lineage conversion processes and updated data references
across views.
- Improved reactive behavior and routing for configuration and
observability pages.
- Simplified layout structures for better scrolling behavior and UI
clarity.
- **Chores**
- Removed deprecated endpoints, components, and legacy pages to
streamline the application.
---
- To see the specific tasks where the Asana app for GitHub is being
used, see below:
  - https://app.asana.com/0/0/1209162816962776
  - https://app.asana.com/0/0/1209438100557823


---------

Co-authored-by: Sean Lynch <[email protected]>
## Summary

- #381 introduced the ability
to configure a partition column at the node-level. This PR simply fixes
a missed spot on the plumbing of the new StagingQuery attribute.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Enhanced the query builder to support specifying a partition column,
providing greater customization for query formation and partitioning.
- **Improvements**
- Improved handling of partition columns by introducing a fallback
mechanism to ensure valid values are used when necessary.

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Adds CI checks to make sure we are able to build and test all
modules on both the Scala 2.12 and 2.13 versions.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Chores**
- Updated automated testing workflows to support Scala 2.12 and added
new workflows for Scala 2.13, ensuring consistent testing for both Spark
and non-Spark modules.

- **Documentation**
- Enhanced build instructions with updated commands for creating Uber
Jars and new automation shortcuts to streamline code formatting,
committing, and pushing changes.
## Summary
Added pinning support for both our maven and spark repositories so we
don't have to resolve them during builds.

Going forward, whenever we update the artifacts in either the maven or
spark repositories, we need to re-pin the changed repos using the
following commands and check in the updated JSON files.

```
REPIN=1 bazel run @maven//:pin
REPIN=1 bazel run @spark//:pin
```

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Integrated enhanced repository management for Maven and Spark,
providing improved dependency installation.
- Added support for JSON configuration files for Maven and Spark
installations.

- **Chores**
- Updated documentation to include instructions on pinning Maven
artifacts and managing dependency versions effectively.
A VSCode plugin for feature authoring that detects errors and uses data
sampling to speed up the iteration cycle. The goal is to cut down on
memorizing commands, typing/clicking, and waiting for clusters to spin up
and jobs to finish.

In this example, we have a complex expression operating on nested data.
The eval button appears above Chronon types.

When you click on the Eval button, it samples your data, runs your code
and shows errors or transformed result within seconds.



![zipline_vscode_plugin](https://github.com/user-attachments/assets/5ac56764-f6e7-4998-b5aa-1f4cabde42f9)


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested (see above)
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Introduced a new Visual Studio Code extension that enhances Python
development.
- The extension displays an evaluation button alongside specific
assignment statements in Python files, allowing users to trigger
evaluation commands directly in the terminal.
- Added a command to execute evaluation actions related to Zipline AI
configurations.
  
- **Documentation**
  - Added a new LICENSE file containing the MIT License text.
  
- **Configuration**
- Introduced new configuration files for TypeScript and Webpack to
support the extension's development and build processes.
  
- **Exclusions**
- Updated `.gitignore` and added `.vscodeignore` to streamline version
control and packaging processes.
## Summary

Moved scala dependencies to separate scala_2_12 and scala_2_13
repositories so we can load the right repo based on config instead of
loading both.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Chores**
- Upgraded Scala dependencies to newer versions with updated
verification, ensuring improved stability.
- Removed outdated package references to streamline dependency
management.
- Introduced new repository configurations for Scala 2.12 and 2.13 to
enhance dependency management.
- Added `.gitignore` entry to exclude `node_modules` in the
`authoring/vscode` path.
  - Created `LICENSE` file with MIT License text for the new extension.
  
- **New Features**
- Introduced a Visual Studio Code extension with a CodeLens provider for
Python files, allowing users to evaluate variables directly in the
editor.

- **Refactor**
- Updated dependency declarations to utilize a new method for handling
Scala artifacts, improving consistency across the project.

---------

Co-authored-by: Nikhil Simha <[email protected]>
## Summary

Per the conversation on
[Slack](https://zipline-2kh4520.slack.com/archives/C0880ECQ0EN/p1739982805433429).
We don't currently have an example where a `GroupBy` takes in both a
streaming and a batch source (most have a single source, and the few that
don't are batch-only), but the `.some()`/any logic should be
[right](https://zipline-2kh4520.slack.com/archives/C0880ECQ0EN/p1739830806387109)

Conf | Before | After
--- | --- | ---
Join |
![image](https://github.com/user-attachments/assets/0f989d85-f9a5-4612-9355-1c4f8efd5d7d)
|
![image](https://github.com/user-attachments/assets/dc0d9613-28b4-4156-921e-24210b52d6d3)
GroupBy (no change) |
![image](https://github.com/user-attachments/assets/eb2727e3-a3ed-4ef5-8faf-3b594f7dc9d5)
|
![image](https://github.com/user-attachments/assets/957812d4-da96-42fe-b361-7804476f416c)


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




---
- To see the specific tasks where the Asana app for GitHub is being
used, see below:
  - https://app.asana.com/0/0/1209446053233127


## Summary by CodeRabbit

- **Refactor**
- Improved the application’s handling of streaming content for enhanced
reliability and performance. The updates streamline the evaluation of
live data and combined streams, contributing to a more consistent and
responsive user experience.


Co-authored-by: Sean Lynch <[email protected]>
## Summary

Not sure of the root cause, but `queryParameters()` within the
drift/distributions route is not reactive when client-side routed
(clicking on the Observability tab or clicking between Drift and
Distributions), though it works if you refresh the page. It is reactive
within `ActionsButtons` and generally everywhere else.

At some point I'll work up a minimal repro and report it on
https://github.com/paoloricciuti/sveltekit-search-params, but this
approach is both simpler and resolves the issue (and still uses the same
param config).

My initial guess is it might be due to subscribing to the same params
with 2 different instances on the same page, but not positive without
investigating (and possibly not related).

[Asana issue](https://app.asana.com/0/1208932365725930/1209459569820188)


https://github.com/user-attachments/assets/3b825bae-484b-4dba-8c35-0e84b0b65ce0

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




---
- To see the specific tasks where the Asana app for GitHub is being
used, see below:
  - https://app.asana.com/0/0/1209459569820188

Co-authored-by: Sean Lynch <[email protected]>
## Summary

Before | After
--- | ---

![image](https://github.com/user-attachments/assets/6310f7ec-c606-463a-89f9-76d1c719b659)
|
![image](https://github.com/user-attachments/assets/a77acf52-195d-4e02-b6e2-164361540ff0)


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
---
- To see the specific tasks where the Asana app for GitHub is being
used, see below:
  - https://app.asana.com/0/0/1209438100557826

## Summary by CodeRabbit

- **New Features**
- Enhanced the configuration overview by introducing intuitive,
color-coded icons. Now, the statuses for online and production
properties are displayed with clear visual indicators, offering
immediate feedback on their state.
- Updated the visual representation of the "False" state in the badge
component for consistency and improved clarity.

---------

Co-authored-by: Sean Lynch <[email protected]>
## Summary
Adds AWS build and push commands to the distribution script.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
  - Introduced an automated quickstart process for GCP deployments.
- Enhanced the build and upload tool with flexible command-line options,
supporting artifact creation for both AWS and GCP environments.
  - Added a new script for running the Zipline quickstart on GCP.

- **Refactor**
  - Updated the AWS quickstart process to ensure consistent execution.
…FilePath and replacing `/` to `.` in MetaData names (#398)

## Summary

^^^

Tested on the etsy laptop.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Bug Fixes**
- Improved error handling to explicitly report when configuration values
are missing.
- **New Features**
- Introduced standardized constants for various configuration types,
ensuring consistent key naming.
- **Refactor**
- Unified metadata processing by using direct metadata names instead of
file paths.
- Enhanced type safety in configuration options for clearer and more
reliable behavior.
- **Tests**
- Updated test cases and parameters to reflect the improved metadata and
configuration handling.
Reverts #373

Passing in options to push to only one customer is broken.


## Summary by CodeRabbit

- **Refactor**
- Streamlined the deployment process to automatically build and upload
artifacts exclusively to Google Cloud Platform.
- Removed configuration options and handling for an alternative cloud
provider, resulting in a simpler, more focused workflow.

## Summary
Building the join output schema should belong to the metadata store - this
also reduces the size of the fetcher.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Introduced an optimized caching mechanism for data join operations,
resulting in improved performance and reliability.
- Added new methods to facilitate the creation and management of join
codecs.
  
- **Bug Fixes**
- Enhanced error handling for join codec operations, ensuring clearer
context for failures.
  
- **Documentation**
- Improved code readability and clarity through updated comments and
method signatures.
david-zlai and others added 4 commits April 18, 2025 08:42
## Summary
Skeleton implementation for the Agent service, with job polling and status
reporting happening at regular, configurable intervals. There's a lot of
placeholder code that is going to change substantially in coming PRs, so
tests are deferred for now; will add them in those PRs.
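
As a sketch of the polling shape described (class name, interval, and hooks are assumed, not the PR's actual code):

```
import io.vertx.core.{AbstractVerticle, Promise}

// Polls for jobs on a configurable interval; status reporting follows the same pattern.
class JobPollingVerticle(pollIntervalMs: Long, poll: () => Unit) extends AbstractVerticle {
  override def start(startPromise: Promise[Void]): Unit = {
    vertx.setPeriodic(pollIntervalMs, (_: java.lang.Long) => poll())
    startPromise.complete()
  }
}
```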

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Introduced configuration management for agent components, including
polling intervals and service endpoints.
- Added abstractions and in-memory implementations for job execution and
key-value storage services.
- Implemented periodic job polling and status reporting handlers for
orchestration agents.
- Added Vert.x verticles to manage job polling and status reporting
workflows.

- **Improvements**
  - Enhanced job struct to use a detailed job info field.
- Expanded job status types with an explicit "unknown" state and updated
status codes.

- **Dependency Updates**
  - Included additional Vert.x dependencies for URI template support.
  - Packaged resource files into the orchestration binary.

- **Bug Fixes**
- Updated job handling logic and tests to reflect changes in job ID
access patterns.

- **Tests**
  - Adjusted tests to accommodate the new job info structure.
## Summary

- We need to bring back the v1 version of the DataSource for the Spark
BigQuery connector, since it supports partition pushdown and alternative
project_ids. The catalog version of the Spark BigQuery connector does not
support those.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Enhanced table reading capabilities with support for applying
partition filters and combining multiple predicates for more flexible
data queries.
- **Refactor**
- Improved internal handling of predicate filters and table loading
logic for more consistent and maintainable data access.
- Refined data filtering by explicitly incorporating partition column
information for more precise queries.
- **Chores**
- Updated script to ensure temporary files are cleaned up more reliably
during installation processes.

---------

Co-authored-by: Thomas Chow <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (2)

27-33: Update class comment.

Comment on lines 28-30 still refers to StatsD, should be updated to reflect OpenTelemetry usage.


92-94: Verify duration format.

Ensure exportInterval is in ISO-8601 format (e.g., "PT15S") for Duration.parse().
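
For context, Duration.parse only accepts ISO-8601 strings:

```
import java.time.Duration

Duration.parse("PT15S") // 15 seconds — OK
Duration.parse("PT5M")  // 5 minutes — OK
// Duration.parse("15s") // throws java.time.format.DateTimeParseException
```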

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (1)

155-159: Repeated copy(...) per call is GC‑heavy

Creating a new Context for every dataset on every call allocates objects and populates the tag cache again.
Consider caching only once per dataset and re‑using the same instance.
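
A sketch of the suggested reuse, building on the tableToContext map shown below (the copy field name is an assumption):

```
// Build the per-dataset Context once and reuse it, instead of copy(...) per call.
private def contextFor(dataset: String): Metrics.Context =
  tableToContext.getOrElseUpdate(dataset, metricsContext.copy(dataset = dataset))
```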

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 289e64f and 1e571bd.

📒 Files selected for processing (15)
  • cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2 hunks)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (18 hunks)
  • docker/fetcher/start.sh (1 hunks)
  • online/src/main/java/ai/chronon/online/JavaFetcher.java (7 hunks)
  • online/src/main/scala/ai/chronon/online/Api.scala (3 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/FetchContext.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala (9 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/FetcherCache.scala (3 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (6 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/MetricsReporter.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (1 hunks)
  • online/src/test/scala/ai/chronon/online/test/TagsTest.scala (2 hunks)
  • service/src/main/java/ai/chronon/service/FetcherVerticle.java (3 hunks)
  • service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (3 hunks)
  • service_commons/src/main/java/ai/chronon/service/OpenTelemetryProvider.java (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • online/src/main/scala/ai/chronon/online/fetcher/FetchContext.scala
  • online/src/test/scala/ai/chronon/online/test/TagsTest.scala
  • service/src/main/java/ai/chronon/service/FetcherVerticle.java
  • cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala
  • online/src/main/scala/ai/chronon/online/fetcher/FetcherCache.scala
  • online/src/main/scala/ai/chronon/online/Api.scala
  • online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala
  • online/src/main/java/ai/chronon/online/JavaFetcher.java
🧰 Additional context used
🧬 Code Graph Analysis (1)
online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (1)
online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (9)
  • Metrics (25-219)
  • Context (89-132)
  • Context (134-218)
  • toTags (158-183)
  • count (213-213)
  • gauge (215-215)
  • gauge (217-217)
  • distribution (207-210)
  • distribution (210-213)
⏰ Context from checks skipped due to timeout of 90000ms (35)
  • GitHub Check: service_commons_tests
  • GitHub Check: service_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: online_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: streaming_tests
  • GitHub Check: hub_tests
  • GitHub Check: service_tests
  • GitHub Check: join_tests
  • GitHub Check: streaming_tests
  • GitHub Check: flink_tests
  • GitHub Check: online_tests
  • GitHub Check: spark_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: groupby_tests
  • GitHub Check: join_tests
  • GitHub Check: hub_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: groupby_tests
  • GitHub Check: spark_tests
  • GitHub Check: flink_tests
  • GitHub Check: api_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: batch_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: orchestration_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: api_tests
  • GitHub Check: batch_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: orchestration_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (6)
service_commons/src/main/java/ai/chronon/service/OpenTelemetryProvider.java (1)

11-21: Implementation looks clean.

Thread-safe singleton design with Optional for safe instance access.

service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (1)

65-107: Looks good, but consider caching the OpenTelemetry instance.

The implementation correctly sets up OpenTelemetry for metrics reporting.

docker/fetcher/start.sh (1)

21-22: Log message update is appropriate.

Script correctly checks for OTLP endpoint instead of StatsD host.

online/src/main/scala/ai/chronon/online/metrics/MetricsReporter.scala (2)

5-10: Clean provider interface.

Generic type parameter enables flexibility for different client implementations.


12-24: Well-designed reporter trait.

Interface supports all common metric types with consistent method signatures.

online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (1)

62-66: DoubleGauge build call may not compile on latest OTel

meter.gaugeBuilder(metric).build() returns an observable builder that expects a callback, not a direct set.
Verify against the OTel API in build.sbt; you may need ofDoubles().buildWithCallback(...) or switch to DoubleUpDownCounter.
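
The callback-style registration the comment alludes to looks roughly like this (sketch; assumes a Meter from the OTel Java API is in scope):

```
import io.opentelemetry.api.metrics.Meter

// Observable gauges are polled at export time via a callback, rather than set directly.
def registerGauge(meter: Meter, name: String, read: () => Double): Unit =
  meter.gaugeBuilder(name).buildWithCallback(m => m.record(read()))
```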

Comment on lines 52 to 53
OtlpConfig otlpConfig = key -> {
switch (key) {
case "otlp.url":
return exporterUrl;
case "otlp.step":
return exportInterval;
case "otlp.resourceAttributes":
return "service.name=" + serviceName;
default:
return null;
}
};

💡 Verification agent

🧩 Analysis chain

Verify Micrometer OTLP config keys.

Modern Micrometer versions (≥1.12) may expect different configuration keys. Consider confirming:

-                case "otlp.url":
+                case "otlp.metrics.export.url":

🌐 Web query:

What configuration keys does OtlpMeterRegistry use in Micrometer version 1.12+?

💡 Result:

Configuration Keys for OtlpMeterRegistry in Micrometer 1.12+

OtlpMeterRegistry uses a set of configuration keys to control behaviors such as endpoint selection, timeouts, resource attributes, and other telemetry settings. These keys are provided to the registry via the OtlpConfig interface, which supports property-based configuration—commonly via Spring Boot properties with the prefix management.otlp.metrics.export[1].

Common Configuration Keys

Below are the primary configuration keys available for OtlpMeterRegistry in Micrometer 1.12 and newer:

| Key | Purpose | Default/Example Value |
| --- | --- | --- |
| enabled | Whether OTLP export is enabled | true |
| endpoint | The OTLP metrics endpoint (URL) | (none, must be set) |
| step | Frequency to report metrics (duration) | PT1M (1 minute) |
| compression | Compression type for payload (e.g., gzip) | none |
| headers | Additional headers for requests (key-value pairs) | (none) |
| aggregation-temporality | Preferred temporality for cumulative or delta data | cumulative or delta |
| resource-attributes | Resource attributes attached to all metrics (comma-separated key-value pairs) | See below |
| timeout | Request timeout (duration) | 10s |
| base-time-unit | Base time unit for timers | milliseconds |

Resource Attributes:

  • The following attributes are set by default:
    • telemetry.sdk.name: io.micrometer
    • telemetry.sdk.language: java
    • telemetry.sdk.version: Micrometer version (e.g., 1.12.0)
    • service.name: unknown_service (overridable by OTEL_SERVICE_NAME environment variable)
  • Custom resource attributes can be set using the resource-attributes key, or, if empty, loaded from the OTEL_RESOURCE_ATTRIBUTES environment variable[1].

Example: Spring Boot Property Mapping

When using Spring Boot, you can set these in your application.properties or application.yaml:

management.otlp.metrics.export.enabled=true
management.otlp.metrics.export.endpoint=https://otel-collector:4318/v1/metrics
management.otlp.metrics.export.step=30s
management.otlp.metrics.export.headers=Authorization=Bearer <token>
management.otlp.metrics.export.aggregation-temporality=delta
management.otlp.metrics.export.resource-attributes=service.name=my-service,env=prod
management.otlp.metrics.export.timeout=5s

Notes

  • The full set of supported keys is defined by the OtlpConfig interface. When implementing or overriding its get(String key) method, properties are fetched by their keys, typically with the prefix management.otlp.metrics.export when using Spring Boot’s auto-configuration[1][8].
  • Environmental variables such as OTEL_RESOURCE_ATTRIBUTES and OTEL_SERVICE_NAME can influence resource attributes and take precedence over some configuration keys[1].
  • The list above covers the most commonly used and documented keys as of Micrometer 1.12+; newer versions may introduce additional or changed keys.

For the latest list and advanced options, consult the official Micrometer OTLP documentation[1].



Use correct Micrometer OTLP config keys

Micrometer 1.12+ expects the keys endpoint, step and resource-attributes, not otlp.*.

• File: service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (lines 52–63)

Suggested diff:

-                case "otlp.url":
+                case "endpoint":
                     return exporterUrl;
-                case "otlp.step":
+                case "step":
                     return exportInterval;
-                case "otlp.resourceAttributes":
-                    return "service.name=" + serviceName;
+                case "resource-attributes":
+                    return "service.name=" + serviceName;

Comment on lines 93 to 95
protected val metricsContext: Metrics.Context =
Metrics.Context(Metrics.Environment.KVStore, client = metricsClient).withSuffix("bigtable")

protected val tableToContext = new TrieMap[String, Metrics.Context]()

@coderabbitai coderabbitai bot Apr 18, 2025

🛠️ Refactor suggestion

Unbounded tableToContext risks memory leak

TrieMap keeps a distinct Context per BigTable id forever. If the service touches many ephemeral tables, the map will grow without limit.
Add TTL/LRU eviction or at least expose a cleanup hook.

Contributor Author

we are capped here as we have only 4 tables to contend with


@piyush-zlai Understood! Since you're only working with 4 tables total, the memory usage for tableToContext will remain bounded and small. That's a perfectly reasonable approach given your specific use case.


✏️ Learnings added
Learnt from: piyush-zlai
PR: zipline-ai/chronon#657
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala:93-97
Timestamp: 2025-04-21T15:10:40.802Z
Learning: The BigTableKVStoreImpl in the chronon codebase only interacts with 4 BigTable tables total, so unbounded caching in tableToContext is not a concern.


Comment on lines 7 to 15
import org.json4s.BuildInfo

@coderabbitai coderabbitai bot Apr 18, 2025

⚠️ Potential issue

Wrong BuildInfo reference

org.json4s.BuildInfo.version is the json4s library version, not Chronon’s.
Use your project’s generated ai.chronon.BuildInfo (or similar) to avoid misleading instrumentation metadata.

Contributor Author

removed this for now. Will circle back and add a version later once we wire this into our bazel setup


@piyush-zlai That's a sensible approach. Removing the potentially incorrect BuildInfo reference now and adding the correct version information later when you have proper Bazel integration will avoid any misleading instrumentation metadata. Looking forward to seeing the final implementation when you circle back to it.

Comment on lines 31 to 39
private val counters = new TrieMap[String, LongCounter]()
private val longGauges = new TrieMap[String, LongGauge]()
private val doubleGauges = new TrieMap[String, DoubleGauge]()
private val histograms = new TrieMap[String, LongHistogram]()

@coderabbitai coderabbitai bot Apr 18, 2025

🛠️ Refactor suggestion

Instrument caches lack eviction

counters, gauges, and histograms grow with every new metric name.
If metric names are dynamic (e.g., per‑dataset), memory will balloon.
Consider bounded caches or pre‑register a fixed set of names.

Contributor Author

the names are fixed based on what we've explicitly configured in the code.


@piyush-zlai Got it, thanks for clarifying. Since the metric names are fixed and explicitly configured in the code rather than dynamically generated, the TrieMaps won't grow unbounded and the current implementation without eviction is appropriate.


✏️ Learnings added
Learnt from: piyush-zlai
PR: zipline-ai/chronon#657
File: online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala:30-34
Timestamp: 2025-04-21T15:12:30.670Z
Learning: In the OtelMetricsReporter class, the instrument caches (counters, gauges, histograms) don't need eviction mechanisms because metric names are fixed and explicitly configured in the code, not dynamically generated.


Comment on lines 170 to 192
joinNames.foreach(addTag(Tag.Join, _))

val groupByName = Option(groupBy).map(_.sanitize)
groupByName.foreach(addTag(Tag.GroupBy, _))


🛠️ Refactor suggestion

Only last join tag survives

addTag(Tag.Join, _) inside a loop overwrites previous values; multi‑join contexts lose information.
Aggregating (joinNames.mkString(",")) before a single addTag preserves all joins while keeping one key.
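
i.e., roughly:

```
// Sketch of the suggested fix: one Tag.Join carrying all join names.
if (joinNames.nonEmpty) addTag(Tag.Join, joinNames.mkString(","))
```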

david-zlai and others added 3 commits April 19, 2025 20:01
…et run after click.option envvar (#675)

^^^

setting the `artifact_prefix` with the env var after the fact


## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary

- Need the full table name here. 
- Also test the filter predicate 

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Improved handling of source table names used in join operations,
ensuring more accurate or fully qualified table references.
- Adjusted partition filter handling for BigQuery integration to prevent
unwanted string modifications.

- **Tests**
- Added a new (currently ignored) test for verifying BigQuery partition
pushdown functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
^^^

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new "CHECK-PARTITIONS" validation step in the GCP
quickstart script to verify data partitions before upload.

- **Chores**
- Improved internal handling of input arguments within the GCP runner
for future use.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@piyush-zlai piyush-zlai changed the title [draft] Swap our metrics provider from statsd to otel-metrics Swap our metrics provider from statsd to otel-metrics Apr 21, 2025
varant-zlai and others added 2 commits April 21, 2025 09:20
## Summary

Adding logic to dedupe in case we got multiple dependencies for a given
node.

Note: untested.
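
A hedged sketch of what order-preserving dedupe can look like; the `Dependency` shape here is hypothetical, not the repo's actual type:

```scala
case class Dependency(name: String, spec: String)

// Seq#distinct preserves first-occurrence order, which keeps the rendered
// metadata stable while dropping repeated entries for a node.
def dedupeDeps(deps: Seq[Dependency]): Seq[Dependency] = deps.distinct

val deps = Seq(Dependency("a", "daily"), Dependency("b", "hourly"), Dependency("a", "daily"))
assert(dedupeDeps(deps) == Seq(Dependency("a", "daily"), Dependency("b", "hourly")))
```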

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Improved handling of Airflow dependencies by ensuring duplicate
entries are removed from dependency lists displayed in metadata.
- Streamlined script execution by removing temporary output files and
job ID extraction for cleaner command runs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
Co-authored-by: David Han <[email protected]>
## Summary

The job does not support multi-window and multiple labels

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Removed internal validation checks related to label join parts and
aggregations, simplifying the label parts handling process. No changes
to the user interface or exported functions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: ezvz <[email protected]>
## Summary

- Creates a `catalog` module to replace the `format` one we had. 
- Add a new bazel lib target for the `catalog` module
- Move TableUtils there; it should live alongside the format provider stuff.
- Rename imports
- A follow-up is to remove the dependency on the main `spark` lib. This
requires decoupling from `Extensions`. Then we can drastically slim down
the deps in cloud_gcp.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":""}
```
-->

Co-authored-by: Thomas Chow <[email protected]>
## Summary
To get us ready to support production streaming workloads on Flink and
Dataproc, we need metrics published.
[OpsAgent](https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent)
appears to be the paved path for reporting metrics from GCloud hosts, and
OpsAgent supports some of [Flink's metric
reporters](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/metric_reporters/).
Ended up going with Prometheus, as it is supported in our current Flink
version and OpsAgent has decent support for it. A config sketch follows.
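
A sketch of the wiring, with assumed config keys and port range (on Dataproc the same keys would typically live in flink-conf.yaml rather than be set programmatically):

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

// Enable Flink's bundled Prometheus reporter so OpsAgent can scrape the exposed port.
val conf = new Configuration()
conf.setString("metrics.reporter.prom.factory.class",
  "org.apache.flink.metrics.prometheus.PrometheusReporterFactory")
conf.setString("metrics.reporter.prom.port", "9250-9260") // assumed port range

// Local environment for illustration; a cluster picks these keys up from flink-conf.yaml.
val env = StreamExecutionEnvironment.createLocalEnvironment(1, conf)
```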

This PR can go independently of the infra PR that does the metrics
scraping - zipline-ai/infrastructure#46

Tested manually and was able to confirm that metrics are making their
way to cloud monitoring -
![Screenshot 2025-04-21 at 3 33 55 PM](https://github.com/user-attachments/assets/463c491f-5d32-4ef1-8a7d-62be093e7e93)

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Enabled Prometheus metrics reporting for Flink jobs, allowing improved
monitoring and observability.
- **Chores**
- Added Prometheus metrics library as a dependency to the build
configuration.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Queueing up a change to bump our BT gax and fetcher thread pools back up
now that we've cut some redundancies & David's been able to clear out
some TPs in MMX.

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Performance Improvements**
- Increased thread pool sizes in relevant components to enhance
concurrent processing capabilities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@piyush-zlai piyush-zlai force-pushed the piyush/metrics_reporter branch from 78cdbab to 9ef2fa9 on April 23, 2025 at 14:10
@piyush-zlai piyush-zlai force-pushed the piyush/metrics_reporter branch from 9ef2fa9 to 54c82bd on April 25, 2025 at 13:28
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (5)
online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (5)

205-236: ⚠️ Potential issue

toTags never consumed – metrics lose all context

All reporter methods use empty maps instead of the context's tags.

-    def increment(metric: String): Unit = Context.client.count(prefix(metric), 1, Map.empty)
+    def increment(metric: String): Unit = Context.client.count(prefix(metric), 1, toTags)

-    def distribution(metric: String, value: Long): Unit =
-      Context.client.distribution(prefix(metric), value, Map.empty)
+    def distribution(metric: String, value: Long): Unit =
+      Context.client.distribution(prefix(metric), value, toTags)

-    def count(metric: String, value: Long): Unit = Context.client.count(prefix(metric), value)
+    def count(metric: String, value: Long): Unit = Context.client.count(prefix(metric), value, toTags)

-    def gauge(metric: String, value: Long): Unit = Context.client.longGauge(prefix(metric), value)
+    def gauge(metric: String, value: Long): Unit = Context.client.longGauge(prefix(metric), value, toTags)

-    def gauge(metric: String, value: Double): Unit = Context.client.doubleGauge(prefix(metric), value)
+    def gauge(metric: String, value: Double): Unit = Context.client.doubleGauge(prefix(metric), value, toTags)

176-201: ⚠️ Potential issue

Only last join tag survives

addTag(Tag.Join, _) inside a loop overwrites previous values; multi‑join contexts lose information.

Fix by aggregating join names before a single addTag call:

-      joinNames.foreach(addTag(Tag.Join, _))
+      if (joinNames.nonEmpty) {
+        addTag(Tag.Join, joinNames.mkString(","))
+      }

222-222: ⚠️ Potential issue

Merge with context tags

Exception tags should merge with context tags, not replace them.

-      Context.client.count(prefix(Name.Exception), 1, Map(Metrics.Name.Exception -> exceptionSignature))
+      Context.client.count(prefix(Name.Exception), 1, toTags + (Metrics.Name.Exception -> exceptionSignature))

228-229: ⚠️ Potential issue

Merge with context tags

Additional tags should merge with context tags, not replace them.

-    def distribution(metric: String, value: Long, additionalTags: Map[String, String]): Unit =
-      Context.client.distribution(prefix(metric), value, additionalTags)
+    def distribution(metric: String, value: Long, additionalTags: Map[String, String]): Unit =
+      Context.client.distribution(prefix(metric), value, toTags ++ additionalTags)

207-208: ⚠️ Potential issue

Merge with context tags

Additional tags should merge with context tags, not replace them.

-    def increment(metric: String, additionalTags: Map[String, String]): Unit =
-      Context.client.count(prefix(metric), 1, additionalTags)
+    def increment(metric: String, additionalTags: Map[String, String]): Unit =
+      Context.client.count(prefix(metric), 1, toTags ++ additionalTags)
🧹 Nitpick comments (3)
online/src/main/scala/ai/chronon/online/metrics/FlexibleExecutionContext.scala (1)

38-44: Queue capacity increased by 10x

The queue capacity was increased from 1,000 to 10,000 items. This reduces task rejections under high load, but memory usage should be monitored; a bounded-queue sketch with backpressure follows.
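
For illustration, a bounded queue paired with CallerRunsPolicy trades unbounded memory for backpressure (pool sizes here are assumptions, not the PR's values):

```scala
import java.util.concurrent._

// Bounded work queue: once full, CallerRunsPolicy makes the submitting thread run
// the task itself, slowing producers instead of growing memory or rejecting work.
val pool = new ThreadPoolExecutor(
  8,                                        // corePoolSize (assumed)
  32,                                       // maximumPoolSize (assumed)
  60L, TimeUnit.SECONDS,                    // keepAliveTime for excess threads
  new ArrayBlockingQueue[Runnable](10000),  // matches the new 10k bound
  Executors.defaultThreadFactory(),
  new ThreadPoolExecutor.CallerRunsPolicy()
)
```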

online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (1)

30-30: Consider constructor-injected metrics context

The hard-coded context limits reusability. Consider accepting it as a constructor parameter.

-class InstrumentedThreadPoolExecutor(corePoolSize: Int,
+class InstrumentedThreadPoolExecutor(corePoolSize: Int,
                                     maximumPoolSize: Int,
                                     keepAliveTime: Long,
                                     unit: TimeUnit,
                                     workQueue: BlockingQueue[Runnable],
                                     threadFactory: ThreadFactory,
+                                    metricsContext: Metrics.Context = Metrics.Context(Metrics.Environment.Fetcher).withSuffix("threadpool"),
                                     metricsIntervalSeconds: Int = 15)
   extends ThreadPoolExecutor(
     corePoolSize,
     maximumPoolSize,
     keepAliveTime,
     unit,
     workQueue,
     threadFactory
   ) {
 // Reporter for periodic metrics
 private val metricsReporter: ScheduledExecutorService = buildMetricsScheduledExecutor()

-protected val metricsContext: Metrics.Context = Metrics.Context(Metrics.Environment.Fetcher).withSuffix("threadpool")
online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (1)

130-146: Add error handling for OtelMetricsReporter creation

The current implementation throws on initialization failure. Consider adding a fallback.

   private val client: MetricsReporter = {
     // Can disable metrics collection for local / dev environments
     val metricsEnabled: Boolean = System.getProperty(MetricsEnabled, "true").toBoolean
     val reporter: String = System.getProperty(MetricsReporter, "otel")

-    reporter.toLowerCase match {
-      case "otel" | "opentelemetry" =>
-        if (metricsEnabled) {
-          val metricReader = OtelMetricsReporter.buildOtelMetricReader()
-          val openTelemetry = OtelMetricsReporter.buildOpenTelemetryClient(metricReader)
-          new OtelMetricsReporter(openTelemetry)
-        } else {
-          new OtelMetricsReporter(OpenTelemetry.noop())
-        }
-      case _ =>
-        throw new IllegalArgumentException(s"Unknown metrics reporter: $reporter. Only opentelemetry is supported.")
+    try {
+      reporter.toLowerCase match {
+        case "otel" | "opentelemetry" =>
+          if (metricsEnabled) {
+            val metricReader = OtelMetricsReporter.buildOtelMetricReader()
+            val openTelemetry = OtelMetricsReporter.buildOpenTelemetryClient(metricReader)
+            new OtelMetricsReporter(openTelemetry)
+          } else {
+            new OtelMetricsReporter(OpenTelemetry.noop())
+          }
+        case _ =>
+          throw new IllegalArgumentException(s"Unknown metrics reporter: $reporter. Only opentelemetry is supported.")
+      }
+    } catch {
+      case e: Exception =>
+        org.slf4j.LoggerFactory.getLogger(getClass).error(s"Failed to initialize metrics: ${e.getMessage}", e)
+        new OtelMetricsReporter(OpenTelemetry.noop())
+    }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 9ef2fa9 and 54c82bd.

📒 Files selected for processing (15)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (17 hunks)
  • docker/fetcher/start.sh (2 hunks)
  • maven_install.json (37 hunks)
  • online/BUILD.bazel (4 hunks)
  • online/src/main/java/ai/chronon/online/JavaFetcher.java (1 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/FlexibleExecutionContext.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (6 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/MetricsReporter.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (1 hunks)
  • online/src/test/scala/ai/chronon/online/test/TagsTest.scala (2 hunks)
  • service_commons/BUILD.bazel (1 hunks)
  • service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java (1 hunks)
  • tools/build_rules/dependencies/maven_repository.bzl (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • online/src/main/java/ai/chronon/online/JavaFetcher.java
  • online/src/main/scala/ai/chronon/online/fetcher/Fetcher.scala
🚧 Files skipped from review as they are similar to previous changes (8)
  • service_commons/BUILD.bazel
  • online/src/test/scala/ai/chronon/online/test/TagsTest.scala
  • online/BUILD.bazel
  • docker/fetcher/start.sh
  • service_commons/src/main/java/ai/chronon/service/ChrononServiceLauncher.java
  • tools/build_rules/dependencies/maven_repository.bzl
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala
  • online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala
🧰 Additional context used
🧬 Code Graph Analysis (1)
online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (2)
online/src/main/scala/ai/chronon/online/metrics/Metrics.scala (3)
  • Metrics (26-237)
  • Environment (27-43)
  • withSuffix (165-165)
online/src/main/scala/ai/chronon/online/metrics/OtelMetricsReporter.scala (1)
  • distribution (73-78)
⏰ Context from checks skipped due to timeout of 90000ms (21)
  • GitHub Check: cloud_aws_tests
  • GitHub Check: service_tests
  • GitHub Check: service_tests
  • GitHub Check: online_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: hub_tests
  • GitHub Check: hub_tests
  • GitHub Check: online_tests
  • GitHub Check: flink_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: flink_tests
  • GitHub Check: api_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: orchestration_tests
  • GitHub Check: orchestration_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: api_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (40)
online/src/main/scala/ai/chronon/online/metrics/MetricsReporter.scala (1)

8-17: LGTM - Well-structured metrics interface

Clean, consistent method signatures with appropriate parameter defaults.
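
For context, the interface reconstructed from the call sites in the diffs above would look roughly like this (a sketch; the real trait may differ):

```scala
// Pluggable reporter interface: tags default to empty so call sites can stay terse.
trait MetricsReporter {
  def count(metric: String, value: Long, tags: Map[String, String] = Map.empty): Unit
  def distribution(metric: String, value: Long, tags: Map[String, String] = Map.empty): Unit
  def longGauge(metric: String, value: Long, tags: Map[String, String] = Map.empty): Unit
  def doubleGauge(metric: String, value: Double, tags: Map[String, String] = Map.empty): Unit
}
```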

online/src/main/scala/ai/chronon/online/metrics/InstrumentedThreadPoolExecutor.scala (2)

66-89: LGTM - Well-implemented task instrumentation

Properly captures and reports wait time, execution time, and total time metrics.
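
One common way to capture these timings is to wrap each task at submission time; a self-contained sketch (names and metric keys are illustrative, not the PR's code):

```scala
import java.util.concurrent.TimeUnit

// Records enqueue time at construction so queue wait, execution, and total
// latency can each be reported when the task finally runs.
class TimedRunnable(inner: Runnable, report: (String, Long) => Unit) extends Runnable {
  private val enqueuedNanos = System.nanoTime()

  override def run(): Unit = {
    val startedNanos = System.nanoTime()
    report("threadpool.wait_ms", TimeUnit.NANOSECONDS.toMillis(startedNanos - enqueuedNanos))
    try inner.run()
    finally {
      val finishedNanos = System.nanoTime()
      report("threadpool.exec_ms", TimeUnit.NANOSECONDS.toMillis(finishedNanos - startedNanos))
      report("threadpool.total_ms", TimeUnit.NANOSECONDS.toMillis(finishedNanos - enqueuedNanos))
    }
  }
}
```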


91-100: LGTM - Proper resource cleanup

Correctly shuts down metrics reporter along with thread pool.

maven_install.json (37)

3-4: Metadata bump


1197-1203: OkHttp artifact added


1778-1784: Micrometer OTLP registry added


2131-2137: OpenTelemetry proto added


2147-2150: OTel API version bumped


2159-2249: OTel SDK modules updated to 1.49.0


2257-2298: Prometheus metrics modules added


4172-4185: Kotlin stdlib JDK7/JDK8 added


6420-6426: OkHttp/Okio dependency mapping


6738-6741: Micrometer OTLP mapping


7007-7024: OTel dependency mappings updated


8476-8482: Kotlin stdlib mapping updated


11312-11329: OkHttp package-to-class mapping


12166-12168: Micrometer OTLP package mapping


12540-12549: OTel proto package mapping


12578-12615: OTel exporter package mappings


12628-12631: SDK autoconfigure SPI mapping


12678-12699: Prometheus package mappings


24457-24463: Sources list includes OkHttp/Okio


24623-24624: Micrometer OTLP in sources


24729-24731: OTel proto in sources


24739-24747: OTel exporter in sources


24765-24775: Prometheus in sources


25298-25301: Kotlin JDK variants in sources


25925-25930: OkHttp/Okio in sources


26091-26092: Micrometer OTLP in sources


26197-26199: OTel proto in sources


26207-26215: OTel exporter in sources


26233-26243: Prometheus in sources


26766-26769: Kotlin JDK sources updated


27393-27398: OkHttp/Okio sources updated


27559-27560: Micrometer OTLP in sources


27665-27667: OTel proto in sources


27675-27683: OTel exporter sources


27698-27712: Prometheus sources list


28234-28237: Kotlin JDK in stdlib sources


30256-30303: SPI registrations for OTel exporters finalized

@piyush-zlai piyush-zlai changed the title Swap our metrics provider from statsd to otel-metrics Swap our metrics provider from statsd to otel-metrics + instrument our Fetcher threadpool Apr 25, 2025