planner-2 #730

Open: wants to merge 413 commits into main
Conversation

varant-zlai (Collaborator) commented May 2, 2025

Summary

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Added GroupByOfflinePlanner and JoinOfflinePlanner classes for offline execution planning.
    • Introduced MetaDataUtils for metadata layering and normalization.
    • Added TableDependencies utilities to generate table dependencies from joins, group-bys, and sources.
    • Introduced PlanNode and Planner abstractions for execution planning (see the sketch after this list).
    • Added PartitionSpecWithColumn to encapsulate partitioning details.
    • Added NodeRunner and BatchRunContext traits for batch execution context.
  • Improvements

    • Enhanced window arithmetic and utility methods for time units and window operations.
    • Updated naming conventions from "Keyword" suffix to "Folder" for configuration types, improving clarity.
    • Refined metadata handling with consistent layering and merging of execution info and environment.
    • Improved deduplication in group-by fetch requests.
  • Bug Fixes

    • Fixed logic for determining partition ranges and dependencies in temporal and snapshot data models.
  • Refactor

    • Replaced generic bootstrap job classes with join-specific JoinBootstrapJob.
    • Consolidated multiple parameters into context objects in batch job methods.
    • Restricted visibility of internal methods in label join processing.
    • Updated join utilities to leverage offline planners for naming consistency.
    • Removed obsolete configuration-to-node conversion code and associated tests.
  • Chores

    • Updated .gitignore to exclude additional directories.
    • Updated Thrift schemas with new fields and renamed structs for orchestration.
    • Replaced constants in multiple components from "Keyword" to "Folder" suffix for consistency.
  • Documentation

    • Improved inline comments and docstrings for better readability.
    • Enhanced Thrift schema documentation with detailed field descriptions.
  • Tests

    • Removed outdated test suite for deprecated dependency resolver logic.
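
The planner pieces above hang together roughly as follows. This is a hypothetical Scala sketch: the type names come from the release notes above, but every signature and field is an assumption, not the PR's actual code.

```scala
// Hypothetical stand-ins only; the real definitions live in the PR's sources.
case class TableDependency(table: String, partitionColumn: String)

trait PlanNode {
  def name: String                       // unique name used for orchestration
  def dependencies: Seq[TableDependency] // upstream tables this node reads
  def outputTable: String                // table this node materializes
}

trait Planner[Conf] {
  // Expand a user conf (e.g. a Join or GroupBy) into an executable node DAG.
  def plan(conf: Conf): Seq[PlanNode]
}

trait BatchRunContext {
  def startDs: String // partition range carried to each node run
  def endDs: String
}

trait NodeRunner {
  def run(node: PlanNode, ctx: BatchRunContext): Unit
}
```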

nikhil-zlai and others added 30 commits February 9, 2025 23:09
## Summary

isEmpty is a somewhat expensive operation, as it needs a partial table
scan. For the most part, joins tolerate empty dataframes, so we can
optimize the common path by skipping the check. A minimal sketch follows.
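
A minimal sketch of the idea, assuming a Spark-style join helper; the method and column names are illustrative, not the PR's code.

```scala
import org.apache.spark.sql.DataFrame

// Illustrative only: no isEmpty guard on the common path. Joining against an
// empty frame is cheap and yields an empty result anyway, so the partial
// table scan that isEmpty triggers is skipped entirely. "key" is hypothetical.
def leftJoin(left: DataFrame, right: DataFrame): DataFrame =
  left.join(right, Seq("key"), "left")
```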

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **Refactor**
- Refined internal logic to streamline condition evaluations and
consolidated diagnostic messaging for more effective system monitoring.
These optimizations simplify internal processing while ensuring a
consistent user experience with no visible changes to public features.
Enhanced logging now provides improved insights into system operations
without impacting functionality. This update improves overall system
efficiency and clarity.


Co-authored-by: Thomas Chow <[email protected]>
## Summary
This PR wires up tiling support. Covers a few aspects:
* BigTable KV store changes to support tiling: we take get and put requests
for the '_STREAMING' table using the TileKey thrift interface and map them to
the corresponding BT RowKey + time-range lookups. We've yanked out
event-based support in the BT KV store. We're writing out data in the Row +
tile format documented here: [Option 1 - Tiles as Timestamped
Rows](https://docs.google.com/document/d/1wgzJVAkl5K1bBCr98WCZFiFeTTWqILdA3FTE7cz9Li4/edit?tab=t.0#bookmark=id.j54a5g8gj2m9).
A rough sketch of the row-key mapping follows this list.
* Add a Flag in the FlagStore to indicate whether we're using tiling.
Switched the fetcher checks over to this instead of the prior
GroupByServingInfo.isTilingEnabled flag. Leverage this flag in Flink to
choose tiling or not. Set this flag to true in the GcpApi to always use
tiling.
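
A rough Scala sketch of the tiles-as-timestamped-rows idea. Every field and the encoding below are assumptions for illustration; the real TileKey thrift struct and row-key format live in the PR and the linked doc.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for the TileKey thrift struct.
case class TileKey(dataset: String, entityKey: Array[Byte], tileSizeMs: Long, tileStartTs: Long)

def buildRowKey(tile: TileKey): Array[Byte] = {
  // The row identifies (dataset, tile size, entity); the tile's start
  // timestamp becomes the BigTable cell timestamp, so a get over a time
  // range turns into a row read bounded by cell versions.
  val prefix = s"${tile.dataset}#${tile.tileSizeMs}#".getBytes(StandardCharsets.UTF_8)
  prefix ++ tile.entityKey
}
```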


## Checklist
- [X] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested - Tested on the Etsy side by running the job,
hitting some fetcher cli endpoints.
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Introduced dynamic tiling capabilities for time series and streaming
data processing. This enhancement enables a configurable tiled data mode
that improves data retrieval granularity, processing consistency, and
overall query performance, resulting in more efficient and predictable
operations for end-users.
- Added new methods for constructing tile keys and row keys, enhancing
data management capabilities.
- Implemented flag-based control for enabling or disabling tiling in
various components, allowing for more flexible configurations.

- **Bug Fixes**
  - Corrected minor documentation errors in the FlagStore interface.

- **Tests**
- Expanded test coverage to validate new tiling functionalities and
ensure robustness in handling time series data.

---------

Co-authored-by: tchow-zlai <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
	- Enhanced table output handling to support partitioned tables.
- Introduced configurable options for temporary storage and integration
settings, improving cloud-based table materialization.


---------

Co-authored-by: Thomas Chow <[email protected]>
- This is a better failure mode; we don't want to continue if there's
something failing in the analysis phase. A minimal sketch of the fail-fast
flow follows.
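
A minimal sketch of the fail-fast behavior, with hypothetical names; the PR's actual join code differs.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical names: analysis errors are rethrown instead of logged and
// swallowed, so a broken conf stops the join before any compute launches.
def analyzeThenCompute(runAnalysis: () => Unit, compute: () => Unit): Unit =
  Try(runAnalysis()) match {
    case Success(_) => compute()
    case Failure(e) => throw e // propagate: do not continue past analysis
  }
```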

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **Bug Fixes**
	- Refined error handling mechanism in join computation process
	- Improved exception propagation during unexpected errors

The changes focus on streamlining error management with a more direct
approach to handling unexpected exceptions during join operations.



---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

- bulked out eval to run sources inside join / group_by etc. 

- removed need for separate gateway setup and maintenance.

- added support for sampling dependent tables to local_warehouse.

- deleted some dead code.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested (on etsy confs)
- [ ] Documentation update



## Summary by CodeRabbit

- **Bug Fixes**
  - Improved error feedback clarity during data sampling.

- **New Features**
  - Increased data sampling limits for improved performance.
  - Enhanced SQL query handling with new date filtering conditions.

- **Refactor**
- Streamlined SQL query generation for table scans, ensuring valid
queries under various conditions.
- Deprecated outdated sampling functionality to enhance overall
maintainability.

- **Chores**
- Disabled unnecessary operations in the build and upload script for
Google Cloud Storage.

- **Style**
- Added logging for improved traceability of filtering conditions in
DataFrame scans.

- **Tests**
  - Removed unit tests for the Flow and Node classes.
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Enhanced data processing by introducing new configuration options for
writing data, including support for Parquet as the intermediate format
and enabling list inference during write operations.
- Expanded selection of fields in purchase events with the addition of
`bucket_rand`.
- Introduced a new aggregation to calculate the last 15 purchase prices,
utilizing the newly added `bucket_rand` field.


---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
^^^

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Enhanced logging now delivers color-coded outputs and adjusted log
levels for clearer visibility.
- Upgraded service versioning supports stable, production-ready
deployments.

- **Chores**
- Modernized the build and deployment pipeline to improve artifact
handling.
- Refined dependency management to bolster advanced logging
capabilities.
## Summary
Adds support for creating a new `.bazelrc.local` file specifying custom
build/test Bazel options, which can be used for passing gcloud auth
credentials. A hypothetical example follows.
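
A hypothetical `.bazelrc.local` illustrating the idea; the exact flags a developer puts here are up to them and are not prescribed by the PR.

```
# .bazelrc.local (git-ignored, per-developer). Flag names below are
# illustrative assumptions, not prescribed by the PR.
build --google_credentials=/path/to/application_default_credentials.json
test --test_env=GOOGLE_APPLICATION_CREDENTIALS=/path/to/application_default_credentials.json
```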

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Updated the build configuration to optionally load a user-specific
settings file, replacing the automatic use of preset credentials.
- **Documentation**
- Enhanced guidance with a new section detailing steps for setting up
personal authentication credentials.
## Summary
Modified our GitHub workflow to run scalafmt checks using Bazel instead
of sbt, and deleted the build.sbt file as it's no longer needed.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Chores**
- Streamlined build and continuous integration setups, transitioning
away from legacy tooling.
- Modernized internal infrastructure for improved consistency and
stability.

- **Refactor / Style**
- Enhanced code readability with comprehensive cosmetic and
documentation updates.
- Unified formatting practices across the codebase to support future
maintainability.
- Adjusted formatting of comments and code blocks for improved clarity
without altering functionality.

- **Tests**
- Reformatted test suites for clarity and consistency while preserving
all functional behaviors.
- Improved formatting in various test cases and methods for better
readability without altering functionality.
## Summary

Based on Slack
[discussion](https://zipline-2kh4520.slack.com/archives/C0880ECQ0EN/p1739304132253249)

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update





## Summary by CodeRabbit

- **New Features**
- Introduced an optional attribute to enhance node classification with
more detailed physical characteristics for improved metadata
representation.


Co-authored-by: Sean Lynch <[email protected]>
## Summary
Updated the zpush script in our dev notes to use Bazel scalafmt.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **Documentation**
  - Enhanced guidelines for formatting and pushing Scala code.
- Replaced previous procedures with an updated method featuring detailed
error notifications.
  - Clarified the need for quoting multi-word commit messages.
- Adjusted the ordering of remote connectivity instructions for improved
clarity.

## Summary

- Adding new BUILD for cloud_aws
- Adding the above to the CI/CD

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Enhanced AWS integration with a new library for AWS functionality and
a framework for job submission.
- Introduced a new utility for managing job submissions, statuses, and
terminations.
- Added dedicated triggers for cloud modules to improve workflow
automation.
- **Tests**
- Improved testing coverage with additional utilities for validating
cloud functionalities and increased timeout settings for asynchronous
operations.
- **Chores**
- Updated dependency configurations to incorporate essential AWS SDK
components.


---------

Co-authored-by: Thomas Chow <[email protected]>
… fix gbu for bigquery (#365)

## Summary
^^^

This is being done because the current Chronon engine assumes the partition
column is a string type, but the partition field of BigQuery native
tables is a date type.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Enhanced data processing by automatically formatting date-based
partition columns, ensuring robust handling of partitioned native tables
for more reliable data scanning.
- Simplified retrieval of required columns in the `buildServingInfo`
method, improving efficiency by directly selecting columns using a dummy
table.
- **Bug Fixes**
- Improved logging for the scanning process, providing better
traceability during data operations.
## Summary
While running our Flink jobs we see periodic restarts because we're low on
direct memory. Direct memory is required by Kafka consumer clients as well
as BigTable's client SDK. Flink's default seems to be [0
bytes](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/config/).
Bumping this to 1G results in the jobs running without restarting every
hour. A hedged configuration sketch follows.
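
A sketch of the kind of change involved. Flink's task off-heap (direct) memory option really does default to 0 bytes; whether the job sets this exact key, or the equivalent flink-conf.yaml entry, is an assumption here.

```scala
import org.apache.flink.configuration.Configuration

// Raising task off-heap memory gives the Kafka consumer and the BigTable
// client SDK the direct memory they need.
val conf = new Configuration()
conf.setString("taskmanager.memory.task.off-heap.size", "1g")
```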

Before:
![Screenshot 2025-02-11 at 4 24 30 PM](https://github.com/user-attachments/assets/75f88687-9ecf-4fc4-b89c-e863be3ee1ff)

After:
![Screenshot 2025-02-11 at 9 16 48 AM](https://github.com/user-attachments/assets/bc66aa0a-1b92-4b46-a78d-0c70168288d7)


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **New Features**
- Improved memory allocation for job processing, allocating additional
off-heap memory to enhance performance and reliability for applications
with high memory demands.

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Bug Fixes**
- Improved handling of date-based partition columns during table
processing to ensure data is formatted and consolidated accurately.


---------

Co-authored-by: Thomas Chow <[email protected]>
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Enhanced the table creation process to return clear, detailed
statuses, improving feedback during table building.
- Introduced a new method for generating table builders that integrates
with BigQuery, including error handling for partitioning.
- Streamlined data writing operations to cloud storage with automatic
path configuration and Parquet integration.
- Added explicit partitioning for DataFrame saves in Hive, Delta, and
Iceberg formats.
  
- **Refactor**
- Overhauled logic to enforce partition restrictions and incorporate
robust error handling for a smoother user experience.

---------

Co-authored-by: tchow-zlai <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
…374)

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **New Features**
- Improved the cloud upload process to include additional metadata with
each file, enhancing traceability and information capture.

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Refactor**
- Modified join analysis behavior to disable automatic table permission
checks by default, simplifying operations. Users can now explicitly
enable permission validation when needed.


Co-authored-by: Thomas Chow <[email protected]>
…e do it in tableutils scandf (#368)

## Summary

Doing this PR because of #365

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **Refactor**
- Simplified the data join processing to eliminate redundant
transformations, ensuring a more streamlined handling of left-side data
during join operations.
- Updated underlying logic to adjust how partition details are managed,
which may influence the output schema in data processing workflows.

#359)

## Summary

- also refactored out google-crc32c because it was slow due to falling back
to the non-C implementation; using a different library instead

Tested here:

```
(tmp_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/cananry-confs (davidhan/canary) $ zipline run --conf production/group_bys/quickstart/purchases.v1_test --dataproc
/Users/davidhan/zipline/chronon/tmp_chronon/lib/python3.13/site-packages/google_crc32c/__init__.py:29: RuntimeWarning: As the c extension couldn't be imported, `google-crc32c` is using a pure python implementation that is significantly slower. If possible, please configure a c build environment and compile the extension
  warnings.warn(_SLOW_CRC32C_WARNING, RuntimeWarning)
Running with args: {'conf': 'production/group_bys/quickstart/purchases.v1_test', 'dataproc': True, 'env': 'dev', 'mode': None, 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp_lib_deploy.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None, 'groupby_name': None, 'kafka_bootstrap': None, 'mock_source': False, 'savepoint_uri': None}
Setting env variables:
From <common_env> setting VERSION=latest
From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit
From <common_env> setting JOB_MODE=local[*]
From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing
From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class
From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>
From <common_env> setting PARTITION_COLUMN=ds
From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd
From <common_env> setting CUSTOMER_ID=canary
From <common_env> setting GCP_PROJECT_ID=canary-443022
From <common_env> setting GCP_REGION=us-central1
From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster
From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance
From <cli_args> setting APP_NAME=chronon
From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp_lib_deploy.jar
Local hash of /tmp/zipline/cloud_gcp_submitter_deploy.jar: Inl1LA==. GCS file jars/cloud_gcp_submitter_deploy.jar hash: Inl1LA==
/tmp/zipline/cloud_gcp_submitter_deploy.jar matches GCS zipline-artifacts-canary/jars/cloud_gcp_submitter_deploy.jar
File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary.
Running command: java -cp /tmp/zipline/cloud_gcp_submitter_deploy.jar ai.chronon.integrations.cloud_gcp.DataprocSubmitter group-by-backfill --conf-path=purchases.v1_test --end-date=2025-02-10  --conf-type=group_bys      --jar-uri=gs://zipline-artifacts-canary/jars/cloud_gcp_lib_deploy.jar --job-type=spark --main-class=ai.chronon.spark.Driver --additional-conf-path=additional-confs.yaml --gcs-files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml
WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.
Array(group-by-backfill, --conf-path=purchases.v1_test, --end-date=2025-02-10, --conf-type=group_bys, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)
Dataproc submitter job id: 1e5c75a3-5697-44e9-a65d-831b7c526108
Safe to exit. Follow the job status at: https://console.cloud.google.com/dataproc/jobs/1e5c75a3-5697-44e9-a65d-831b7c526108

                    <-----------------------------------------------------------------------------------
                    ------------------------------------------------------------------------------------                            
                                                      DATAPROC LOGS   
                    ------------------------------------------------------------------------------------                             
                    ------------------------------------------------------------------------------------>
                    
Running command: gcloud dataproc jobs wait  1e5c75a3-5697-44e9-a65d-831b7c526108 --region=us-central1
Waiting for job output...
25/02/11 03:03:35 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/02/11 03:03:35 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
Using warehouse dir: /tmp/1e5c75a3-5697-44e9-a65d-831b7c526108/local_warehouse
25/02/11 03:03:38 INFO HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml
25/02/11 03:03:38 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/02/11 03:03:38 INFO SparkEnv: Registering MapOutputTracker
25/02/11 03:03:38 INFO SparkEnv: Registering BlockManagerMaster
25/02/11 03:03:38 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/02/11 03:03:38 INFO SparkEnv: Registering OutputCommitCoordinator
25/02/11 03:03:39 INFO DataprocSparkPlugin: Registered 188 driver metrics
25/02/11 03:03:39 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8032
25/02/11 03:03:39 INFO AHSProxy: Connecting to Application History server at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:10200
25/02/11 03:03:40 INFO Configuration: resource-types.xml not found
25/02/11 03:03:40 INFO ResourceUtils: Unable to find 'resource-types.xml'.
25/02/11 03:03:41 INFO YarnClientImpl: Submitted application application_1738197659103_0071
25/02/11 03:03:42 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/02/11 03:03:42 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8030
25/02/11 03:03:43 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
25/02/11 03:03:44 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-us-central1-703996152583-pqtvfptb/5d9e94ed-7649-4828-8b64-e3d58632a5d0/spark-job-history/application_1738197659103_0071.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
2025/02/11 03:03:44 INFO  SparkSessionBuilder.scala:75 - Chronon logging system initialized. Overrides spark's configuration
2025/02/11 03:04:01 INFO  TableUtils.scala:195 - Found 29, between (2023-11-02, 2023-11-30) partitions for table: canary-443022.data.quickstart_purchases_v1_test
2025/02/11 03:04:10 INFO  TableUtils.scala:195 - Found 30, between (2023-11-01, 2023-11-30) partitions for table: data.purchases
2025/02/11 03:04:10 INFO  TableUtils.scala:619 - 
Unfilled range computation:
   Output table: canary-443022.data.quickstart_purchases_v1_test
   Missing output partitions: [2023-12-01,2023-12-02,2023-12-03,2023-12-04,2023-12-05,2023-12-06,2023-12-07,2023-12-08,2023-12-09,2023-12-10,2023-12-11,2023-12-12,2023-12-13,2023-12-14,2023-12-15,2023-12-16,2023-12-17,2023-12-18,2023-12-19,2023-12-20,2023-12-21,2023-12-22,2023-12-23,2023-12-24,2023-12-25,2023-12-26,2023-12-27,2023-12-28,2023-12-29,2023-12-30,2023-12-31,2024-01-01,2024-01-02,2024-01-03,2024-01-04,2024-01-05,2024-01-06,2024-01-07,2024-01-08,2024-01-09,2024-01-10,2024-01-11,2024-01-12,2024-01-13,2024-01-14,2024-01-15,2024-01-16,2024-01-17,2024-01-18,2024-01-19,2024-01-20,2024-01-21,2024-01-22,2024-01-23,2024-01-24,2024-01-25,2024-01-26,2024-01-27,2024-01-28,2024-01-29,2024-01-30,2024-01-31,2024-02-01,2024-02-02,2024-02-03,2024-02-04,2024-02-05,2024-02-06,2024-02-07,2024-02-08,2024-02-09,2024-02-10,2024-02-11,2024-02-12,2024-02-13,2024-02-14,2024-02-15,2024-02-16,2024-02-17,2024-02-18,2024-02-19,2024-02-20,2024-02-21,2024-02-22,2024-02-23,2024-02-24,2024-02-25,2024-02-26,2024-02-27,2024-02-28,2024-02-29,2024-03-01,2024-03-02,2024-03-03,2024-03-04,2024-03-05,2024-03-06,2024-03-07,2024-03-08,2024-03-09,2024-03-10,2024-03-11,2024-03-12,2024-03-13,2024-03-14,2024-03-15,2024-03-16,2024-03-17,2024-03-18,2024-03-19,2024-03-20,2024-03-21,2024-03-22,2024-03-23,2024-03-24,2024-03-25,2024-03-26,2024-03-27,2024-03-28,2024-03-29,2024-03-30,2024-03-31,2024-04-01,2024-04-02,2024-04-03,2024-04-04,2024-04-05,2024-04-06,2024-04-07,2024-04-08,2024-04-09,2024-04-10,2024-04-11,2024-04-12,2024-04-13,2024-04-14,2024-04-15,2024-04-16,2024-04-17,2024-04-18,2024-04-19,2024-04-20,2024-04-21,2024-04-22,2024-04-23,2024-04-24,2024-04-25,2024-04-26,2024-04-27,2024-04-28,2024-04-29,2024-04-30,2024-05-01,2024-05-02,2024-05-03,2024-05-04,2024-05-05,2024-05-06,2024-05-07,2024-05-08,2024-05-09,2024-05-10,2024-05-11,2024-05-12,2024-05-13,2024-05-14,2024-05-15,2024-05-16,2024-05-17,2024-05-18,2024-05-19,2024-05-20,2024-05-21,2024-05-22,2024-05-23,2024-05-24,2024-05-25,2024-05-26,2024-05-27,2024-05-28,2024-05-29,2024-05-30,2024-05-31,2024-06-01,2024-06-02,2024-06-03,2024-06-04,2024-06-05,2024-06-06,2024-06-07,2024-06-08,2024-06-09,2024-06-10,2024-06-11,2024-06-12,2024-06-13,2024-06-14,2024-06-15,2024-06-16,2024-06-17,2024-06-18,2024-06-19,2024-06-20,2024-06-21,2024-06-22,2024-06-23,2024-06-24,2024-06-25,2024-06-26,2024-06-27,2024-06-28,2024-06-29,2024-06-30,2024-07-01,2024-07-02,2024-07-03,2024-07-04,2024-07-05,2024-07-06,2024-07-07,2024-07-08,2024-07-09,2024-07-10,2024-07-11,2024-07-12,2024-07-13,2024-07-14,2024-07-15,2024-07-16,2024-07-17,2024-07-18,2024-07-19,2024-07-20,2024-07-21,2024-07-22,2024-07-23,2024-07-24,2024-07-25,2024-07-26,2024-07-27,2024-07-28,2024-07-29,2024-07-30,2024-07-31,2024-08-01,2024-08-02,2024-08-03,2024-08-04,2024-08-05,2024-08-06,2024-08-07,2024-08-08,2024-08-09,2024-08-10,2024-08-11,2024-08-12,2024-08-13,2024-08-14,2024-08-15,2024-08-16,2024-08-17,2024-08-18,2024-08-19,2024-08-20,2024-08-21,2024-08-22,2024-08-23,2024-08-24,2024-08-25,2024-08-26,2024-08-27,2024-08-28,2024-08-29,2024-08-30,2024-08-31,2024-09-01,2024-09-02,2024-09-03,2024-09-04,2024-09-05,2024-09-06,2024-09-07,2024-09-08,2024-09-09,2024-09-10,2024-09-11,2024-09-12,2024-09-13,2024-09-14,2024-09-15,2024-09-16,2024-09-17,2024-09-18,2024-09-19,2024-09-20,2024-09-21,2024-09-22,2024-09-23,2024-09-24,2024-09-25,2024-09-26,2024-09-27,2024-09-28,2024-09-29,2024-09-30,2024-10-01,2024-10-02,2024-10-03,2024-10-04,2024-10-05,2024-10-06,2024-10-07,2024-10-08,2024-10-09,2024-10-10,2024-10-11,2024-10-12,2024-10-13,2024-10-14,2024-10-15,2024
-10-16,2024-10-17,2024-10-18,2024-10-19,2024-10-20,2024-10-21,2024-10-22,2024-10-23,2024-10-24,2024-10-25,2024-10-26,2024-10-27,2024-10-28,2024-10-29,2024-10-30,2024-10-31,2024-11-01,2024-11-02,2024-11-03,2024-11-04,2024-11-05,2024-11-06,2024-11-07,2024-11-08,2024-11-09,2024-11-10,2024-11-11,2024-11-12,2024-11-13,2024-11-14,2024-11-15,2024-11-16,2024-11-17,2024-11-18,2024-11-19,2024-11-20,2024-11-21,2024-11-22,2024-11-23,2024-11-24,2024-11-25,2024-11-26,2024-11-27,2024-11-28,2024-11-29,2024-11-30,2024-12-01,2024-12-02,2024-12-03,2024-12-04,2024-12-05,2024-12-06,2024-12-07,2024-12-08,2024-12-09,2024-12-10,2024-12-11,2024-12-12,2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18,2024-12-19,2024-12-20,2024-12-21,2024-12-22,2024-12-23,2024-12-24,2024-12-25,2024-12-26,2024-12-27,2024-12-28,2024-12-29,2024-12-30,2024-12-31,2025-01-01,2025-01-02,2025-01-03,2025-01-04,2025-01-05,2025-01-06,2025-01-07,2025-01-08,2025-01-09,2025-01-10,2025-01-11,2025-01-12,2025-01-13,2025-01-14,2025-01-15,2025-01-16,2025-01-17,2025-01-18,2025-01-19,2025-01-20,2025-01-21,2025-01-22,2025-01-23,2025-01-24,2025-01-25,2025-01-26,2025-01-27,2025-01-28,2025-01-29,2025-01-30,2025-01-31,2025-02-01,2025-02-02,2025-02-03,2025-02-04,2025-02-05,2025-02-06,2025-02-07,2025-02-08,2025-02-09,2025-02-10]
   Input tables: data.purchases
   Missing input partitions: [2023-12-01,2023-12-02,2023-12-03,2023-12-04,2023-12-05,2023-12-06,2023-12-07,2023-12-08,2023-12-09,2023-12-10,2023-12-11,2023-12-12,2023-12-13,2023-12-14,2023-12-15,2023-12-16,2023-12-17,2023-12-18,2023-12-19,2023-12-20,2023-12-21,2023-12-22,2023-12-23,2023-12-24,2023-12-25,2023-12-26,2023-12-27,2023-12-28,2023-12-29,2023-12-30,2023-12-31,2024-01-01,2024-01-02,2024-01-03,2024-01-04,2024-01-05,2024-01-06,2024-01-07,2024-01-08,2024-01-09,2024-01-10,2024-01-11,2024-01-12,2024-01-13,2024-01-14,2024-01-15,2024-01-16,2024-01-17,2024-01-18,2024-01-19,2024-01-20,2024-01-21,2024-01-22,2024-01-23,2024-01-24,2024-01-25,2024-01-26,2024-01-27,2024-01-28,2024-01-29,2024-01-30,2024-01-31,2024-02-01,2024-02-02,2024-02-03,2024-02-04,2024-02-05,2024-02-06,2024-02-07,2024-02-08,2024-02-09,2024-02-10,2024-02-11,2024-02-12,2024-02-13,2024-02-14,2024-02-15,2024-02-16,2024-02-17,2024-02-18,2024-02-19,2024-02-20,2024-02-21,2024-02-22,2024-02-23,2024-02-24,2024-02-25,2024-02-26,2024-02-27,2024-02-28,2024-02-29,2024-03-01,2024-03-02,2024-03-03,2024-03-04,2024-03-05,2024-03-06,2024-03-07,2024-03-08,2024-03-09,2024-03-10,2024-03-11,2024-03-12,2024-03-13,2024-03-14,2024-03-15,2024-03-16,2024-03-17,2024-03-18,2024-03-19,2024-03-20,2024-03-21,2024-03-22,2024-03-23,2024-03-24,2024-03-25,2024-03-26,2024-03-27,2024-03-28,2024-03-29,2024-03-30,2024-03-31,2024-04-01,2024-04-02,2024-04-03,2024-04-04,2024-04-05,2024-04-06,2024-04-07,2024-04-08,2024-04-09,2024-04-10,2024-04-11,2024-04-12,2024-04-13,2024-04-14,2024-04-15,2024-04-16,2024-04-17,2024-04-18,2024-04-19,2024-04-20,2024-04-21,2024-04-22,2024-04-23,2024-04-24,2024-04-25,2024-04-26,2024-04-27,2024-04-28,2024-04-29,2024-04-30,2024-05-01,2024-05-02,2024-05-03,2024-05-04,2024-05-05,2024-05-06,2024-05-07,2024-05-08,2024-05-09,2024-05-10,2024-05-11,2024-05-12,2024-05-13,2024-05-14,2024-05-15,2024-05-16,2024-05-17,2024-05-18,2024-05-19,2024-05-20,2024-05-21,2024-05-22,2024-05-23,2024-05-24,2024-05-25,2024-05-26,2024-05-27,2024-05-28,2024-05-29,2024-05-30,2024-05-31,2024-06-01,2024-06-02,2024-06-03,2024-06-04,2024-06-05,2024-06-06,2024-06-07,2024-06-08,2024-06-09,2024-06-10,2024-06-11,2024-06-12,2024-06-13,2024-06-14,2024-06-15,2024-06-16,2024-06-17,2024-06-18,2024-06-19,2024-06-20,2024-06-21,2024-06-22,2024-06-23,2024-06-24,2024-06-25,2024-06-26,2024-06-27,2024-06-28,2024-06-29,2024-06-30,2024-07-01,2024-07-02,2024-07-03,2024-07-04,2024-07-05,2024-07-06,2024-07-07,2024-07-08,2024-07-09,2024-07-10,2024-07-11,2024-07-12,2024-07-13,2024-07-14,2024-07-15,2024-07-16,2024-07-17,2024-07-18,2024-07-19,2024-07-20,2024-07-21,2024-07-22,2024-07-23,2024-07-24,2024-07-25,2024-07-26,2024-07-27,2024-07-28,2024-07-29,2024-07-30,2024-07-31,2024-08-01,2024-08-02,2024-08-03,2024-08-04,2024-08-05,2024-08-06,2024-08-07,2024-08-08,2024-08-09,2024-08-10,2024-08-11,2024-08-12,2024-08-13,2024-08-14,2024-08-15,2024-08-16,2024-08-17,2024-08-18,2024-08-19,2024-08-20,2024-08-21,2024-08-22,2024-08-23,2024-08-24,2024-08-25,2024-08-26,2024-08-27,2024-08-28,2024-08-29,2024-08-30,2024-08-31,2024-09-01,2024-09-02,2024-09-03,2024-09-04,2024-09-05,2024-09-06,2024-09-07,2024-09-08,2024-09-09,2024-09-10,2024-09-11,2024-09-12,2024-09-13,2024-09-14,2024-09-15,2024-09-16,2024-09-17,2024-09-18,2024-09-19,2024-09-20,2024-09-21,2024-09-22,2024-09-23,2024-09-24,2024-09-25,2024-09-26,2024-09-27,2024-09-28,2024-09-29,2024-09-30,2024-10-01,2024-10-02,2024-10-03,2024-10-04,2024-10-05,2024-10-06,2024-10-07,2024-10-08,2024-10-09,2024-10-10,2024-10-11,2024-10-12,2024-10-13,2024-10-14,2024-10-15,2024-
10-16,2024-10-17,2024-10-18,2024-10-19,2024-10-20,2024-10-21,2024-10-22,2024-10-23,2024-10-24,2024-10-25,2024-10-26,2024-10-27,2024-10-28,2024-10-29,2024-10-30,2024-10-31,2024-11-01,2024-11-02,2024-11-03,2024-11-04,2024-11-05,2024-11-06,2024-11-07,2024-11-08,2024-11-09,2024-11-10,2024-11-11,2024-11-12,2024-11-13,2024-11-14,2024-11-15,2024-11-16,2024-11-17,2024-11-18,2024-11-19,2024-11-20,2024-11-21,2024-11-22,2024-11-23,2024-11-24,2024-11-25,2024-11-26,2024-11-27,2024-11-28,2024-11-29,2024-11-30,2024-12-01,2024-12-02,2024-12-03,2024-12-04,2024-12-05,2024-12-06,2024-12-07,2024-12-08,2024-12-09,2024-12-10,2024-12-11,2024-12-12,2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18,2024-12-19,2024-12-20,2024-12-21,2024-12-22,2024-12-23,2024-12-24,2024-12-25,2024-12-26,2024-12-27,2024-12-28,2024-12-29,2024-12-30,2024-12-31,2025-01-01,2025-01-02,2025-01-03,2025-01-04,2025-01-05,2025-01-06,2025-01-07,2025-01-08,2025-01-09,2025-01-10,2025-01-11,2025-01-12,2025-01-13,2025-01-14,2025-01-15,2025-01-16,2025-01-17,2025-01-18,2025-01-19,2025-01-20,2025-01-21,2025-01-22,2025-01-23,2025-01-24,2025-01-25,2025-01-26,2025-01-27,2025-01-28,2025-01-29,2025-01-30,2025-01-31,2025-02-01,2025-02-02,2025-02-03,2025-02-04,2025-02-05,2025-02-06,2025-02-07,2025-02-08,2025-02-09,2025-02-10]
   Unfilled Partitions: []
   Unfilled ranges: 

2025/02/11 03:04:10 INFO  GroupBy.scala:722 - Nothing to backfill for canary-443022.data.quickstart_purchases_v1_test - given
endPartition of 2025-02-10
backfill start of 2023-11-01
Exiting...
Job [1e5c75a3-5697-44e9-a65d-831b7c526108] finished successfully.
done: true
driverControlFilesUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/1e5c75a3-5697-44e9-a65d-831b7c526108/
driverOutputResourceUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/1e5c75a3-5697-44e9-a65d-831b7c526108/driveroutput
jobUuid: 1e5c75a3-5697-44e9-a65d-831b7c526108
placement:
  clusterName: zipline-canary-cluster
  clusterUuid: 5d9e94ed-7649-4828-8b64-e3d58632a5d0
reference:
  jobId: 1e5c75a3-5697-44e9-a65d-831b7c526108
  projectId: canary-443022
sparkJob:
  args:
  - group-by-backfill
  - --conf-path=purchases.v1_test
  - --end-date=2025-02-10
  - --conf-type=group_bys
  - --additional-conf-path=additional-confs.yaml
  - --is-gcp
  - --gcp-project-id=canary-443022
  - --gcp-bigtable-instance-id=zipline-canary-instance
  fileUris:
  - gs://zipline-warehouse-canary/metadata/purchases.v1_test
  - gs://zipline-artifacts-canary/confs/additional-confs.yaml
  jarFileUris:
  - gs://zipline-artifacts-canary/jars/cloud_gcp_lib_deploy.jar
  mainClass: ai.chronon.spark.Driver
status:
  state: DONE
  stateStartTime: '2025-02-11T03:04:13.983885Z'
statusHistory:
- state: PENDING
  stateStartTime: '2025-02-11T03:03:30.333322Z'
- state: SETUP_DONE
  stateStartTime: '2025-02-11T03:03:30.363428Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2025-02-11T03:03:30.565778Z'
yarnApplications:
- name: groupBy_quickstart.purchases.v1_test_backfill
  progress: 1.0
  state: FINISHED
  trackingUrl: http://zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal.:8088/proxy/application_1738197659103_0071/

```

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Improved user feedback with a direct monitoring URL for background job
status.

- **Improvements**
  - Enhanced error handling and output display during job submissions.
- Streamlined environment configuration retrieval for greater
reliability.
- Introduced color-coded terminal messaging for clearer status
indications.

- **Dependencies**
  - Updated core dependency libraries to support improved functionality.
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **Tests**
- Streamlined several test suites by standardizing their execution
without legacy tagging filters.
- Ensured that core test logic remains consistent while simplifying the
test execution process.

- **Chores**
- Removed redundant tagging functionalities to reduce complexity and
improve test maintainability.
- Increased test timeout from 900 seconds to 3000 seconds to allow for
longer test execution.


---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

Allow setting the partition column name in sources. It is mapped to the
default partition column upon read and during partition checking. A rough
sketch follows.
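
A rough Spark sketch of the mapping, with assumed names; the engine's default partition column is shown as "ds".

```scala
import org.apache.spark.sql.DataFrame

// Illustrative only: a source may declare its own partition column; on read
// we rename it to the engine's default so downstream partition checks stay
// uniform.
def normalizePartitionColumn(df: DataFrame, sourceCol: String, defaultCol: String = "ds"): DataFrame =
  if (sourceCol == defaultCol) df else df.withColumnRenamed(sourceCol, defaultCol)
```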

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Enabled configurable partition columns in query, join, and data
generation operations for improved data partitioning.
- **Refactor**
- Streamlined partition handling and consolidated import structures to
enhance workflow efficiency.
- **Tests**
- Added test cases for verifying partition column functionality and
adjusted data generation volumes for better validation.
- Introduced new tests specifically for different partition columns to
ensure accurate handling of partitioned data.

These enhancements provide increased flexibility and accuracy in
managing partitioned datasets during data processing and join
operations.

---------

Co-authored-by: ezvz <[email protected]>
Co-authored-by: Nikhil Simha <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Introduced a new query to retrieve purchase records with date range
filtering.
- Enhanced data retrieval by including additional contextual metadata
for improved insights.


---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit


- **New Features**
- Introduced dedicated testing workflows covering multiple system
components to enhance overall reliability.
- Added new test suites for various components to enhance testing
granularity.
- **Refactor**
- Streamlined code organization with improved package structures and
consolidated imports across test modules.
- **Chores**
- Upgraded automated testing configurations with optimized resource
settings for improved performance and stability.



---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update





## Summary by CodeRabbit

- **Chores**
- Adjusted the test execution timeout setting from a longer duration to
900 seconds to ensure tests complete more promptly.


Co-authored-by: Thomas Chow <[email protected]>
…nd not pushed to remote (#385)

## Summary
I've been seeing that it's difficult to track which changes went into the
artifacts we push to Etsy and canary, especially when tracking performance
regressions in Spark jobs from one day to the next.

Adding a check to disallow pushes of customer artifacts if the branch is
dirty: all changes need to at least be pushed to remote.

Also adding a metadata tag of the commit and branch.


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **New Features**
- Introduced consistency checks during the build and upload process to
verify that local changes are committed and branches are in sync.
- Enhanced artifact metadata now includes additional context about the
code state at the time of upload.

varant-zlai and others added 22 commits April 28, 2025 15:49
## Summary

Disabling analyzer checks

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Refactor**
- Disabled the join configuration validation step before starting join
jobs.
- Updated time range calculation logic for certain join scenarios to
improve consistency.

---------

Co-authored-by: ezvz <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Added support for handling BigQuery views when loading tables,
improving compatibility with a wider range of BigQuery table types.
- **Bug Fixes**
- Updated internal handling of partition column aliases to ensure
accurate retrieval of partition data from BigQuery tables.


---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **Refactor**
- Improved handling of Google Cloud Storage (GCS) artifact locations by
requiring a full artifact prefix URI instead of relying on internal
customer ID logic. All GCS interactions now use this provided prefix,
allowing for more flexible and centralized configuration.


---------

Co-authored-by: Thomas Chow <[email protected]>
#701)

## Summary

- For BigQuery views, there won't be an explicit partition column on the
table. Let's just use the same implementation to list primary partition
columns.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Improved partition handling for BigQuery tables, allowing direct
retrieval of distinct partition values.
- **Bug Fixes**
- Added clear error handling for unsupported sub-partition filtering in
BigQuery partition queries.


---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Based on our conversation with the BigTable team, it seems like using the
Batcher implementation isn't what they recommend. It's primarily meant for
flow control and doesn't help us much. This PR yanks that code out to make
the BT implementation easier to read and reason about. A rough sketch of the
simplified read path follows.
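
A rough sketch of the simplified multi-get against the standard Bigtable client API; the table and key handling here are assumptions, not the PR's exact code.

```scala
import com.google.cloud.bigtable.data.v2.BigtableDataClient
import com.google.cloud.bigtable.data.v2.models.{Query, Row}
import com.google.protobuf.ByteString
import scala.collection.mutable.ArrayBuffer

def multiGet(client: BigtableDataClient, table: String, keys: Seq[Array[Byte]]): Seq[Row] = {
  // Accumulate all requested row keys onto a single Query.
  val query = keys.foldLeft(Query.create(table)) { (q, k) => q.rowKey(ByteString.copyFrom(k)) }
  // Plain readRows: no Batcher, no flow-control machinery.
  val rows = ArrayBuffer.empty[Row]
  val it = client.readRows(query).iterator()
  while (it.hasNext) rows += it.next()
  rows.toSeq
}
```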

## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **Refactor**
- Simplified multi-get operations by removing the bulk read batcher
logic and related configuration options.
	- Consolidated multi-get requests to use a single, consistent approach.

- **Tests**
- Streamlined test setup by removing parameterized tests and updating
mocking strategies to match the new implementation.
	- Removed unused helper methods and imports for cleaner test code.
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Improved handling of Google Cloud Storage bucket selection for file
uploads, now automatically using the appropriate warehouse bucket for
each customer.


---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Pull in PRs airbnb/chronon#964 and
airbnb/chronon#932. We hit issues related to 964
in some of our tests at Etsy: groupByServingInfo lookups against BT
timed out and we ended up caching the failure response. 964 addresses this,
and since it depends on 932 we're pulling that in as well.

## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



## Summary by CodeRabbit

- **New Features**
- Improved error handling and reporting for partial failures in join
operations and key-value store lookups.
- Enhanced cache refresh mechanisms for join configurations and
metadata, improving system robustness during failures.
- Added a configurable option to control strictness on invalid dataset
references in the in-memory key-value store.

- **Bug Fixes**
- Exceptions and partial failures are now more accurately surfaced in
fetch responses, ensuring clearer diagnostics for end-users.
	- Updated error key naming for consistency in response maps.

- **Tests**
- Added a new test to verify correct handling and reporting of partial
failures in key-value store operations.
…ssion status (#697)

## Summary
This is needed for the agent to be able to track the status of submitted jobs
and report it back to the orchestration service.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **New Features**
- Added support for specifying a custom cluster name when submitting EMR
jobs.

- **Improvements**
- Scaling factors for auto-scaling now support decimal values, allowing
more precise scaling adjustments.
- Job status methods now return status as a string, making it easier to
programmatically track job progress and errors.

## Summary

- In BigQuery, we have both views and native tables. For native tables we can
list partitions through the information schema; we cannot do the same for
views. To handle both, we'll do a blind test: first check the information
schema, and if we can't get a partition column out of that, we'll just do a
blind `select distinct(...)`. A sketch of the two queries follows.
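
A sketch of the two probes, with placeholder identifiers; the PR's actual SQL may differ.

```scala
// Placeholder identifiers throughout.
def partitionQueries(project: String, dataset: String, table: String, partCol: String): (String, String) = {
  // 1) Native tables: partitions come straight from the information schema.
  val fromSchema =
    s"""SELECT partition_id
       |FROM `$project.$dataset.INFORMATION_SCHEMA.PARTITIONS`
       |WHERE table_name = '$table'""".stripMargin
  // 2) Views carry no partition metadata: fall back to a blind distinct scan.
  val blindDistinct = s"SELECT DISTINCT $partCol FROM `$project.$dataset.$table`"
  (fromSchema, blindDistinct)
}
```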

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update




## Summary by CodeRabbit

- **Bug Fixes**
- Improved error handling for missing partition columns, providing
clearer error messages and a more robust fallback method for retrieving
partition values.
  
- **Refactor**
- Centralized the handling of missing partition columns for more
consistent behavior across the application.


---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

- Plumb ranges through partition listing operations.
- Don't use try/catch as control flow; just check BigQuery tables directly
to see whether each is a view or a native table (sketch below).
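
A minimal sketch of the explicit check using the BigQuery Java client's definition types; the surrounding plumbing is assumed.

```scala
import com.google.cloud.bigquery.{BigQuery, TableDefinition, TableId, ViewDefinition}

def isView(bq: BigQuery, tableId: TableId): Boolean = {
  val defn: TableDefinition = bq.getTable(tableId).getDefinition
  defn.isInstanceOf[ViewDefinition] // native tables use StandardTableDefinition
}
```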

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
## Summary by CodeRabbit

- **New Features**
- Added support for filtering table partitions using a filter string
across multiple storage formats (BigQuery, Delta Lake, Hive, Iceberg).
- Enhanced partition retrieval methods to allow filtering by partition
ranges.

- **Bug Fixes**
- Improved handling of partition retrieval for BigQuery views and
standard tables.

- **Style**
  - Updated method calls to use named parameters for improved clarity.

- **Tests**
- Updated tests to reflect new method signatures and parameter naming
for partition retrieval.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Improved the way partition filters are applied to enhance efficiency
and maintainability. No changes to user-facing functionality.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: tchow-zlai <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":""}
```
-->

Co-authored-by: thomaschow <[email protected]>
## Summary
- Remove the latest label view since it depends on some partition methods
that are lightly used. We don't use this Label Join anymore, so it's
fine to deprecate.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Removed Features**
- Removed support for creating and managing "latest label" views and
their associated mapping logic.
- Eliminated utility methods for checking and retrieving all table
partitions.
- **Bug Fixes**
- Improved partition presence checks to include table reachability and
more explicit partition retrieval.
- **Breaking Changes**
- Updated the return type of partition parsing to preserve order and
allow duplicate keys.
- **Tests**
- Removed tests related to partition utilities and latest label mapping.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Improved handling of tables without partition columns to ensure
smoother data loading.
- The system now gracefully loads unpartitioned tables instead of
raising errors.

- **New Features**
- Added new data sources and group-by configurations for enhanced
purchase data aggregation.
- Introduced environment-specific upload and deletion of additional
BigQuery tables to support new group-by views.

- **Bug Fixes**
- Resolved issues where missing partition columns would previously cause
exceptions, enhancing reliability for various table types.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Updated partition range handling during group-by operations to use the
full specified range for backfill instead of dynamically detected
ranges.

- **Chores**
- Simplified backfill processing to cover the entire specified range
consistently.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Tests**
- Improved and expanded tests to verify partition range filtering works
consistently between BigQuery native tables and views.
- Added a new test to ensure partition filtering over specific date
ranges returns matching results for both views and tables.
- Renamed and enhanced an existing test for better clarity and coverage.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
…r Fetcher threadpool (#726)

## Summary
PR to swap our metrics reporter from statsd to OpenTelemetry (otel) metrics.
We need otel to capture metrics at Etsy without the Prometheus statsd
exporter sidecar that they've occasionally seen issues with. Otel in
general is a popular metrics ingestion interface with a number of
supported backends (e.g. prom / datadog / gcloud / aws cloudwatch).
Wiring up otel also enables us to set up traces and spans in the repo
in the future.
Broad changes:
- Decouple the bulk of the metrics reporting logic from the
Metrics.Context. The metrics reporter we use is pluggable. Currently
this is just OpenTelemetry, but in principle we can support others in
the future (see the sketch after this list).
- Online module creates the appropriate [otel
SDK](https://opentelemetry.io/docs/languages/java/sdk/) - either we use
the [Http provider or the Prometheus Http
server](https://opentelemetry.io/docs/languages/java/configuration/#properties-exporters).
We need the Http provider to plug into Vert.x, as their Micrometer
integration works with that. The Prom http server is what Etsy is keen
for us to use.
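
A minimal sketch of the pluggable-reporter shape against the public
OpenTelemetry Java API; the `MetricsReporter` trait and `count` method
here are illustrative stand-ins, not the actual interface in this PR:

```scala
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.common.{AttributeKey, Attributes}

// Illustrative reporter abstraction; implementations are swappable.
trait MetricsReporter {
  def count(metric: String, value: Long, tags: Map[String, String]): Unit
}

class OtelMetricsReporter extends MetricsReporter {
  private val meter = GlobalOpenTelemetry.getMeter("ai.chronon")

  override def count(metric: String, value: Long, tags: Map[String, String]): Unit = {
    // Tags become otel attributes (key-value map, matching the structured tags).
    val attrs = tags.foldLeft(Attributes.builder()) { case (builder, (k, v)) =>
      builder.put(AttributeKey.stringKey(k), v)
    }.build()
    // A real implementation would cache counters per metric name instead of rebuilding.
    meter.counterBuilder(metric).build().add(value, attrs)
  }
}
```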

## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

Tested via docker container and a local instance of open telemetry:
Start up fetcher docker svc
```
docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json  -p 9000:9000  -e "GCP_PROJECT_ID=canary-443022"  -e "GOOGLE_CLOUD_PROJECT=canary-443022"  -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance"  -e "EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318"  -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json  zipline-fetcher:latest
```

And then otel:
```
./otelcol --config otel-collector-config.yaml
...
```

We see:
```
2025-04-18T17:35:37.351-0400	info	ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> service.name: Str(ai.chronon)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.49.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope ai.chronon 3.7.0-M11
Metric #0
Descriptor:
     -> Name: kv_store.bigtable.cache.insert
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Value: 1
Metric #1
Descriptor:
     -> Name: kv_store.bigtable.multiGet.latency
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Count: 1
Sum: 229.000000
Min: 229.000000
Max: 229.000000
ExplicitBounds #0: 0.000000
...
Buckets #0, Count: 0
...
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced OpenTelemetry-based metrics reporting throughout the
platform, replacing the previous StatsD approach.
- Added a Dockerfile and startup script for a new Fetcher service,
supporting both AWS and GCP integrations with configurable metrics
export.
- Enhanced thread pool monitoring with a new executor that provides
detailed metrics on task execution and queue status.

- **Improvements**
- Metrics tags are now structured as key-value maps, improving clarity
and flexibility.
- Metrics reporting is now context-aware, supporting per-dataset and
per-table metrics.
- Increased thread pool queue capacity for better throughput under load.
- Replaced StatsD metrics configuration with OpenTelemetry OTLP in
service launcher and build configurations.

- **Bug Fixes**
- Improved error handling and logging in metrics reporting and thread
pool management.

- **Chores**
- Updated dependencies to include OpenTelemetry, Micrometer OTLP
registry, Prometheus, OkHttp, and Kotlin libraries.
- Refactored build and test configurations to support new telemetry
libraries and remove deprecated dependencies.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
)

## Summary
Needed for orchestration service till we move over these thrift files

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Introduced support for starting workflows with detailed parameters,
including node name, branch, date range, and partition specifications.
- Responses now include the workflow identifier when a workflow is
started.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

coderabbitai bot commented May 2, 2025

Walkthrough

This set of changes introduces a new planning and execution infrastructure for Chronon, focusing on offline and online node planning for joins and group-bys. It adds new planner classes, plan node abstractions, and utilities for metadata and table dependency management. Constants previously named with "Keyword" are renamed to "Folder" throughout the codebase, with corresponding updates in logic and tests. Several methods are refactored or made private, and new Thrift schema fields are added to support richer partition and dependency metadata. Obsolete or replaced files and tests are removed.

Changes

| Files/Paths | Change Summary |
|---|---|
| `.gitignore` | Added `api/python/ai/chronon/agent/` to ignored paths. |
| `api/src/main/scala/ai/chronon/api/Constants.scala` | Renamed constants: `*Keyword` → `*Folder` for join, group-by, staging query, and model folders. |
| `api/src/main/scala/ai/chronon/api/Extensions.scala` | Added window arithmetic/utilities, new source/metadata methods, updated key naming to use `*Folder` constants, and added methods for window and label handling. |
| `api/src/main/scala/ai/chronon/api/planner/ConfToNode.scala` | Deleted: removed configuration-to-node conversion logic and related node/asset/trait definitions. |
| `api/src/main/scala/ai/chronon/api/planner/DependencyResolver.scala` | Removed `add` and `tableDependency` methods. |
| `api/src/main/scala/ai/chronon/api/planner/GroupByOfflinePlanner.scala` | Added: new planner for group-by offline execution nodes, with step days logic, dependency extraction, and plan node construction. |
| `api/src/main/scala/ai/chronon/api/planner/JoinOfflinePlanner.scala` | Added: new planner for join offline execution, building plan node sequences for all join stages and providing semantic hashing. |
| `api/src/main/scala/ai/chronon/api/planner/MetaDataUtils.scala` | Added: metadata utility for layering/merging execution info, partition spec, and dependencies. |
| `api/src/main/scala/ai/chronon/api/planner/NodeRunner.scala` | Added: traits for batch run context and node runner, plus a stub lineage runner. |
| `api/src/main/scala/ai/chronon/api/planner/PartitionSpecWithColumn.scala` | Added: case class combining partition column and spec. |
| `api/src/main/scala/ai/chronon/api/planner/PlanNode.scala` | Added: plan node trait, utility for parsing confs, and stub plan generation methods. |
| `api/src/main/scala/ai/chronon/api/planner/Planner.scala` | Added: abstract planner class for offline/online/metrics nodes. |
| `api/src/main/scala/ai/chronon/api/planner/TableDependencies.scala` | Added: table dependency extraction logic for joins, group-bys, and sources. |
| `api/src/test/scala/ai/chronon/api/test/DependencyResolverSpec.scala` | Deleted: removed test suite for dependency resolver methods. |
| `api/thrift/api.thrift` | Moved `partitionColumn` in `Query`; added `partitionFormat`, `partitionInterval`, and `partitionLag` fields with comments. |
| `api/thrift/common.thrift` | Added `KvInfo` struct; updated `KvDependency` and `ExecutionInfo` to use/contain new fields; minor comment fix. |
| `api/thrift/orchestration.thrift` | Added node types and structs for group-by and staging queries; renamed `LabelPartNode` to `LabelJoinNode`; updated union and enums. |
| `cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreTest.scala` | Updated to use `*Folder` constants instead of `*Keyword` in key construction and entity type filtering. |
| `online/src/main/scala/ai/chronon/online/MetadataDirWalker.scala` | Updated path matching to use `*Folder` constants instead of `*Keyword`. |
| `online/src/main/scala/ai/chronon/online/fetcher/FetcherMain.scala` | Changed allowed/confType values from `*Keyword` to `*Folder` constants; updated control flow accordingly. |
| `online/src/main/scala/ai/chronon/online/fetcher/GroupByFetcher.scala` | Deduplicates requests by converting to a map before converting back to a sequence. |
| `online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala` | Updated configuration and entity type lookups to use `*Folder` constants. |
| `online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala` | Changed drift calculation logic to use a fixed timestamp for the previous summary. |
| `spark/src/main/scala/ai/chronon/spark/Driver.scala` | Made `tableUtils` implicit in two run methods; updated imports. |
| `spark/src/main/scala/ai/chronon/spark/Join.scala` | Replaced `BootstrapJob` with `JoinBootstrapJob` in left side computation. |
| `spark/src/main/scala/ai/chronon/spark/JoinBase.scala` | Replaced `BootstrapJob` with `JoinBootstrapJob` in left computation. |
| `spark/src/main/scala/ai/chronon/spark/JoinUtils.scala` | Refactored left source table name methods to use `JoinOfflinePlanner` and require implicit `TableUtils`. |
| `spark/src/main/scala/ai/chronon/spark/batch/JoinBootstrapJob.scala` | Renamed class from `BootstrapJob` to `JoinBootstrapJob`; improved doc comment. |
| `spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala` | Refactored method to use a context object for join part computation. |
| `spark/src/main/scala/ai/chronon/spark/batch/LabelJoinV2.scala` | Made three methods private; added clarifying comment. |
| `spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala` | Adjusted logic for partition range shifting; updated imports. |
| `spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala` | Added public `outputPartitionSpec` field combining column and spec. |
| `spark/src/test/scala/ai/chronon/spark/test/batch/ModularJoinTest.scala` | Replaced `BootstrapJob` with `JoinBootstrapJob` in test setup. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Planner
    participant PlanNode
    participant MetaDataUtils
    participant TableDependencies

    User->>Planner: Initialize with Join/GroupBy conf
    Planner->>MetaDataUtils: Layer metadata for node
    Planner->>TableDependencies: Extract table dependencies
    Planner->>PlanNode: Construct plan nodes (offline/online)
    PlanNode-->>User: Expose metaData, contents, semanticHash
```

Possibly related PRs

Suggested reviewers

  • varant-zlai
  • tchow-zlai

Poem

In folders not keywords, our configs now dwell,
With planners and nodes, new stories to tell.
Metadata layered, dependencies found,
Old tests and code swept off the ground.
Through joins and group-bys, our planning takes flight—
Chronon’s new engine, ready for night!
🚀

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 38623a9 and 9c5f949.

📒 Files selected for processing (2)
  • spark/src/main/scala/ai/chronon/spark/Join.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala
  • spark/src/main/scala/ai/chronon/spark/Join.scala
⏰ Context from checks skipped due to timeout of 90000ms (31)
  • GitHub Check: service_tests
  • GitHub Check: streaming_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: groupby_tests
  • GitHub Check: service_tests
  • GitHub Check: online_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: streaming_tests
  • GitHub Check: online_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: spark_tests
  • GitHub Check: flink_tests
  • GitHub Check: spark_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: join_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: batch_tests
  • GitHub Check: api_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: groupby_tests
  • GitHub Check: flink_tests
  • GitHub Check: batch_tests
  • GitHub Check: api_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: aggregator_tests
  • GitHub Check: join_tests
  • GitHub Check: enforce_triggered_workflows


coderabbitai bot left a comment

Actionable comments posted: 18

🧹 Nitpick comments (8)
api/thrift/orchestration.thrift (1)

120-124: Dangling comma style differs

PREPARE_LOGS keeps the comma while LABEL_JOIN line does not. Stick to one style.

api/src/main/scala/ai/chronon/api/planner/MetaDataUtils.scala (1)

2-18: Prune unused imports

Gson, File, FileReader, Files, Paths, Try appear unused. Kill them to cut compile time & warnings.

api/src/main/scala/ai/chronon/api/planner/PlanNode.scala (1)

18-24: listFiles loses directory context

Recursive call listFiles(file.getPath) returns absolute/relative mix; stripping "./" only on top-level means nested paths keep prefixes. Not harmful but inconsistent for downstream parseConfs. Consider baseDir.toPath.relativize.
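
One possible shape for that suggestion (a sketch, not the PR's implementation):

```scala
import java.io.File
import java.nio.file.Path

// Walk the tree once, then express every file relative to the original base
// dir so nested paths stay consistent for downstream parseConfs.
def listFiles(baseDir: File): Seq[Path] = {
  def walk(f: File): Seq[File] =
    if (f.isDirectory) Option(f.listFiles()).toSeq.flatten.flatMap(walk)
    else Seq(f)
  walk(baseDir).map(f => baseDir.toPath.relativize(f.toPath))
}
```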

api/src/main/scala/ai/chronon/api/planner/GroupByOfflinePlanner.scala (1)

12-18: effectiveStepDays shortcut ignores 0 / negative overrides

If stepDays is explicitly set to 0 (commonly used to disable stepping) it’ll be ignored. Validate value instead of just checking option presence.
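
For example (names illustrative, assuming 0 should mean "disable stepping"):

```scala
// Validate the override instead of only checking presence:
// negatives are rejected, 0 disables stepping, None falls back to the default.
def effectiveStepDays(overrideStepDays: Option[Int], default: Int): Option[Int] =
  overrideStepDays match {
    case Some(n) if n < 0 => throw new IllegalArgumentException(s"stepDays must be >= 0, got $n")
    case Some(0)          => None // stepping disabled
    case Some(n)          => Some(n)
    case None             => Some(default)
  }
```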

api/src/main/scala/ai/chronon/api/planner/TableDependencies.scala (1)

64-67: Partition interval conversion can overflow

WindowUtils.hours(specWithColumn.partitionSpec.spanMillis) converts long millis to Int length; spans > ~24y overflow. Use toLong or sanity-check.

api/src/main/scala/ai/chronon/api/Extensions.scala (3)

83-88: inverse returns negative length window – callers must guard

Some downstream code assumes non-negative lengths. Consider documenting or returning null for zero.


104-105: Precision loss in hours helper

(millis / Hour.millis).toInt truncates; e.g., 3,599,999 ms → 0 h. Use rounding or a Long length.
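
A rounding variant, assuming chronon's thrift-generated `Window`/`TimeUnit` types (sketch only):

```scala
import ai.chronon.api.{TimeUnit, Window}

object WindowArithmetic {
  private val HourMillis = 60L * 60 * 1000

  // Round to the nearest hour instead of truncating: 3,599,999 ms -> 1 h, not 0 h.
  def hours(millis: Long): Window =
    new Window(Math.round(millis.toDouble / HourMillis).toInt, TimeUnit.HOURS)
}
```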


409-415: mutationsTable may emit empty string

Guard against "" to avoid invalid table names.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between c9a08a5 and 38623a9.

📒 Files selected for processing (33)
  • .gitignore (1 hunks)
  • api/src/main/scala/ai/chronon/api/Constants.scala (1 hunks)
  • api/src/main/scala/ai/chronon/api/Extensions.scala (7 hunks)
  • api/src/main/scala/ai/chronon/api/planner/ConfToNode.scala (0 hunks)
  • api/src/main/scala/ai/chronon/api/planner/DependencyResolver.scala (1 hunks)
  • api/src/main/scala/ai/chronon/api/planner/GroupByOfflinePlanner.scala (1 hunks)
  • api/src/main/scala/ai/chronon/api/planner/JoinOfflinePlanner.scala (1 hunks)
  • api/src/main/scala/ai/chronon/api/planner/MetaDataUtils.scala (1 hunks)
  • api/src/main/scala/ai/chronon/api/planner/NodeRunner.scala (1 hunks)
  • api/src/main/scala/ai/chronon/api/planner/PartitionSpecWithColumn.scala (1 hunks)
  • api/src/main/scala/ai/chronon/api/planner/PlanNode.scala (1 hunks)
  • api/src/main/scala/ai/chronon/api/planner/Planner.scala (1 hunks)
  • api/src/main/scala/ai/chronon/api/planner/TableDependencies.scala (1 hunks)
  • api/src/test/scala/ai/chronon/api/test/DependencyResolverSpec.scala (0 hunks)
  • api/thrift/api.thrift (1 hunks)
  • api/thrift/common.thrift (3 hunks)
  • api/thrift/orchestration.thrift (2 hunks)
  • cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreTest.scala (3 hunks)
  • online/src/main/scala/ai/chronon/online/MetadataDirWalker.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/FetcherMain.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/GroupByFetcher.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/Driver.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/Join.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinBase.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/JoinBootstrapJob.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/LabelJoinV2.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/ModularJoinTest.scala (1 hunks)
💤 Files with no reviewable changes (2)
  • api/src/test/scala/ai/chronon/api/test/DependencyResolverSpec.scala
  • api/src/main/scala/ai/chronon/api/planner/ConfToNode.scala
🧰 Additional context used
🧬 Code Graph Analysis (2)
spark/src/main/scala/ai/chronon/spark/Driver.scala (2)
api/src/main/scala/ai/chronon/api/planner/PartitionSpecWithColumn.scala (1)
  • PartitionSpecWithColumn (4-4)
api/src/main/scala/ai/chronon/api/planner/RelevantLeftForJoinPart.scala (2)
  • RelevantLeftForJoinPart (16-24)
  • RelevantLeftForJoinPart (26-92)
spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (1)
api/src/main/scala/ai/chronon/api/planner/PartitionSpecWithColumn.scala (1)
  • PartitionSpecWithColumn (4-4)
⏰ Context from checks skipped due to timeout of 90000ms (31)
  • GitHub Check: streaming_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: streaming_tests
  • GitHub Check: groupby_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: spark_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: flink_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: join_tests
  • GitHub Check: online_tests
  • GitHub Check: spark_tests
  • GitHub Check: groupby_tests
  • GitHub Check: service_tests
  • GitHub Check: service_tests
  • GitHub Check: join_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: online_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: batch_tests
  • GitHub Check: flink_tests
  • GitHub Check: api_tests
  • GitHub Check: batch_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: api_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (46)
.gitignore (1)

29-29: Ignore agent directory
Added pattern to exclude api/python/ai/chronon/agent/, matching the existing chronon ignore rules.

online/src/main/scala/ai/chronon/online/fetcher/GroupByFetcher.scala (1)

178-179: Good optimization for request deduplication.

Adding .toMap before .toSeq eliminates duplicate requests by key, improving efficiency.
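
The pattern, roughly, with an illustrative request type:

```scala
case class Request(name: String, keys: Map[String, String])

// Keying by (name, keys) and rebuilding the Seq drops duplicate requests;
// since duplicates are identical, it doesn't matter which copy survives.
def dedup(requests: Seq[Request]): Seq[Request] =
  requests.map(r => (r.name, r.keys) -> r).toMap.values.toSeq
```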

spark/src/main/scala/ai/chronon/spark/batch/JoinBootstrapJob.scala (1)

19-25: Improved class naming and documentation.

Renaming from BootstrapJob to JoinBootstrapJob provides better type specificity for join operations. Documentation formatting has been enhanced for better ScalaDoc generation.

spark/src/test/scala/ai/chronon/spark/test/batch/ModularJoinTest.scala (1)

259-259: Updated instantiation to match renamed class.

Test correctly uses the new JoinBootstrapJob class name that replaced BootstrapJob, maintaining consistency with implementation changes.

api/src/main/scala/ai/chronon/api/planner/PartitionSpecWithColumn.scala (1)

1-4: Clean, minimal data container for partitioning metadata.

This simple case class effectively combines partition column name with partition specification, supporting the new planning infrastructure.

spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (2)

22-22: LGTM - New import for PartitionSpecWithColumn is necessary for the next change.

This import is needed for the outputPartitionSpec field being added below.


56-57: Integrates partitioning information into the new planner infrastructure.

This new field provides a unified representation of partition column and specification that's used implicitly across the codebase's new planning components.

spark/src/main/scala/ai/chronon/spark/Join.scala (1)

260-260: LGTM - Specialized bootstrap job improves type safety.

Replacing generic BootstrapJob with the join-specific JoinBootstrapJob aligns with similar changes throughout the codebase.

spark/src/main/scala/ai/chronon/spark/JoinBase.scala (1)

187-187: LGTM - Consistent with the change in Join.scala.

Specialized bootstrap job improves clarity and type safety for join operations.

cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreTest.scala (2)

3-3: LGTM - Updated imports to use renamed "Folder" constants.

Part of the codebase-wide renaming from "Keyword" to "Folder" constants.


215-215: LGTM - Consistent renaming from "Keyword" to "Folder" constants.

Updates all usages to the renamed constants. Test behavior remains unchanged.

Also applies to: 221-221, 231-231

online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala (2)

81-85: Constant rename looks good.
Mapping now aligns with Constants._ rename; behavior unchanged.


218-220:

✅ Verification successful

Double-check server side expects JoinFolder.
List API previously consumed JoinKeyword. If the backing KV-Store service wasn’t migrated, listing joins will silently return zero results.


🏁 Script executed:

#!/bin/bash
# Quickly confirm keys still exist under the new prefix.
# Adjust DATASET env var if needed.
kvtool list --dataset "$CHRONON_METADATA_DATASET" --prefix "$JoinFolder/" | head

Length of output: 141


🏁 Script executed:

#!/bin/bash
# Locate where JoinFolder and JoinKeyword are defined/used in the codebase
rg -n 'JoinFolder' -C2
rg -n 'JoinKeyword' -C2

Length of output: 6695


No JoinKeyword found—JoinFolder is correct
Searched codebase for JoinKeyword; only JoinFolder = "joins" exists and is used in code and tests. No migration gaps detected.

online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1)

3-3: Import addition fine.

online/src/main/scala/ai/chronon/online/MetadataDirWalker.scala (1)

87-99: Renamed folder constants applied correctly.
Logic untouched; no issues spotted.

Also applies to: 112-131

api/src/main/scala/ai/chronon/api/planner/DependencyResolver.scala (1)

4-8: Updated imports to support new planning infrastructure.

New imports added for Extensions, WindowUtils, SourceOps, and ScalaJavaConversions, aligning with the broader architectural changes in the planner subsystem.

spark/src/main/scala/ai/chronon/spark/Driver.scala (3)

22-22: Added PartitionSpecWithColumn import.

Added import for the new PartitionSpecWithColumn class that's part of the planning infrastructure.


825-825: Made tableUtils implicit.

Changed tableUtils to implicit to support methods that require implicit TableUtils in the call chain, particularly for partition specification propagation.


889-890: Made tableUtils implicit in JoinPartJobRun.

Same change as in SourceJobRun - making tableUtils implicit for downstream method calls.

api/src/main/scala/ai/chronon/api/planner/Planner.scala (1)

1-7: New Planner abstraction for execution planning.

This abstract class provides a framework for different planners (Join, GroupBy, etc.) to generate execution plan nodes. It separates offline, online, and metrics processing.

The TODO comment for metricsNodes indicates future work. Consider implementing this in a follow-up PR.

spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (2)

5-6: Updated imports for planning infrastructure.

Added PartitionSpecWithColumn import and reorganized other imports to support the planning changes.


73-73: Refined condition for effective range adjustment.

The condition for adjusting the partition range is now more specific - only applies when left data model is EVENTS and join part accuracy is SNAPSHOT.
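
Roughly the guard being described, with the accessor wiring left as an assumption:

```scala
import ai.chronon.api.{Accuracy, DataModel}

// Shift the effective partition range only for an EVENTS left side
// combined with a SNAPSHOT-accuracy join part.
def shouldShiftRange(leftModel: DataModel, partAccuracy: Accuracy): Boolean =
  leftModel == DataModel.EVENTS && partAccuracy == Accuracy.SNAPSHOT
```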

api/src/main/scala/ai/chronon/api/Constants.scala (1)

84-87: Renamed constants for clarity

Changed suffixes from "Keyword" to "Folder" to better reflect their purpose.

online/src/main/scala/ai/chronon/online/fetcher/FetcherMain.scala (2)

38-42: Updated constant references

Updated to use renamed folder constants.


202-206: Updated folder constant reference

Changed JoinKeyword to JoinFolder to match Constants.scala changes.

spark/src/main/scala/ai/chronon/spark/batch/LabelJoinV2.scala (3)

133-134: Added documentation and restricted visibility

Made method private and clarified its purpose.


307-310: Restricted method visibility

Changed method from public to private for better encapsulation.


356-357: Restricted method visibility

Changed method from public to private for better encapsulation.

api/thrift/api.thrift (1)

19-42: Enhanced partition metadata

Removed old partitionColumn field and added comprehensive partition metadata with documentation:

  • Reintroduced partitionColumn at ID 20
  • Added partitionFormat for date/time formatting
  • Added partitionInterval for source timespan
  • Added partitionLag for expected delays

These changes better support the planner infrastructure for managing dependencies and partitioning.

api/thrift/common.thrift (2)

104-108: Good refactoring to encapsulate KV metadata.

Creating a reusable KvInfo struct improves schema organization.


110-112: Cleaner dependency representation.

Using KvInfo reference streamlines the KvDependency struct.

api/src/main/scala/ai/chronon/api/planner/NodeRunner.scala (2)

6-8: Good trait design.

BatchRunContext cleanly exposes partition specification.


10-12: Flexible execution abstraction.

Generic NodeRunner trait enables type-safe configuration handling.

spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (3)

24-24: Good import addition.

Adding planner imports supports the refactored table name computation methods.


525-527: Improved table name computation.

Using JoinOfflinePlanner for left source table names ensures consistency with planning infrastructure.


529-531: Consistent naming pattern.

Using the planner for full table names maintains consistency with the approach in computeLeftSourceTableName.

spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (4)

6-6: Simplified import.

Using wildcard import reduces code verbosity.


66-71: Updated method call for context object.

Method call adjusted to use the context object pattern.


74-77: Simplified method signature.

Using JoinPartJobContext reduces parameter count and improves maintainability.


85-91: Context object used consistently.

jobContext properties accessed cleanly throughout method.

Also applies to: 94-95

api/thrift/orchestration.thrift (2)

209-213: Struct rename may break JSON/DB payloads

LabelPartNode → LabelJoinNode but retains field id 6 in NodeUnion. Verify consumers that still expect the old name.


229-247: NodeUnion re-ordering risk

Field ids 7-10 are new; good. Confirm no removed ids and update all deserializers.
Also please append a trailing comma after the last field to avoid diff churn.

api/src/main/scala/ai/chronon/api/planner/MetaDataUtils.scala (1)

42-53: outputTable ≠ null check

If copy.outputTable is null, we propagate null into outputTableInfo.table, causing downstream null-table errors. Fallback to nodeName or throw early.
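
One fail-fast option (illustrative names):

```scala
// Surface a clear error at planning time instead of a null table name downstream.
def resolvedOutputTable(outputTable: String, nodeName: String): String =
  Option(outputTable).getOrElse(
    throw new IllegalArgumentException(s"outputTable is not set for node: $nodeName"))
```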

api/src/main/scala/ai/chronon/api/planner/GroupByOfflinePlanner.scala (1)

9-11: Not planning anything yet

offlineNodes returns empty and onlineNodes is unimplemented – the planner produces no plan.

Add something like:

   override def offlineNodes: Seq[PlanNode] =
-    {
-
-    }
+    backfillNodeOpt.toSeq

   override def onlineNodes: Seq[PlanNode] = Seq.empty   // or proper impl

Also applies to: 29-33

api/src/main/scala/ai/chronon/api/Extensions.scala (2)

136-153: plus/minus require identical units

Good safety check 👍.


899-900: Constant rename looks correct

Updated to JoinFolder; matches new convention.

Comment on lines 96 to 97
val previousTs = startMs + (2 * WindowUtils.Day.millis)
val previousSummary = summariesByTimestamp.get(previousTs)

⚠️ Potential issue

Bug: drift uses fixed 2-day offset, ignores lookBackWindow.
previousTs should be ts - lookBackMs; current change also leaves lookBackMs unused.

-        val previousTs = startMs + (2 * WindowUtils.Day.millis)
+        val previousTs = ts - lookBackMs
📝 Committable suggestion


Suggested change
val previousTs = startMs + (2 * WindowUtils.Day.millis)
val previousSummary = summariesByTimestamp.get(previousTs)
val previousTs = ts - lookBackMs
val previousSummary = summariesByTimestamp.get(previousTs)

Comment on lines +134 to +136
200: optional list<KvDependency> kvDependencies
201: optional KvInfo outputKvInfo
202: optional i64 kvPollIntervalMillis

💡 Verification agent

❓ Verification inconclusive

Verify backward compatibility.

Field renaming (kvDependency → kvDependencies) and ID shift (201 → 202) may affect clients.


🏁 Script executed:

#!/bin/bash
# Check for usages of old field names in codebase
rg -i "kvDependency\b" --type scala --type java

Length of output: 49


Ensure Thrift backward compatibility for kvDependencies rename & field IDs

Ran rg -i "kvDependency\b"—no Scala/Java refs. However, renaming field 200 from kvDependency → kvDependencies and shifting subsequent IDs (outputKvInfo 202→201, kvPollIntervalMillis 203→202) breaks existing clients. To address:

• api/thrift/common.thrift:
  • Re-add the old field as a deprecated alias on a new ID; or
  • Keep kvDependencies at ID 200 and assign a new ID to the removed field
• Bump the Thrift API version & notify downstream consumers
• Update service documentation and codegen scripts

Comment on lines +14 to +19
object LineageOfflineRunner {
def readFiles(folderPath: String): Seq[Any] = {
// read files from folder using metadata
Seq.empty
}
}

⚠️ Potential issue

Implementation needed for LineageOfflineRunner.

The readFiles method is just a placeholder returning empty results.

Implement actual file reading functionality or add TODO comment with timeline.

Comment on lines 108 to 115
LEFT_SOURCE = 1
BOOTSTRAP = 2,
RIGHT_PART = 3,
MERGE = 4,
DERIVE = 5,
LABEL_PART = 6,
LABEL_JOIN = 7,


⚠️ Potential issue

Enum values renumbered – breaks wire compat

LEFT_SOURCE is inserted at position 1, shifting all subsequent numeric ids (e.g. BOOTSTRAP was 1, now 2). Any persisted data or cross-service code compiled against the old ids will deserialize to the wrong variants.
Add new values to the tail instead and keep existing ids immutable.
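
For example, a wire-compatible revision keeps the existing ids and appends (ids below are illustrative):

```thrift
enum NodeType {
    BOOTSTRAP   = 1,
    RIGHT_PART  = 2,
    MERGE       = 3,
    DERIVE      = 4,
    LABEL_PART  = 5,
    LABEL_JOIN  = 6,
    LEFT_SOURCE = 7, // new values go at the tail; existing ids stay immutable
}
```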

Comment on lines +108 to +109
LEFT_SOURCE = 1
BOOTSTRAP = 2,

⚠️ Potential issue

Missing comma → Thrift parse error

LEFT_SOURCE = 1 lacks a trailing comma like the rest of the list. Some thrift compilers require it.

-    LEFT_SOURCE = 1
+    LEFT_SOURCE = 1,
📝 Committable suggestion


Suggested change
LEFT_SOURCE = 1
BOOTSTRAP = 2,
LEFT_SOURCE = 1,
BOOTSTRAP = 2,

Comment on lines 494 to 505
def allWindows: Array[Window] = {
groupBy.aggregations
.iterator()
.toScala
.flatMap { agg =>
Option(agg.windows)
.map(_.iterator().toScala)
.getOrElse(Array(WindowUtils.Null).iterator)
}
.toArray
.distinct
}

🛠️ Refactor suggestion

allWindows emits null sentinel

Returning null increases NPE risk elsewhere. Filter before returning.

-        .toArray
-        .distinct
+        .filter(_ != null)
+        .distinct
+        .toArray

Comment on lines +43 to +45
// bootstrap tables are unfortunately unique to the join - can't be re-used if a new join part is added
val bootstrapNodeName = join.metaData.name + "/boostrap"


⚠️ Potential issue

Fix typo: “boostrap” → “bootstrap”.
Wrong table name could break downstream look-ups.

-    val bootstrapNodeName = join.metaData.name + "/boostrap"
+    val bootstrapNodeName = join.metaData.name + "/bootstrap"

result
}

override def onlineNodes: Seq[PlanNode] = ???

⚠️ Potential issue

onlineNodes still ??? – will throw NotImplementedError.
Return at least an empty Seq or implement logic.

-  override def onlineNodes: Seq[PlanNode] = ???
+  // TODO implement online planning
+  override def onlineNodes: Seq[PlanNode] = Seq.empty

Comment on lines +129 to +133
private val derivationNodeOpt: Option[JoinDerivationNode] = Option(join.derivations).map { _ =>
val result = new JoinDerivationNode()
.setJoin(join)

val derivationNodeName = join.metaData.name + "/derived"

🛠️ Refactor suggestion

Skip derivation node when list empty.

-  private val derivationNodeOpt: Option[JoinDerivationNode] = Option(join.derivations).map { _ =>
+  private val derivationNodeOpt: Option[JoinDerivationNode] =
+    Option(join.derivations).filter(_.size() > 0).map { _ =>
📝 Committable suggestion


Suggested change
private val derivationNodeOpt: Option[JoinDerivationNode] = Option(join.derivations).map { _ =>
val result = new JoinDerivationNode()
.setJoin(join)
val derivationNodeName = join.metaData.name + "/derived"
private val derivationNodeOpt: Option[JoinDerivationNode] =
Option(join.derivations).filter(_.size() > 0).map { _ =>
val result = new JoinDerivationNode()
.setJoin(join)
val derivationNodeName = join.metaData.name + "/derived"
// … rest of the block …
}

Comment on lines +280 to +283
override def semanticHash: String = ThriftJsonCodec.hexDigest({
val result = node.deepCopy()
unsetNestedMetadata(node)
result

⚠️ Potential issue

Hashing mutates original Join.
Should scrub the copy, not the source.

-      unsetNestedMetadata(node)
+      unsetNestedMetadata(result)
📝 Committable suggestion


Suggested change
override def semanticHash: String = ThriftJsonCodec.hexDigest({
val result = node.deepCopy()
unsetNestedMetadata(node)
result
override def semanticHash: String = ThriftJsonCodec.hexDigest({
val result = node.deepCopy()
- unsetNestedMetadata(node)
+ unsetNestedMetadata(result)
result

tchow-zlai and others added 5 commits May 5, 2025 10:26
## Summary

- We aren't currently using this, as the cache level is set to `NONE`.
To simplify things we'll just remove places where it was referenced.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Simplified join computation by removing internal caching and improving
error handling.
- **Chores**
  - Eliminated caching-related code to enhance system maintainability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new test for joining GCP-based training set data,
supporting different join configurations.
- Added a new backfill step for join operations in the data processing
pipeline, with environment-specific configuration handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

Putting this up again - #684

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added support for the Avro logical type `timestamp-millis` in schema
and value conversions, enabling better handling of timestamp fields.
- Enhanced BigQuery integration with a new test to verify correct
timestamp conversions based on configuration settings.

- **Documentation**
- Added detailed comments explaining the mapping behavior of timestamp
types and relevant configuration flags.

- **Refactor**
- Improved logging structure for serialized object size calculations for
better readability.
  - Minor formatting and consistency improvements in test assertions.

- **Style**
  - Removed unnecessary trailing whitespace for cleaner code.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- Drop tables in the join integration tests. 

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Enhanced cleanup process to remove additional BigQuery tables in
"canary" and "dev" environments.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>