WIP - feat: multiple partition format support #749

Open

wants to merge 427 commits into base: main

Conversation

nikhil-zlai (Contributor)

@nikhil-zlai commented May 8, 2025

Summary

We recently added support for specifying a partition format and partition column on sources.

When these sources are used in GroupBys and on the left side of joins, we first need to convert the query range into each source's partition spec, and then convert the resulting DataFrames back into the output spec format (set once globally and available via tableUtils.partitionSpec).

The translate methods added in this PR operate on date strings, partition ranges, and DataFrames, and are reused across the codebase.
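
As a rough illustration of what such translation entails (not the exact Chronon API; `PartitionSpec` fields and method names below are assumptions), translating amounts to reparsing date strings with the source format and rewriting the DataFrame's partition column into the globally configured output format:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, date_format, to_date}

// Minimal sketch, assuming a PartitionSpec that carries a column name and a
// date format pattern; not the exact Chronon signatures.
case class PartitionSpec(column: String, format: String)

object PartitionTranslate {

  // Translate a single date string from one spec's format into another's.
  def translateDate(ds: String, from: PartitionSpec, to: PartitionSpec): String =
    LocalDate
      .parse(ds, DateTimeFormatter.ofPattern(from.format))
      .format(DateTimeFormatter.ofPattern(to.format))

  // Rewrite a DataFrame's partition column from the source spec into the
  // output spec (e.g. tableUtils.partitionSpec), renaming it if needed.
  def translateDf(df: DataFrame, from: PartitionSpec, to: PartitionSpec): DataFrame = {
    val reformatted = df.withColumn(
      to.column,
      date_format(to_date(col(from.column), from.format), to.format))
    if (from.column != to.column) reformatted.drop(from.column) else reformatted
  }
}
```

A partition range would be translated by applying `translateDate` to its start and end strings under the same assumptions.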

Checklist

  • Added Unit Tests
  • Covered by existing CI -- Varant to add a breaking test and fix it.
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • Bug Fixes

    • Improved error handling for partition range validation to provide clearer input validation errors.
    • Enhanced partition range computation for joins, ensuring only relevant partitions are processed.
  • New Features

    • Added support for specifying partition formats when generating test data, allowing for more flexible and accurate test setups.
    • Introduced partition specification translation and interval window support for improved date and partition handling.
    • Added functionality to translate and reformat partition columns in data frames for consistent partitioning.
  • Tests

    • Updated tests to use explicit partition formats and improved configuration for join scenarios.
    • Ensured consistency in partitioning semantics across test cases.
  • Refactor

    • Renamed internal methods for clarity and consistency in partition range computation logic.
    • Refactored range and partition condition computations to leverage implicit context and source partition specifications for improved consistency.

tchow-zlai and others added 30 commits February 11, 2025 22:44
## Summary

- Adding new BUILD for cloud_aws
- Adding the above to the CI/CD

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced AWS integration with a new library for AWS functionality and
a framework for job submission.
- Introduced a new utility for managing job submissions, statuses, and
terminations.
- Added dedicated triggers for cloud modules to improve workflow
automation.
- **Tests**
- Improved testing coverage with additional utilities for validating
cloud functionalities and increased timeout settings for asynchronous
operations.
- **Chores**
- Updated dependency configurations to incorporate essential AWS SDK
components.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
… fix gbu for bigquery (#365)

## Summary
^^^

This is being done because the current Chronon engine assumes the partition
field column is a string type, but the partition field of BigQuery native
tables is a date type.
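
A minimal sketch of the kind of normalization this implies, assuming a Spark scan path and a `ds`-style partition column (the real change lives in the scan/read code and may not take this exact form):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, date_format}
import org.apache.spark.sql.types.DateType

// If the partition field came back as a DATE (BigQuery native tables),
// render it as the string format the engine expects, e.g. "yyyy-MM-dd".
def normalizePartitionColumn(df: DataFrame,
                             partitionColumn: String,
                             format: String = "yyyy-MM-dd"): DataFrame =
  df.schema(partitionColumn).dataType match {
    case DateType => df.withColumn(partitionColumn, date_format(col(partitionColumn), format))
    case _        => df
  }
```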

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced data processing by automatically formatting date-based
partition columns, ensuring robust handling of partitioned native tables
for more reliable data scanning.
- Simplified retrieval of required columns in the `buildServingInfo`
method, improving efficiency by directly selecting columns using a dummy
table.
- **Bug Fixes**
- Improved logging for the scanning process, providing better
traceability during data operations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
While running our Flink jobs we see periodic restarts because we run low on
direct memory. Direct memory is required by the Kafka consumer clients as well
as BigTable's client SDK. Flink's default seems to be [0
bytes](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/config/).
Bumping this to 1G seems to result in the jobs running without restarting
every hour.
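
For reference, a sketch of what the bump corresponds to programmatically; the actual change may simply set `taskmanager.memory.task.off-heap.size` in the job/cluster configuration instead:

```scala
import org.apache.flink.configuration.{Configuration, MemorySize, TaskManagerOptions}

// Equivalent of taskmanager.memory.task.off-heap.size: 1g, which the Kafka
// consumer and BigTable clients need (Flink's default is 0 bytes).
val flinkConf = new Configuration()
flinkConf.set(TaskManagerOptions.TASK_OFF_HEAP_MEMORY, MemorySize.parse("1g"))
// flinkConf would then be supplied when deploying the cluster / job.
```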

Before:
![Screenshot 2025-02-11 at 4 24
30 PM](https://github.com/user-attachments/assets/75f88687-9ecf-4fc4-b89c-e863be3ee1ff)

After:
![Screenshot 2025-02-11 at 9 16
48 AM](https://github.com/user-attachments/assets/bc66aa0a-1b92-4b46-a78d-0c70168288d7)


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Improved memory allocation for job processing, allocating additional
off-heap memory to enhance performance and reliability for applications
with high memory demands.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Improved handling of date-based partition columns during table
processing to ensure data is formatted and consolidated accurately.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced the table creation process to return clear, detailed
statuses, improving feedback during table building.
- Introduced a new method for generating table builders that integrates
with BigQuery, including error handling for partitioning.
- Streamlined data writing operations to cloud storage with automatic
path configuration and Parquet integration.
- Added explicit partitioning for DataFrame saves in Hive, Delta, and
Iceberg formats.
  
- **Refactor**
- Overhauled logic to enforce partition restrictions and incorporate
robust error handling for a smoother user experience.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: tchow-zlai <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>
…374)

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Improved the cloud upload process to include additional metadata with
each file, enhancing traceability and information capture.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Modified join analysis behavior to disable automatic table permission
checks by default, simplifying operations. Users can now explicitly
enable permission validation when needed.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

Co-authored-by: Thomas Chow <[email protected]>
…e do it in tableutils scandf (#368)

## Summary

Doing this PR because of #365

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Simplified the data join processing to eliminate redundant
transformations, ensuring a more streamlined handling of left-side data
during join operations.
- Updated underlying logic to adjust how partition details are managed,
which may influence the output schema in data processing workflows.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
#359)

## Summary

- Also refactored out google-crc32c because it was slow due to falling back
to the non-C implementation; we're using a different library instead.
Tested here:

```
(tmp_chronon) davidhan@Davids-MacBook-Pro: ~/zipline/chronon/cananry-confs (davidhan/canary) $ zipline run --conf production/group_bys/quickstart/purchases.v1_test --dataproc
/Users/davidhan/zipline/chronon/tmp_chronon/lib/python3.13/site-packages/google_crc32c/__init__.py:29: RuntimeWarning: As the c extension couldn't be imported, `google-crc32c` is using a pure python implementation that is significantly slower. If possible, please configure a c build environment and compile the extension
  warnings.warn(_SLOW_CRC32C_WARNING, RuntimeWarning)
Running with args: {'conf': 'production/group_bys/quickstart/purchases.v1_test', 'dataproc': True, 'env': 'dev', 'mode': None, 'ds': None, 'app_name': None, 'start_ds': None, 'end_ds': None, 'parallelism': None, 'repo': '.', 'online_jar': 'cloud_gcp_lib_deploy.jar', 'online_class': 'ai.chronon.integrations.cloud_gcp.GcpApiImpl', 'version': None, 'spark_version': '2.4.0', 'spark_submit_path': None, 'spark_streaming_submit_path': None, 'online_jar_fetch': None, 'sub_help': False, 'conf_type': None, 'online_args': None, 'chronon_jar': None, 'release_tag': None, 'list_apps': None, 'render_info': None, 'groupby_name': None, 'kafka_bootstrap': None, 'mock_source': False, 'savepoint_uri': None}
Setting env variables:
From <common_env> setting VERSION=latest
From <common_env> setting SPARK_SUBMIT_PATH=[TODO]/path/to/spark-submit
From <common_env> setting JOB_MODE=local[*]
From <common_env> setting HADOOP_DIR=[STREAMING-TODO]/path/to/folder/containing
From <common_env> setting CHRONON_ONLINE_CLASS=[ONLINE-TODO]your.online.class
From <common_env> setting CHRONON_ONLINE_ARGS=[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>
From <common_env> setting PARTITION_COLUMN=ds
From <common_env> setting PARTITION_FORMAT=yyyy-MM-dd
From <common_env> setting CUSTOMER_ID=canary
From <common_env> setting GCP_PROJECT_ID=canary-443022
From <common_env> setting GCP_REGION=us-central1
From <common_env> setting GCP_DATAPROC_CLUSTER_NAME=zipline-canary-cluster
From <common_env> setting GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance
From <cli_args> setting APP_NAME=chronon
From <cli_args> setting CHRONON_ONLINE_JAR=cloud_gcp_lib_deploy.jar
Local hash of /tmp/zipline/cloud_gcp_submitter_deploy.jar: Inl1LA==. GCS file jars/cloud_gcp_submitter_deploy.jar hash: Inl1LA==
/tmp/zipline/cloud_gcp_submitter_deploy.jar matches GCS zipline-artifacts-canary/jars/cloud_gcp_submitter_deploy.jar
File production/group_bys/quickstart/purchases.v1_test uploaded to metadata/purchases.v1_test in bucket zipline-warehouse-canary.
Running command: java -cp /tmp/zipline/cloud_gcp_submitter_deploy.jar ai.chronon.integrations.cloud_gcp.DataprocSubmitter group-by-backfill --conf-path=purchases.v1_test --end-date=2025-02-10  --conf-type=group_bys      --jar-uri=gs://zipline-artifacts-canary/jars/cloud_gcp_lib_deploy.jar --job-type=spark --main-class=ai.chronon.spark.Driver --additional-conf-path=additional-confs.yaml --gcs-files=gs://zipline-warehouse-canary/metadata/purchases.v1_test,gs://zipline-artifacts-canary/confs/additional-confs.yaml
WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.
Array(group-by-backfill, --conf-path=purchases.v1_test, --end-date=2025-02-10, --conf-type=group_bys, --additional-conf-path=additional-confs.yaml, --is-gcp, --gcp-project-id=canary-443022, --gcp-bigtable-instance-id=zipline-canary-instance)
Dataproc submitter job id: 1e5c75a3-5697-44e9-a65d-831b7c526108
Safe to exit. Follow the job status at: https://console.cloud.google.com/dataproc/jobs/1e5c75a3-5697-44e9-a65d-831b7c526108

                    <-----------------------------------------------------------------------------------
                    ------------------------------------------------------------------------------------                            
                                                      DATAPROC LOGS   
                    ------------------------------------------------------------------------------------                             
                    ------------------------------------------------------------------------------------>
                    
Running command: gcloud dataproc jobs wait  1e5c75a3-5697-44e9-a65d-831b7c526108 --region=us-central1
Waiting for job output...
25/02/11 03:03:35 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/02/11 03:03:35 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
Using warehouse dir: /tmp/1e5c75a3-5697-44e9-a65d-831b7c526108/local_warehouse
25/02/11 03:03:38 INFO HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml
25/02/11 03:03:38 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/02/11 03:03:38 INFO SparkEnv: Registering MapOutputTracker
25/02/11 03:03:38 INFO SparkEnv: Registering BlockManagerMaster
25/02/11 03:03:38 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/02/11 03:03:38 INFO SparkEnv: Registering OutputCommitCoordinator
25/02/11 03:03:39 INFO DataprocSparkPlugin: Registered 188 driver metrics
25/02/11 03:03:39 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8032
25/02/11 03:03:39 INFO AHSProxy: Connecting to Application History server at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:10200
25/02/11 03:03:40 INFO Configuration: resource-types.xml not found
25/02/11 03:03:40 INFO ResourceUtils: Unable to find 'resource-types.xml'.
25/02/11 03:03:41 INFO YarnClientImpl: Submitted application application_1738197659103_0071
25/02/11 03:03:42 WARN SparkConf: The configuration key 'spark.yarn.executor.failuresValidityInterval' has been deprecated as of Spark 3.5 and may be removed in the future. Please use the new key 'spark.executor.failuresValidityInterval' instead.
25/02/11 03:03:42 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal./10.128.0.17:8030
25/02/11 03:03:43 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
25/02/11 03:03:44 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-us-central1-703996152583-pqtvfptb/5d9e94ed-7649-4828-8b64-e3d58632a5d0/spark-job-history/application_1738197659103_0071.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
2025/02/11 03:03:44 INFO  SparkSessionBuilder.scala:75 - Chronon logging system initialized. Overrides spark's configuration
2025/02/11 03:04:01 INFO  TableUtils.scala:195 - Found 29, between (2023-11-02, 2023-11-30) partitions for table: canary-443022.data.quickstart_purchases_v1_test
2025/02/11 03:04:10 INFO  TableUtils.scala:195 - Found 30, between (2023-11-01, 2023-11-30) partitions for table: data.purchases
2025/02/11 03:04:10 INFO  TableUtils.scala:619 - 
Unfilled range computation:
   Output table: canary-443022.data.quickstart_purchases_v1_test
   Missing output partitions: [2023-12-01,2023-12-02,2023-12-03,2023-12-04,2023-12-05,2023-12-06,2023-12-07,2023-12-08,2023-12-09,2023-12-10,2023-12-11,2023-12-12,2023-12-13,2023-12-14,2023-12-15,2023-12-16,2023-12-17,2023-12-18,2023-12-19,2023-12-20,2023-12-21,2023-12-22,2023-12-23,2023-12-24,2023-12-25,2023-12-26,2023-12-27,2023-12-28,2023-12-29,2023-12-30,2023-12-31,2024-01-01,2024-01-02,2024-01-03,2024-01-04,2024-01-05,2024-01-06,2024-01-07,2024-01-08,2024-01-09,2024-01-10,2024-01-11,2024-01-12,2024-01-13,2024-01-14,2024-01-15,2024-01-16,2024-01-17,2024-01-18,2024-01-19,2024-01-20,2024-01-21,2024-01-22,2024-01-23,2024-01-24,2024-01-25,2024-01-26,2024-01-27,2024-01-28,2024-01-29,2024-01-30,2024-01-31,2024-02-01,2024-02-02,2024-02-03,2024-02-04,2024-02-05,2024-02-06,2024-02-07,2024-02-08,2024-02-09,2024-02-10,2024-02-11,2024-02-12,2024-02-13,2024-02-14,2024-02-15,2024-02-16,2024-02-17,2024-02-18,2024-02-19,2024-02-20,2024-02-21,2024-02-22,2024-02-23,2024-02-24,2024-02-25,2024-02-26,2024-02-27,2024-02-28,2024-02-29,2024-03-01,2024-03-02,2024-03-03,2024-03-04,2024-03-05,2024-03-06,2024-03-07,2024-03-08,2024-03-09,2024-03-10,2024-03-11,2024-03-12,2024-03-13,2024-03-14,2024-03-15,2024-03-16,2024-03-17,2024-03-18,2024-03-19,2024-03-20,2024-03-21,2024-03-22,2024-03-23,2024-03-24,2024-03-25,2024-03-26,2024-03-27,2024-03-28,2024-03-29,2024-03-30,2024-03-31,2024-04-01,2024-04-02,2024-04-03,2024-04-04,2024-04-05,2024-04-06,2024-04-07,2024-04-08,2024-04-09,2024-04-10,2024-04-11,2024-04-12,2024-04-13,2024-04-14,2024-04-15,2024-04-16,2024-04-17,2024-04-18,2024-04-19,2024-04-20,2024-04-21,2024-04-22,2024-04-23,2024-04-24,2024-04-25,2024-04-26,2024-04-27,2024-04-28,2024-04-29,2024-04-30,2024-05-01,2024-05-02,2024-05-03,2024-05-04,2024-05-05,2024-05-06,2024-05-07,2024-05-08,2024-05-09,2024-05-10,2024-05-11,2024-05-12,2024-05-13,2024-05-14,2024-05-15,2024-05-16,2024-05-17,2024-05-18,2024-05-19,2024-05-20,2024-05-21,2024-05-22,2024-05-23,2024-05-24,2024-05-25,2024-05-26,2024-05-27,2024-05-28,2024-05-29,2024-05-30,2024-05-31,2024-06-01,2024-06-02,2024-06-03,2024-06-04,2024-06-05,2024-06-06,2024-06-07,2024-06-08,2024-06-09,2024-06-10,2024-06-11,2024-06-12,2024-06-13,2024-06-14,2024-06-15,2024-06-16,2024-06-17,2024-06-18,2024-06-19,2024-06-20,2024-06-21,2024-06-22,2024-06-23,2024-06-24,2024-06-25,2024-06-26,2024-06-27,2024-06-28,2024-06-29,2024-06-30,2024-07-01,2024-07-02,2024-07-03,2024-07-04,2024-07-05,2024-07-06,2024-07-07,2024-07-08,2024-07-09,2024-07-10,2024-07-11,2024-07-12,2024-07-13,2024-07-14,2024-07-15,2024-07-16,2024-07-17,2024-07-18,2024-07-19,2024-07-20,2024-07-21,2024-07-22,2024-07-23,2024-07-24,2024-07-25,2024-07-26,2024-07-27,2024-07-28,2024-07-29,2024-07-30,2024-07-31,2024-08-01,2024-08-02,2024-08-03,2024-08-04,2024-08-05,2024-08-06,2024-08-07,2024-08-08,2024-08-09,2024-08-10,2024-08-11,2024-08-12,2024-08-13,2024-08-14,2024-08-15,2024-08-16,2024-08-17,2024-08-18,2024-08-19,2024-08-20,2024-08-21,2024-08-22,2024-08-23,2024-08-24,2024-08-25,2024-08-26,2024-08-27,2024-08-28,2024-08-29,2024-08-30,2024-08-31,2024-09-01,2024-09-02,2024-09-03,2024-09-04,2024-09-05,2024-09-06,2024-09-07,2024-09-08,2024-09-09,2024-09-10,2024-09-11,2024-09-12,2024-09-13,2024-09-14,2024-09-15,2024-09-16,2024-09-17,2024-09-18,2024-09-19,2024-09-20,2024-09-21,2024-09-22,2024-09-23,2024-09-24,2024-09-25,2024-09-26,2024-09-27,2024-09-28,2024-09-29,2024-09-30,2024-10-01,2024-10-02,2024-10-03,2024-10-04,2024-10-05,2024-10-06,2024-10-07,2024-10-08,2024-10-09,2024-10-10,2024-10-11,2024-10-12,2024-10-13,2024-10-14,2024-10-15,2024
-10-16,2024-10-17,2024-10-18,2024-10-19,2024-10-20,2024-10-21,2024-10-22,2024-10-23,2024-10-24,2024-10-25,2024-10-26,2024-10-27,2024-10-28,2024-10-29,2024-10-30,2024-10-31,2024-11-01,2024-11-02,2024-11-03,2024-11-04,2024-11-05,2024-11-06,2024-11-07,2024-11-08,2024-11-09,2024-11-10,2024-11-11,2024-11-12,2024-11-13,2024-11-14,2024-11-15,2024-11-16,2024-11-17,2024-11-18,2024-11-19,2024-11-20,2024-11-21,2024-11-22,2024-11-23,2024-11-24,2024-11-25,2024-11-26,2024-11-27,2024-11-28,2024-11-29,2024-11-30,2024-12-01,2024-12-02,2024-12-03,2024-12-04,2024-12-05,2024-12-06,2024-12-07,2024-12-08,2024-12-09,2024-12-10,2024-12-11,2024-12-12,2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18,2024-12-19,2024-12-20,2024-12-21,2024-12-22,2024-12-23,2024-12-24,2024-12-25,2024-12-26,2024-12-27,2024-12-28,2024-12-29,2024-12-30,2024-12-31,2025-01-01,2025-01-02,2025-01-03,2025-01-04,2025-01-05,2025-01-06,2025-01-07,2025-01-08,2025-01-09,2025-01-10,2025-01-11,2025-01-12,2025-01-13,2025-01-14,2025-01-15,2025-01-16,2025-01-17,2025-01-18,2025-01-19,2025-01-20,2025-01-21,2025-01-22,2025-01-23,2025-01-24,2025-01-25,2025-01-26,2025-01-27,2025-01-28,2025-01-29,2025-01-30,2025-01-31,2025-02-01,2025-02-02,2025-02-03,2025-02-04,2025-02-05,2025-02-06,2025-02-07,2025-02-08,2025-02-09,2025-02-10]
   Input tables: data.purchases
   Missing input partitions: [2023-12-01,2023-12-02,2023-12-03,2023-12-04,2023-12-05,2023-12-06,2023-12-07,2023-12-08,2023-12-09,2023-12-10,2023-12-11,2023-12-12,2023-12-13,2023-12-14,2023-12-15,2023-12-16,2023-12-17,2023-12-18,2023-12-19,2023-12-20,2023-12-21,2023-12-22,2023-12-23,2023-12-24,2023-12-25,2023-12-26,2023-12-27,2023-12-28,2023-12-29,2023-12-30,2023-12-31,2024-01-01,2024-01-02,2024-01-03,2024-01-04,2024-01-05,2024-01-06,2024-01-07,2024-01-08,2024-01-09,2024-01-10,2024-01-11,2024-01-12,2024-01-13,2024-01-14,2024-01-15,2024-01-16,2024-01-17,2024-01-18,2024-01-19,2024-01-20,2024-01-21,2024-01-22,2024-01-23,2024-01-24,2024-01-25,2024-01-26,2024-01-27,2024-01-28,2024-01-29,2024-01-30,2024-01-31,2024-02-01,2024-02-02,2024-02-03,2024-02-04,2024-02-05,2024-02-06,2024-02-07,2024-02-08,2024-02-09,2024-02-10,2024-02-11,2024-02-12,2024-02-13,2024-02-14,2024-02-15,2024-02-16,2024-02-17,2024-02-18,2024-02-19,2024-02-20,2024-02-21,2024-02-22,2024-02-23,2024-02-24,2024-02-25,2024-02-26,2024-02-27,2024-02-28,2024-02-29,2024-03-01,2024-03-02,2024-03-03,2024-03-04,2024-03-05,2024-03-06,2024-03-07,2024-03-08,2024-03-09,2024-03-10,2024-03-11,2024-03-12,2024-03-13,2024-03-14,2024-03-15,2024-03-16,2024-03-17,2024-03-18,2024-03-19,2024-03-20,2024-03-21,2024-03-22,2024-03-23,2024-03-24,2024-03-25,2024-03-26,2024-03-27,2024-03-28,2024-03-29,2024-03-30,2024-03-31,2024-04-01,2024-04-02,2024-04-03,2024-04-04,2024-04-05,2024-04-06,2024-04-07,2024-04-08,2024-04-09,2024-04-10,2024-04-11,2024-04-12,2024-04-13,2024-04-14,2024-04-15,2024-04-16,2024-04-17,2024-04-18,2024-04-19,2024-04-20,2024-04-21,2024-04-22,2024-04-23,2024-04-24,2024-04-25,2024-04-26,2024-04-27,2024-04-28,2024-04-29,2024-04-30,2024-05-01,2024-05-02,2024-05-03,2024-05-04,2024-05-05,2024-05-06,2024-05-07,2024-05-08,2024-05-09,2024-05-10,2024-05-11,2024-05-12,2024-05-13,2024-05-14,2024-05-15,2024-05-16,2024-05-17,2024-05-18,2024-05-19,2024-05-20,2024-05-21,2024-05-22,2024-05-23,2024-05-24,2024-05-25,2024-05-26,2024-05-27,2024-05-28,2024-05-29,2024-05-30,2024-05-31,2024-06-01,2024-06-02,2024-06-03,2024-06-04,2024-06-05,2024-06-06,2024-06-07,2024-06-08,2024-06-09,2024-06-10,2024-06-11,2024-06-12,2024-06-13,2024-06-14,2024-06-15,2024-06-16,2024-06-17,2024-06-18,2024-06-19,2024-06-20,2024-06-21,2024-06-22,2024-06-23,2024-06-24,2024-06-25,2024-06-26,2024-06-27,2024-06-28,2024-06-29,2024-06-30,2024-07-01,2024-07-02,2024-07-03,2024-07-04,2024-07-05,2024-07-06,2024-07-07,2024-07-08,2024-07-09,2024-07-10,2024-07-11,2024-07-12,2024-07-13,2024-07-14,2024-07-15,2024-07-16,2024-07-17,2024-07-18,2024-07-19,2024-07-20,2024-07-21,2024-07-22,2024-07-23,2024-07-24,2024-07-25,2024-07-26,2024-07-27,2024-07-28,2024-07-29,2024-07-30,2024-07-31,2024-08-01,2024-08-02,2024-08-03,2024-08-04,2024-08-05,2024-08-06,2024-08-07,2024-08-08,2024-08-09,2024-08-10,2024-08-11,2024-08-12,2024-08-13,2024-08-14,2024-08-15,2024-08-16,2024-08-17,2024-08-18,2024-08-19,2024-08-20,2024-08-21,2024-08-22,2024-08-23,2024-08-24,2024-08-25,2024-08-26,2024-08-27,2024-08-28,2024-08-29,2024-08-30,2024-08-31,2024-09-01,2024-09-02,2024-09-03,2024-09-04,2024-09-05,2024-09-06,2024-09-07,2024-09-08,2024-09-09,2024-09-10,2024-09-11,2024-09-12,2024-09-13,2024-09-14,2024-09-15,2024-09-16,2024-09-17,2024-09-18,2024-09-19,2024-09-20,2024-09-21,2024-09-22,2024-09-23,2024-09-24,2024-09-25,2024-09-26,2024-09-27,2024-09-28,2024-09-29,2024-09-30,2024-10-01,2024-10-02,2024-10-03,2024-10-04,2024-10-05,2024-10-06,2024-10-07,2024-10-08,2024-10-09,2024-10-10,2024-10-11,2024-10-12,2024-10-13,2024-10-14,2024-10-15,2024-
10-16,2024-10-17,2024-10-18,2024-10-19,2024-10-20,2024-10-21,2024-10-22,2024-10-23,2024-10-24,2024-10-25,2024-10-26,2024-10-27,2024-10-28,2024-10-29,2024-10-30,2024-10-31,2024-11-01,2024-11-02,2024-11-03,2024-11-04,2024-11-05,2024-11-06,2024-11-07,2024-11-08,2024-11-09,2024-11-10,2024-11-11,2024-11-12,2024-11-13,2024-11-14,2024-11-15,2024-11-16,2024-11-17,2024-11-18,2024-11-19,2024-11-20,2024-11-21,2024-11-22,2024-11-23,2024-11-24,2024-11-25,2024-11-26,2024-11-27,2024-11-28,2024-11-29,2024-11-30,2024-12-01,2024-12-02,2024-12-03,2024-12-04,2024-12-05,2024-12-06,2024-12-07,2024-12-08,2024-12-09,2024-12-10,2024-12-11,2024-12-12,2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18,2024-12-19,2024-12-20,2024-12-21,2024-12-22,2024-12-23,2024-12-24,2024-12-25,2024-12-26,2024-12-27,2024-12-28,2024-12-29,2024-12-30,2024-12-31,2025-01-01,2025-01-02,2025-01-03,2025-01-04,2025-01-05,2025-01-06,2025-01-07,2025-01-08,2025-01-09,2025-01-10,2025-01-11,2025-01-12,2025-01-13,2025-01-14,2025-01-15,2025-01-16,2025-01-17,2025-01-18,2025-01-19,2025-01-20,2025-01-21,2025-01-22,2025-01-23,2025-01-24,2025-01-25,2025-01-26,2025-01-27,2025-01-28,2025-01-29,2025-01-30,2025-01-31,2025-02-01,2025-02-02,2025-02-03,2025-02-04,2025-02-05,2025-02-06,2025-02-07,2025-02-08,2025-02-09,2025-02-10]
   Unfilled Partitions: []
   Unfilled ranges: 

2025/02/11 03:04:10 INFO  GroupBy.scala:722 - Nothing to backfill for canary-443022.data.quickstart_purchases_v1_test - given
endPartition of 2025-02-10
backfill start of 2023-11-01
Exiting...
Job [1e5c75a3-5697-44e9-a65d-831b7c526108] finished successfully.
done: true
driverControlFilesUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/1e5c75a3-5697-44e9-a65d-831b7c526108/
driverOutputResourceUri: gs://dataproc-staging-us-central1-703996152583-lxespibx/google-cloud-dataproc-metainfo/5d9e94ed-7649-4828-8b64-e3d58632a5d0/jobs/1e5c75a3-5697-44e9-a65d-831b7c526108/driveroutput
jobUuid: 1e5c75a3-5697-44e9-a65d-831b7c526108
placement:
  clusterName: zipline-canary-cluster
  clusterUuid: 5d9e94ed-7649-4828-8b64-e3d58632a5d0
reference:
  jobId: 1e5c75a3-5697-44e9-a65d-831b7c526108
  projectId: canary-443022
sparkJob:
  args:
  - group-by-backfill
  - --conf-path=purchases.v1_test
  - --end-date=2025-02-10
  - --conf-type=group_bys
  - --additional-conf-path=additional-confs.yaml
  - --is-gcp
  - --gcp-project-id=canary-443022
  - --gcp-bigtable-instance-id=zipline-canary-instance
  fileUris:
  - gs://zipline-warehouse-canary/metadata/purchases.v1_test
  - gs://zipline-artifacts-canary/confs/additional-confs.yaml
  jarFileUris:
  - gs://zipline-artifacts-canary/jars/cloud_gcp_lib_deploy.jar
  mainClass: ai.chronon.spark.Driver
status:
  state: DONE
  stateStartTime: '2025-02-11T03:04:13.983885Z'
statusHistory:
- state: PENDING
  stateStartTime: '2025-02-11T03:03:30.333322Z'
- state: SETUP_DONE
  stateStartTime: '2025-02-11T03:03:30.363428Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2025-02-11T03:03:30.565778Z'
yarnApplications:
- name: groupBy_quickstart.purchases.v1_test_backfill
  progress: 1.0
  state: FINISHED
  trackingUrl: http://zipline-canary-cluster-m.us-central1-c.c.canary-443022.internal.:8088/proxy/application_1738197659103_0071/

```

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Improved user feedback with a direct monitoring URL for background job
status.

- **Improvements**
  - Enhanced error handling and output display during job submissions.
- Streamlined environment configuration retrieval for greater
reliability.
- Introduced color-coded terminal messaging for clearer status
indications.

- **Dependencies**
  - Updated core dependency libraries to support improved functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Tests**
- Streamlined several test suites by standardizing their execution
without legacy tagging filters.
- Ensured that core test logic remains consistent while simplifying the
test execution process.

- **Chores**
- Removed redundant tagging functionalities to reduce complexity and
improve test maintainability.
- Increased test timeout from 900 seconds to 3000 seconds to allow for
longer test execution.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

Allow setting the partition column name on sources. It is mapped to the
default partition column upon read and during partition checking.
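
Roughly, the mapping on read can be thought of as renaming the source's partition column to the globally configured one; a sketch under that assumption (identifiers are illustrative, not the actual implementation):

```scala
import org.apache.spark.sql.DataFrame

// Map a source-specific partition column (e.g. "date_key") onto the default
// partition column (e.g. "ds") so downstream range checks see a single name.
def mapPartitionColumn(df: DataFrame,
                       sourceColumn: Option[String],
                       defaultColumn: String): DataFrame =
  sourceColumn match {
    case Some(c) if c != defaultColumn => df.withColumnRenamed(c, defaultColumn)
    case _                             => df
  }
```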

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enabled configurable partition columns in query, join, and data
generation operations for improved data partitioning.
- **Refactor**
- Streamlined partition handling and consolidated import structures to
enhance workflow efficiency.
- **Tests**
- Added test cases for verifying partition column functionality and
adjusted data generation volumes for better validation.
- Introduced new tests specifically for different partition columns to
ensure accurate handling of partitioned data.

These enhancements provide increased flexibility and accuracy in
managing partitioned datasets during data processing and join
operations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
Co-authored-by: Nikhil Simha <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new query to retrieve purchase records with date range
filtering.
- Enhanced data retrieval by including additional contextual metadata
for improved insights.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Introduced dedicated testing workflows covering multiple system
components to enhance overall reliability.
- Added new test suites for various components to enhance testing
granularity.
- **Refactor**
- Streamlined code organization with improved package structures and
consolidated imports across test modules.
- **Chores**
- Upgraded automated testing configurations with optimized resource
settings for improved performance and stability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Adjusted the test execution timeout setting from a longer duration to
900 seconds to ensure tests complete more promptly.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Thomas Chow <[email protected]>
…nd not pushed to remote (#385)

## Summary
I've been seeing that it's difficult to track what changes went into the
artifacts we push to Etsy and canary, especially when it comes to tracking
performance regressions for Spark jobs from one day to the next.

This adds a check that disallows pushes to any customer artifacts if the
branch is dirty; all changes need to at least be pushed to remote. It also
adds a metadata tag with the commit and branch.
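
A sketch of the check's logic, assuming the upload tooling can shell out to git (the actual implementation may be a shell or Python script rather than Scala):

```scala
import scala.sys.process._

// Refuse to build/upload customer artifacts from a dirty or unpushed branch.
def assertCleanAndPushed(): Unit = {
  val dirty    = "git status --porcelain".!!.trim.nonEmpty
  val branch   = "git rev-parse --abbrev-ref HEAD".!!.trim
  val unpushed = s"git log origin/$branch..HEAD --oneline".!!.trim.nonEmpty
  require(!dirty, "Working tree has uncommitted changes; commit before pushing artifacts.")
  require(!unpushed, s"Branch $branch has commits not pushed to remote; push before uploading.")
}

// Metadata to tag onto the uploaded artifact for traceability.
def artifactMetadata(): Map[String, String] = Map(
  "commit" -> "git rev-parse HEAD".!!.trim,
  "branch" -> "git rev-parse --abbrev-ref HEAD".!!.trim
)
```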


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Introduced consistency checks during the build and upload process to
verify that local changes are committed and branches are in sync.
- Enhanced artifact metadata now includes additional context about the
code state at the time of upload.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
… top (#380)

## Summary
While trying to read the updated beacon top topic we hit issues because the
number of Avro fields is greater than Spark's default codegen limit of 100.
As a result the whole-stage codegen code is incorrect and we either end up
with segfaults (unit tests) or garbled events (prod Flink jobs). This PR bumps
the limit to allow us to read beacon top (374 fields) and adds an assert in
CatalystUtil's whole-stage codegen path so we fail fast if we ever encounter
more fields than our bumped limit.
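
A sketch of the kind of limit bump described; the exact value and where it is applied may differ (`spark.sql.codegen.maxFields` defaults to 100):

```scala
import org.apache.spark.sql.SparkSession

// Raise the whole-stage codegen field limit so wide Avro records (e.g. the
// 374-field beacon top schema) don't hit the default 100-field ceiling.
val spark = SparkSession.builder()
  .appName("beacon-top-reader")
  .master("local[*]")
  .config("spark.sql.codegen.maxFields", "1000")
  .getOrCreate()

// Downstream, an assertion can guard against schemas wider than the bumped limit.
def assertWithinCodegenLimit(numFields: Int): Unit = {
  val limit = spark.conf.get("spark.sql.codegen.maxFields").toInt
  require(numFields <= limit, s"Schema has $numFields fields, above codegen limit $limit")
}
```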

## Checklist
- [X] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced data processing robustness with improved handling and early
error detection for large schemas.
  - Refined SQL query formatting for clearer logical conditions.

- **Tests**
  - Added a new validation for large schema deserialization.
  - Updated test definitions to improve structure and readability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- Make the thrift gen python executable, use `py_binary` to support
python generally

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Enhanced the build process for a key automation tool by streamlining
its execution and command handling, leading to improved overall build
reliability and performance.
- Transitioned the export mechanism of a Python script to a defined
executable binary target for better integration within the build system.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

Co-authored-by: Thomas Chow <[email protected]>
## Summary
- Release Notes:
https://spark.apache.org/releases/spark-release-3-5-4.html
- https://issues.apache.org/jira/browse/SPARK-49791 is a good one for
us.
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Upgraded underlying Apache Spark libraries to version 3.5.4,
delivering enhanced performance, stability, and compatibility. This
update improves processing efficiency and backend reliability, ensuring
smoother and more secure data operations. End-users may notice more
robust and responsive interactions as a result of these improvements,
further enhancing overall system performance.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Thomas Chow <[email protected]>
## Summary

- Even though I'm eager to get ahead here, let's not go too crazy and
accidentally shoot ourselves in the foot. Let's stay pinned to what our
clusters have (3.5.1) until those upgrade.



## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated core Spark libraries—impacting SQL, Hive, Streaming, and Avro
features—to version 3.5.1 to ensure enhanced stability and improved
integration across Spark-powered functionalities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Grant and I were chatting about the high number of hosts needed for the
beacon top Flink jobs (24). This is because the topic parallelism is 96 and we
squeeze 4 slots per TM (so 96 / 4 = 24 hosts). Given that folks often
over-provision Kafka topics in terms of partitions, we're going with a default
of scaling down by 1/4. Will look into wiring up Flink autoscaling as a
follow-up so this isn't hardcoded.
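
A sketch of the parallelism calculation this implies; the 1/4 factor and names are assumptions based on the description:

```scala
// Scale Flink source parallelism down from the Kafka topic's partition count,
// since topics are often over-provisioned. 96 partitions / 4 -> 24 subtasks,
// which at 4 slots per TaskManager needs 6 hosts instead of 24.
val scaleDownFactor = 4

def sourceParallelism(topicPartitions: Int): Int =
  math.max(1, topicPartitions / scaleDownFactor)

// e.g. sourceParallelism(96) == 24
```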

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Optimized stream processing by refining the parallelism calculation.
The system now applies a scaling factor to better adjust the number of
active processing units, which may result in improved efficiency under
certain conditions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Documentation**
- Clarified command instructions and added informational notes to set
expectations during initial builds.

- **New Features**
- Introduced new build options for modular construction of components,
including dedicated commands for hub and cloud modules.
  - Added an automated script to streamline the frontend build process.

- **Chores**
- Updated container setup and startup processes to utilize revised
deployment artifacts.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
- trim down tableutils
- add iceberg runtime dependency to cloud_gcp
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
  - Added a runtime dependency to enhance Spark processing.
  - Introduced a consolidated method for computing partition ranges.

- **Refactor**
- Streamlined import sections and simplified join analysis by removing
redundant permission checks.
  
- **Bug Fixes**
- Removed methods related to table permission checks, impacting access
control functionality.

- **Tests**
  - Removed an outdated test for table permission verification.
  
- **Chores**
- Updated the project’s dependency configuration to include the new
Spark runtime artifact.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Changed the backend code to only compute 3 percentiles (p5, p50, p95)
for returning to the frontend.
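
A hypothetical sketch of the filtering this describes (per the release notes, converting percentile strings to indices and keeping only the requested ones); names and the full percentile series are assumptions, not the actual backend API:

```scala
object PercentileFilter {
  // Percentiles the backend originally computes, e.g. p0, p5, ..., p100.
  val computed: Seq[String] = (0 to 100 by 5).map(p => s"p$p")

  // "p95" -> index into the computed series.
  def toIndex(p: String): Int = computed.indexOf(p)

  // Keep only the requested percentiles from a full series of values,
  // assuming `values` is aligned with `computed`.
  def filter(values: Seq[Double],
             requested: Seq[String] = Seq("p5", "p50", "p95")): Seq[Double] =
    requested.map(toIndex).filter(_ >= 0).map(values)
}

// Usage: given one value per computed percentile, keep just three of them.
// val trimmed = PercentileFilter.filter(fullSeries)
```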

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Enhanced statistical data processing to consistently handle cases with
missing values by using a robust placeholder, ensuring clearer
downstream analytics.
- Adjusted the percentile chart configuration so that the 95th, 50th,
and 5th percentiles are accurately rendered, providing more reliable
insights for users.
- Relaxed the null ratio validation in summary data, allowing for a
broader acceptance of null values, which may affect drift metric
interpretations.

- **New Features**
- Introduced methods for converting percentile strings to index values
and filtering percentiles based on user-defined requests, improving data
handling and representation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Changes to support builds/tests with both Scala 2.12 and 2.13. By default we
build against 2.12; pass the "--config scala_2.13" option to "bazel
build/test" to override it.

ScalaFmt seems to be breaking for 2.13 with the bazel rules_scala package. A
[fix](bazel-contrib/rules_scala#1631) is already deployed, but a release with
that change is not available yet, so ScalaFmt checks are temporarily disabled
for 2.13; we'll enable them once the fix is released.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Enabled flexible Scala version selection (2.12 and 2.13) for smoother
builds and enhanced compatibility.
- Introduced a default Scala version constant and a repository rule for
improved version management.
- Added support for additional Scala 2.13 dependencies in the build
configuration.

- **Refactor and Improvements**
- Streamlined build and dependency management for increased stability
and performance.
- Consolidated collection conversion utilities to boost reliability in
tests and runtime processing.
- Enhanced type safety and clarity in collection handling across various
modules.
- Improved handling of Scala collections and maps throughout the
codebase for better type consistency and safety.
- Updated method implementations to ensure explicit type conversions,
enhancing clarity and preventing runtime errors.
- Modified method signatures and internal logic to utilize `Seq` for
improved type clarity and consistency.
- Enhanced the `maven_artifact` function to accept an optional version
parameter for better dependency management.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- #381 introduced the ability
to configure a partition column at the node-level. This PR simply fixes
a missed spot on the plumbing of the new StagingQuery attribute.
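
The fix amounts to threading the node-level column through with a fallback to the global one; a small sketch under that assumption (field names are illustrative):

```scala
// Prefer the StagingQuery's own partition column, falling back to the
// globally configured one when the node doesn't set it.
def effectivePartitionColumn(nodePartitionColumn: Option[String],
                             globalPartitionColumn: String): String =
  nodePartitionColumn.filter(_.nonEmpty).getOrElse(globalPartitionColumn)
```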

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced the query builder to support specifying a partition column,
providing greater customization for query formation and partitioning.
- **Improvements**
- Improved handling of partition columns by introducing a fallback
mechanism to ensure valid values are used when necessary.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
To add CI checks for making sure we are able to build and test all
modules on both scala 2.12 and 2.13 versions.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Updated automated testing workflows to support Scala 2.12 and added
new workflows for Scala 2.13, ensuring consistent testing for both Spark
and non-Spark modules.

- **Documentation**
- Enhanced build instructions with updated commands for creating Uber
Jars and new automation shortcuts to streamline code formatting,
committing, and pushing changes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Added pinning support for both our maven and spark repositories so we
don't have to resolve them during builds.

Going forward whenever we make any updates to the artifacts in either
maven or spark repositories, we would need to re-pin the changed repos
using following commands and check-in the updated json files.

```
REPIN=1 bazel run @maven//:pin
REPIN=1 bazel run @spark//:pin
```

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Integrated enhanced repository management for Maven and Spark,
providing improved dependency installation.
- Added support for JSON configuration files for Maven and Spark
installations.

- **Chores**
- Updated documentation to include instructions on pinning Maven
artifacts and managing dependency versions effectively.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
A VSCode plugin for feature authoring that detects errors and uses data
sampling to speed up the iteration cycle. The goal is to reduce the amount of
command memorization, typing/clicking, and waiting for clusters to spin up and
jobs to finish.

In this example, we have a complex expression operating on nested data.
The eval button appears above Chronon types.

When you click on the Eval button, it samples your data, runs your code
and shows errors or transformed result within seconds.



![zipline_vscode_plugin](https://github.com/user-attachments/assets/5ac56764-f6e7-4998-b5aa-1f4cabde42f9)


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested (see above)
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new Visual Studio Code extension that enhances Python
development.
- The extension displays an evaluation button alongside specific
assignment statements in Python files, allowing users to trigger
evaluation commands directly in the terminal.
- Added a command to execute evaluation actions related to Zipline AI
configurations.
  
- **Documentation**
  - Added a new LICENSE file containing the MIT License text.
  
- **Configuration**
- Introduced new configuration files for TypeScript and Webpack to
support the extension's development and build processes.
  
- **Exclusions**
- Updated `.gitignore` and added `.vscodeignore` to streamline version
control and packaging processes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Moved scala dependencies to separate scala_2_12 and scala_2_13
repositories so we can load the right repo based on config instead of
loading both.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Upgraded Scala dependencies to newer versions with updated
verification, ensuring improved stability.
- Removed outdated package references to streamline dependency
management.
- Introduced new repository configurations for Scala 2.12 and 2.13 to
enhance dependency management.
- Added `.gitignore` entry to exclude `node_modules` in the
`authoring/vscode` path.
  - Created `LICENSE` file with MIT License text for the new extension.
  
- **New Features**
- Introduced a Visual Studio Code extension with a CodeLens provider for
Python files, allowing users to evaluate variables directly in the
editor.

- **Refactor**
- Updated dependency declarations to utilize a new method for handling
Scala artifacts, improving consistency across the project.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Nikhil Simha <[email protected]>
tchow-zlai and others added 23 commits May 1, 2025 10:23
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":""}
```
-->

Co-authored-by: thomaschow <[email protected]>
## Summary
- Remove the latest label view since it depends on some partition methods
that are lightly used. We don't use this Label Join anymore, so it's fine to
deprecate.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Removed Features**
- Removed support for creating and managing "latest label" views and
their associated mapping logic.
- Eliminated utility methods for checking and retrieving all table
partitions.
- **Bug Fixes**
- Improved partition presence checks to include table reachability and
more explicit partition retrieval.
- **Breaking Changes**
- Updated the return type of partition parsing to preserve order and
allow duplicate keys.
- **Tests**
- Removed tests related to partition utilities and latest label mapping.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Improved handling of tables without partition columns to ensure
smoother data loading.
- The system now gracefully loads unpartitioned tables instead of
raising errors.

- **New Features**
- Added new data sources and group-by configurations for enhanced
purchase data aggregation.
- Introduced environment-specific upload and deletion of additional
BigQuery tables to support new group-by views.

- **Bug Fixes**
- Resolved issues where missing partition columns would previously cause
exceptions, enhancing reliability for various table types.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Updated partition range handling during group-by operations to use the
full specified range for backfill instead of dynamically detected
ranges.

- **Chores**
- Simplified backfill processing to cover the entire specified range
consistently.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Tests**
- Improved and expanded tests to verify partition range filtering works
consistently between BigQuery native tables and views.
- Added a new test to ensure partition filtering over specific date
ranges returns matching results for both views and tables.
- Renamed and enhanced an existing test for better clarity and coverage.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
…r Fetcher threadpool (#726)

## Summary
PR to swap our metrics reporter from statsd to open telemetry metrics.
We need otel to allow us to capture metrics in Etsy without the need of
a prometheus statsd exporter sidecar that they've seen issues with
occasionally. Otel in general is a popular metrics ingestion interface
with a number of supported backends (e.g. prom / datadog / gcloud / aws
cloudwatch). Wiring up Otel also enables us to set up traces and spans
in the repo in the future.
Broad changes:
- Decouple the bulk of the metrics reporting logic from Metrics.Context.
The metrics reporter we use is pluggable. Currently this is just the
OpenTelemetry reporter, but in principle we can support others in the future.
- The Online module creates the appropriate [otel
SDK](https://opentelemetry.io/docs/languages/java/sdk/) - either we use
the [Http provider or the Prometheus Http
server](https://opentelemetry.io/docs/languages/java/configuration/#properties-exporters).
We need the Http provider to plug into Vert.x, as their Micrometer
integration works with that. The Prometheus http server is what Etsy is
keen for us to use.
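
A minimal sketch of the pluggable reporter shape described above; the trait and class names here are illustrative, not the repo's actual `Metrics.Context` API:

```scala
import io.opentelemetry.api.OpenTelemetry
import io.opentelemetry.api.common.Attributes
import io.opentelemetry.api.metrics.Meter

// Illustrative only: a reporter interface the metrics context could delegate to,
// with an OpenTelemetry-backed implementation. Real names in the repo differ.
trait MetricsReporter {
  def count(metric: String, value: Long, tags: Map[String, String]): Unit
}

class OtelMetricsReporter(otel: OpenTelemetry) extends MetricsReporter {
  private val meter: Meter = otel.getMeter("ai.chronon")

  override def count(metric: String, value: Long, tags: Map[String, String]): Unit = {
    // A production reporter would cache instruments instead of rebuilding them per call.
    val counter = meter.counterBuilder(metric).build()
    val attrs = tags
      .foldLeft(Attributes.builder()) { case (builder, (k, v)) => builder.put(k, v) }
      .build()
    counter.add(value, attrs)
  }
}
```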

## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

Tested via a docker container and a local instance of OpenTelemetry:
Start up the fetcher docker svc:
```
docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json  -p 9000:9000  -e "GCP_PROJECT_ID=canary-443022"  -e "GOOGLE_CLOUD_PROJECT=canary-443022"  -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance"  -e "EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318"  -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json  zipline-fetcher:latest
```

And then otel:
```
./otelcol --config otel-collector-config.yaml
...
```

We see:
```
2025-04-18T17:35:37.351-0400	info	ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> service.name: Str(ai.chronon)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.49.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope ai.chronon 3.7.0-M11
Metric #0
Descriptor:
     -> Name: kv_store.bigtable.cache.insert
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Value: 1
Metric #1
Descriptor:
     -> Name: kv_store.bigtable.multiGet.latency
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> dataset: Str(TableId{tableId=CHRONON_METADATA})
     -> environment: Str(kv_store)
     -> production: Str(false)
StartTimestamp: 2025-04-18 21:31:52.180857637 +0000 UTC
Timestamp: 2025-04-18 21:35:37.18442138 +0000 UTC
Count: 1
Sum: 229.000000
Min: 229.000000
Max: 229.000000
ExplicitBounds #0: 0.000000
...
Buckets #0, Count: 0
...
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced OpenTelemetry-based metrics reporting throughout the
platform, replacing the previous StatsD approach.
- Added a Dockerfile and startup script for a new Fetcher service,
supporting both AWS and GCP integrations with configurable metrics
export.
- Enhanced thread pool monitoring with a new executor that provides
detailed metrics on task execution and queue status.

- **Improvements**
- Metrics tags are now structured as key-value maps, improving clarity
and flexibility.
- Metrics reporting is now context-aware, supporting per-dataset and
per-table metrics.
- Increased thread pool queue capacity for better throughput under load.
- Replaced StatsD metrics configuration with OpenTelemetry OTLP in
service launcher and build configurations.

- **Bug Fixes**
- Improved error handling and logging in metrics reporting and thread
pool management.

- **Chores**
- Updated dependencies to include OpenTelemetry, Micrometer OTLP
registry, Prometheus, OkHttp, and Kotlin libraries.
- Refactored build and test configurations to support new telemetry
libraries and remove deprecated dependencies.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
)

## Summary
Needed for the orchestration service until we move these thrift files over.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Introduced support for starting workflows with detailed parameters,
including node name, branch, date range, and partition specifications.
- Responses now include the workflow identifier when a workflow is
started.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- We aren't currently using this, as the cache level is set to `NONE`.
To simplify things we'll just remove places where it was referenced.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Simplified join computation by removing internal caching and improving
error handling.
- **Chores**
  - Eliminated caching-related code to enhance system maintainability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new test for joining GCP-based training set data,
supporting different join configurations.
- Added a new backfill step for join operations in the data processing
pipeline, with environment-specific configuration handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

Putting this up again - #684

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added support for the Avro logical type `timestamp-millis` in schema
and value conversions, enabling better handling of timestamp fields.
- Enhanced BigQuery integration with a new test to verify correct
timestamp conversions based on configuration settings.

- **Documentation**
- Added detailed comments explaining the mapping behavior of timestamp
types and relevant configuration flags.

- **Refactor**
- Improved logging structure for serialized object size calculations for
better readability.
  - Minor formatting and consistency improvements in test assertions.

- **Style**
  - Removed unnecessary trailing whitespace for cleaner code.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- Drop tables in the join integration tests. 

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Enhanced cleanup process to remove additional BigQuery tables in
"canary" and "dev" environments.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

- The BigQuery connector doesn't support reading large datasets well.
Instead, we'll leverage [BigQuery
Exports](https://cloud.google.com/bigquery/docs/exporting-data#sql),
which is an official approach to getting data out of BigQuery. As part
of table loading, we'll first export the data to a GCS warehouse
location, which should be a quick operation. Upon testing this on
production data, it seems very quick (<10 seconds to extract 100GB).
- There are some nuances in handling partition columns, particularly
system defined pseudocolumns. Since they don't show up in the projection
if you do a `SELECT * ...`, we'll need to [alias
them](https://cloud.google.com/bigquery/docs/querying-partitioned-tables#query_an_ingestion-time_partitioned_table).
The logic is as follows (a rough code sketch follows the config note
below):

1. Given a table, we check the information schema to see if it is
partitioned.
2. If it is partitioned, check whether the partition column is a system
defined pseudocolumn.
3. (a) If it is a system defined partition column, we'll alias that
column to an internal Chronon reserved name; if it's not, we'll simply
do a `SELECT *` with no alias.
   (b) If the table is not partitioned (e.g. in the case of a view),
we'll just do a simple `SELECT *` and apply the "partition" filters
requested by the reader.
4. After the data gets exported with the possible alias from (3a), we'll
read it back as a Spark dataframe and rename the aliased column to the
system defined partition column name. The rename is a noop if the
internal column alias is not present (i.e. there is no system defined
partition column).


We'll use the reserved catalog conf:
`spark.sql.catalog.<catalog_name>.warehouse` as the root location to do
exports, which is configured per project.
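
A rough sketch of the alias-export-rename flow above, written against plain Spark; the alias constant, method names, and paths are placeholders rather than the connector's actual code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholder for the internal Chronon reserved alias mentioned in step 3(a).
val internalAlias = "__chronon_reserved_partition__"

// Step 3: build the export statement, aliasing the system pseudocolumn when present.
def buildExportSql(bqTable: String, systemPartitionCol: Option[String], gcsUri: String): String = {
  val projection = systemPartitionCol
    .map(col => s"SELECT *, $col AS $internalAlias FROM `$bqTable`")
    .getOrElse(s"SELECT * FROM `$bqTable`")
  s"EXPORT DATA OPTIONS(uri='$gcsUri/*.parquet', format='PARQUET') AS $projection"
}

// Step 4: read the exported Parquet back and undo the alias (a noop when it is absent).
def loadExported(spark: SparkSession, gcsUri: String, systemPartitionCol: Option[String]): DataFrame = {
  val df = spark.read.parquet(gcsUri)
  systemPartitionCol.fold(df)(col => df.withColumnRenamed(internalAlias, col))
}
```

Executing the generated `EXPORT DATA` statement against BigQuery is elided here; in practice that step runs through whichever BigQuery client the connector uses.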



## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Summary by CodeRabbit

- **New Features**
- Added support for exporting BigQuery table data as Parquet files to
Google Cloud Storage, improving data loading into Spark.

- **Refactor**
- Replaced partition-based BigQuery reads with export-to-GCS approach
for enhanced performance and reliability.
- Centralized catalog retrieval logic for table formats, removing
deprecated methods and improving consistency.
  - Updated test cases to align with new catalog retrieval method.
  - Cleaned up import statements for better code organization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary
Pull out the DDB KV store rate limits, as they cause docker startup errors
due to class / jar version issues.

## Checklist
- [ ] Added Unit Tests
- [X] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
  - Removed rate limiting functionality from DynamoDB operations.
  - Eliminated dependency on Guava library from the build configuration.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Creates a workflow to trigger the platform subtree pull reusable workflow.

Also deletes the Push To Canary workflow, as it will be triggered in the
platform repo.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Added a new workflow to automate triggering subtree updates in an
external platform repository when changes are pushed to the main branch.
- Removed the "Push To Canary" workflow, discontinuing automated
artifact builds, canary deployments, integration tests, and related
notifications for the main branch.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced advanced planning and orchestration capabilities for
offline data processing, including new planners for join and group-by
operations.
- Added utilities for metadata layering and enriched partition
specification handling.
- Introduced a structured approach to offline join planning with
detailed metadata and node composition.
- Added new traits and classes to support batch run contexts and node
execution.
- Added comprehensive table dependency generation based on joins,
group-bys, and sources.

- **Improvements**
- Expanded partitioning metadata in API definitions for richer temporal
semantics.
- Updated orchestration schemas with new node types and renamed entities
for clarity.
- Improved naming conventions by replacing "Keyword" suffixes with
"Folder" across configurations.
- Streamlined internal logic for table and job naming, dependency
resolution, and window operations.
  - Enhanced error handling and logging in table utilities.
- Adjusted snapshot accuracy logic in merge operations for event data
models.
  - Modified tile drift calculation to use a fixed timestamp offset.

- **Bug Fixes**
  - Corrected logic for snapshot accuracy handling in merge operations.

- **Refactor**
- Centralized utility methods for window arithmetic and partition
specification.
  - Consolidated job context parameters in join part jobs.
- Restricted visibility of label join methods for better encapsulation.
- Replaced generic bootstrap job classes with join-specific
implementations.
- Simplified import statements and method signatures for improved
clarity.
- Delegated left source table name computation to join offline planner.

- **Chores**
  - Updated `.gitignore` to exclude additional directories.
- Removed legacy configuration-to-node conversion code and associated
dependency resolver tests.

- **Documentation**
- Improved code comments and formatting for new and existing classes and
methods.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added optional fields for partition format and partition interval to
query definitions, allowing greater flexibility in specifying
partitioning behavior.

- **Refactor**
- Simplified partition specification usage across the platform by
consolidating partition column, format, and interval into a single
object.
- Updated multiple interfaces and methods to derive partition column and
related metadata from the unified partition specification, reducing
explicit parameter passing.
- Streamlined class and method signatures to improve consistency and
maintainability.
- Removed deprecated partition specs and adjusted related logic to use
the updated partition specification format.
- Enhanced SQL clause generation to internally use partition
specification details, removing the need to pass partition column
explicitly.
- Adjusted data generation and query construction logic to rely on the
updated partition specification model.
- Simplified construction and usage of partition specifications in data
processing and metadata components.
- Improved handling of partition specs in Spark-related utilities and
jobs for consistency.

- **Chores**
- Updated tests and internal utilities to align with the new partition
specification structure.
- Reduced test data volume in join tests to optimize test runtime and
resource usage.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Thomas Chow <[email protected]>
)

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Simplified test logic for handling partition dates, making tests rely
on the expected data's partition date.
	- Cleaned up and reordered import statements for improved clarity.
- **Tests**
- Updated test method signatures and calls to streamline date handling
in test comparisons.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Documentation**
- Updated the README with a concise overview and disclaimer about this
repository being a fork of Airbnb’s Chronon.
- Highlighted key differences including additional connectors, upgraded
libraries, performance improvements, and specialized runners.
  - Clarified deployment options and maintenance practices.
- Removed detailed usage instructions, examples, and conceptual
explanations.
- Noted that full documentation is forthcoming and invited users to
contact maintainers for early access.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
## Summary

Adding an additional partitions argument to table deps

Produces a dependency that looks like this:
`"customJson": "{\"airflow_dependencies\": [{\"name\": \"wf_sample_namespace_sample_table\", \"spec\": \"sample_namespace.sample_table/ds={{ ds }}/_HR=23:00\"}]}",`

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added support for specifying additional partition information when
defining table dependencies, allowing for more flexible and detailed
dependency configurations.

- **Tests**
- Updated test cases to include examples with additional partition
specifications in table dependencies.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: ezvz <[email protected]>
## Summary

Add `https://` protocol to open link to
`https://github.com/airbnb/chronon` and not
`https://github.com/zipline-ai/chronon/blob/main/github.com/airbnb/chronon`

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Documentation**
- Updated the README to include the "https://" protocol in the GitHub
URL for Airbnb's Chronon repository.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Sean Lynch <[email protected]>
## Summary

My strategy of using a reusable workflow doesn't work anymore because a
private workflow isn't accessible from a public repo. Instead of
triggering the sync, this simply runs it from here.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Removed the previous canary release workflow, including automated
build, test, and artifact deployment steps for AWS and GCP.
- Introduced a new workflow to automate synchronization of code from the
chronon repository into the platform repository via subtree pull and
push operations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

coderabbitai bot commented May 8, 2025

Walkthrough

This update refactors the partition range utility method from getRangesToFill to getRangeToFill across core Spark join logic and related tests. Partition format handling is improved in data generation, and error checking is made more idiomatic. Test cases are updated for consistency with these changes.

Changes

| File(s) | Change Summary |
| --- | --- |
| `spark/src/main/scala/ai/chronon/spark/Analyzer.scala`<br>`spark/src/main/scala/ai/chronon/spark/JoinBase.scala`<br>`spark/src/main/scala/ai/chronon/spark/JoinUtils.scala` | Replaced all usage of `JoinUtils.getRangesToFill` with `JoinUtils.getRangeToFill`. Updated method name in utility object. Added logic in JoinBase to filter fillable partitions and a `require` check for input partitions. Modified `JoinUtils.leftDf` to translate partition ranges and added implicit TableUtils. Minor formatting fixes. No signature changes except the utility method rename. |
| `spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala` | Changed an assert to a require for input validation when determining partition range start. No other logic changed. |
| `spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala` | Added optional `partitionFormat` parameter to `gen`, `events`, and `entities` methods. Calls now propagate partition format for consistent partitioning. Adjusted internal logic to use the format in partition spec and transformations. |
| `spark/src/test/scala/ai/chronon/spark/test/LocalTableExporterTest.scala` | Updated test to use partition column and partition format from TableUtils when generating DataFrames. Removed redundant TableUtils instantiation. |
| `spark/src/test/scala/ai/chronon/spark/test/join/JoinTest.scala` | Modified test to use explicit partition format "yyyyMMdd" for weight table and query. Changed join configuration to use a deep copy with a shifted date range. |
| `spark/src/test/scala/ai/chronon/spark/test/join/JoinUtilsTest.scala` | Updated tests to call `JoinUtils.getRangeToFill` instead of the old method. No other logic changed. |
| `api/src/main/scala/ai/chronon/api/DataRange.scala` | Added `translate` method to PartitionRange to convert partition range strings between partition specs. |
| `api/src/main/scala/ai/chronon/api/PartitionSpec.scala` | Added `intervalWindow` method to get a Window for day/hour spans, throwing on unsupported intervals. Added `translate` method to convert date strings between partition specs. |
| `spark/src/main/scala/ai/chronon/spark/Extensions.scala` | Added `translatePartitionSpec` method to DataframeOps to rename and reformat partition columns. Modified SourceSparkOps to take implicit TableUtils and added accessors for partition column, format, interval, and spec. Updated imports to support new functionality. |
| `spark/src/main/scala/ai/chronon/spark/GroupBy.scala` | Introduced implicit TableUtils and PartitionSpec in `getIntersectedRange`. Translated query range to source partition spec. Simplified partition conditions construction. Added partition spec translation on source DataFrame. Reformatted logging in `computeBackfill`. Added import for SourceSparkOps. No signature changes. |
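
The `translate` additions in `PartitionSpec` and `DataRange` above boil down to re-rendering a date string from one partition format into another. A self-contained sketch of that idea (the case class below is illustrative, not the actual `ai.chronon.api.PartitionSpec`):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Illustrative stand-in for a partition spec: a column name plus a date format.
case class SimplePartitionSpec(column: String, format: String) {
  private def formatter: DateTimeFormatter = DateTimeFormatter.ofPattern(format)

  // Re-render a date string from this spec's format into another spec's format.
  def translate(date: String, other: SimplePartitionSpec): String =
    LocalDate.parse(date, formatter).format(other.formatter)
}

val globalSpec = SimplePartitionSpec("ds", "yyyy-MM-dd")
val sourceSpec = SimplePartitionSpec("date_key", "yyyyMMdd")

globalSpec.translate("2025-05-08", sourceSpec) // "20250508"
```

Translating a whole range or a DataFrame then amounts to applying the same re-rendering to the range endpoints, or to renaming and reformatting the partition column, respectively.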

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Suite
    participant DataGen as DataFrameGen
    participant Join as Join Logic
    participant Utils as JoinUtils
    participant Table as TableUtils

    Test->>DataGen: gen(..., partitionFormat)
    DataGen->>DataGen: Use partitionFormat in partition spec
    Test->>Join: computeJoinOpt(...)
    Join->>Utils: getRangeToFill(...)
    Utils->>Table: get partition range info
    Join->>Join: Filter fillable partitions
    Join-->>Test: Return join results

Possibly related PRs

Suggested reviewers

  • varant-zlai
  • piyush-zlai

Poem

🐇✨
Partition specs align with grace,
Joins now fill their rightful space.
Formats flow through data streams,
Tests confirm our coding dreams.
Require checks guard the gate,
Cleaner code, we celebrate!
🎉🚂



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between a804788 and c33cbd3.

📒 Files selected for processing (2)
  • spark/src/main/scala/ai/chronon/spark/GroupBy.scala (6 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala
🔇 Additional comments (6)
spark/src/main/scala/ai/chronon/spark/GroupBy.scala (6)

42-42: LGTM: New import for partition spec translation support.


587-590: Good use of implicits for cleaner partition handling.

Making TableUtils and SourcePartitionSpec implicit simplifies downstream code and adds clarity.


592-596: Effective query range usage aligns with source partitioning.

Using translated query range ensures proper partition alignment with source.


641-642: Consistent use of implicit TableUtils.

Consistent with earlier pattern in getIntersectedRange.


671-671: Simplified partition condition generation.

Direct use of whereClauses improves code conciseness.


697-707:

✅ Verification successful

Partition spec translation ensures consistent formats.

Final translatePartitionSpec call ensures DataFrame uses correct partition format.


🏁 Script executed:

#!/bin/bash
# Check for DataFrame translation uses across the codebase
rg -A 2 "translatePartitionSpec" --type scala

Length of output: 830


Partition spec translation consistent across codebase

  • This call mirrors usages in JoinUtils.scala and Extensions.scala, ensuring the DataFrame always has the correct partition format.
  • No further changes needed.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/JoinBase.scala (1)

290-299: Added partition existence validation.

Good addition to filter partitions that actually exist in the source table.

Consider uncommenting the require statement once the implementation is fully tested, or remove it if not needed.

-//    require(
-//      fillableRanges.nonEmpty,
-//      s"""No relevant input partitions present in ${joinConfCloned.left.table}
-//         |on join.left for the requested range ${rangeToFill.start} - ${rangeToFill.end} """.stripMargin
-//    )
+    if (fillableRanges.isEmpty) {
+      logger.warn(s"""No relevant input partitions present in ${joinConfCloned.left.table}
+        |on join.left for the requested range ${rangeToFill.start} - ${rangeToFill.end} """.stripMargin)
+    }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3779e99 and 4ce3bcb.

📒 Files selected for processing (8)
  • spark/src/main/scala/ai/chronon/spark/Analyzer.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinBase.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/LocalTableExporterTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/JoinTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/JoinUtilsTest.scala (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
spark/src/test/scala/ai/chronon/spark/test/LocalTableExporterTest.scala (2)
spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (2)
  • TableUtils (45-584)
  • TableUtils (586-588)
spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (2)
  • DataFrameGen (39-177)
  • gen (41-57)
spark/src/main/scala/ai/chronon/spark/JoinBase.scala (3)
spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (2)
  • JoinUtils (39-532)
  • getRangeToFill (137-166)
spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (1)
  • partitions (131-156)
api/src/main/scala/ai/chronon/api/DataRange.scala (1)
  • partitions (90-95)
🔇 Additional comments (15)
spark/src/main/scala/ai/chronon/spark/Analyzer.scala (1)

267-267: Updated method name to align with codebase standardization.

Method name changed from getRangesToFill to getRangeToFill for consistency across the codebase.

spark/src/test/scala/ai/chronon/spark/test/join/JoinUtilsTest.scala (2)

288-288: Method name updated to match implementation.

Updated test to use renamed method getRangeToFill instead of getRangesToFill.


305-305: Method name updated to match implementation.

Updated test with override parameter to use renamed method getRangeToFill.

spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (1)

336-342: Improved error handling using require instead of assert.

Changed from assertion to requirement validation for better runtime behavior. This is more idiomatic Scala and ensures validation happens even in production environments where assertions might be disabled.
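
A minimal illustration of that distinction (not the project's code): `assert` calls can be elided when the compiler runs with `-Xdisable-assertions`, while `require` always executes and throws `IllegalArgumentException` on bad input.

```scala
// Hypothetical helper just to show the behaviour difference.
def earliestPartition(partitions: Seq[String]): String = {
  require(partitions.nonEmpty, "expected at least one input partition") // always enforced
  // assert(partitions.nonEmpty)  // may be compiled away, silently skipping validation
  partitions.min
}
```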

spark/src/test/scala/ai/chronon/spark/test/LocalTableExporterTest.scala (2)

92-93: Added explicit partition formatting for test data.

Moved TableUtils initialization earlier and explicitly passed partition information to the test data generator for consistency.


105-106: Consistent partition formatting for weight test data.

Using the same TableUtils instance to ensure consistent partition formatting across test data generation.

spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (1)

137-137: Method renaming from getRangesToFill to getRangeToFill.

The rename provides clearer indication that the method returns a single range.

spark/src/main/scala/ai/chronon/spark/JoinBase.scala (1)

134-138: Method call updated to match renamed method.

The method call has been correctly updated to match the renamed method in JoinUtils.

spark/src/test/scala/ai/chronon/spark/test/join/JoinTest.scala (3)

316-319: Added partition format support to DataFrameGen call.

Good addition of explicit partition format for consistent test behavior.


321-324: Added partition format to query.

Ensures partition format consistency between data generation and query execution.


366-370: Uses deep copy for join configuration.

Improved test to properly clone configuration and use future date for testing.

spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (4)

41-49: Enhanced gen method with partition format support.

Good implementation for flexible partition format handling.


64-71: Updated events method to support partition format.

Properly handles partition format in timestamp conversion.


79-86: Updated entities method with partition format support.

Consistent implementation across generator methods.


108-112: Updated mutations method to pass partition format.

Ensures format consistency in mutation test data generation.
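
As a hedged sketch of the optional-format pattern these generator changes describe (the helper name and default below are illustrative, not the real `DataFrameGen` signature):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Callers may override the globally configured partition format per table;
// when they don't, a default format is used.
def partitionValues(start: LocalDate, days: Int, partitionFormat: Option[String] = None): Seq[String] = {
  val fmt = DateTimeFormatter.ofPattern(partitionFormat.getOrElse("yyyy-MM-dd"))
  (0 until days).map(i => start.plusDays(i).format(fmt))
}

partitionValues(LocalDate.of(2025, 5, 1), 3, Some("yyyyMMdd")) // Seq("20250501", "20250502", "20250503")
```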
