
feat: add support for long partition columns #761


Open · wants to merge 438 commits into base: main

Conversation

@nikhil-zlai (Contributor) commented May 10, 2025

Summary

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • Refactor
    • Improved handling of partition columns by applying data type-specific formatting and clearer error reporting for unsupported types during data processing.
  • Tests
    • Added tests to verify correct handling and formatting of partition columns with string and long data types.

varant-zlai and others added 30 commits February 13, 2025 23:23
## Summary

Allow setting partition column name in sources. Maps it to the default
partition name upon read and partition checking.

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enabled configurable partition columns in query, join, and data
generation operations for improved data partitioning.
- **Refactor**
- Streamlined partition handling and consolidated import structures to
enhance workflow efficiency.
- **Tests**
- Added test cases for verifying partition column functionality and
adjusted data generation volumes for better validation.
- Introduced new tests specifically for different partition columns to
ensure accurate handling of partitioned data.

These enhancements provide increased flexibility and accuracy in
managing partitioned datasets during data processing and join
operations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
Co-authored-by: Nikhil Simha <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new query to retrieve purchase records with date range
filtering.
- Enhanced data retrieval by including additional contextual metadata
for improved insights.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Introduced dedicated testing workflows covering multiple system
components to enhance overall reliability.
- Added new test suites for various components to enhance testing
granularity.
- **Refactor**
- Streamlined code organization with improved package structures and
consolidated imports across test modules.
- **Chores**
- Upgraded automated testing configurations with optimized resource
settings for improved performance and stability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Adjusted the test execution timeout setting from a longer duration to
900 seconds to ensure tests complete more promptly.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Thomas Chow <[email protected]>
…nd not pushed to remote (#385)

## Summary
I've been seeing that it's difficult to track what changes went into the artifacts we push to etsy and canary, especially when it comes to tracking performance regressions for Spark jobs from one day to the next.

Adding a check to disallow pushes of any customer artifacts when the branch is dirty; all changes need to at least be pushed to remote. Also adding a metadata tag with the commit and branch.


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Introduced consistency checks during the build and upload process to
verify that local changes are committed and branches are in sync.
- Enhanced artifact metadata now includes additional context about the
code state at the time of upload.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
… top (#380)

## Summary
While trying to read the updated beacon top topic we hit issues because the number of Avro fields is greater than Spark's default codegen limit of 100. As a result, the whole-stage codegen code is incorrect and we either end up with segfaults (unit tests) or garbled events (prod Flink jobs). This PR bumps the limit so we can read beacon top (374 fields) and adds an assert in Catalyst util's whole-stage codegen code to fail if we ever encounter a field count higher than our bumped limit.
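For rough intuition only, here is a minimal sketch of the kind of limit bump described above; the exact value used in the PR and where it is applied are assumptions, not the actual change:

```scala
// Hedged sketch: raise Spark's whole-stage codegen field limit above the default of 100
// so wide Avro schemas (e.g. 374 fields for beacon top) can be read. The chosen value
// and the use of a session conf here are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.codegen.maxFields", "1000") // default is 100
  .getOrCreate()
```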

## Checklist
- [X] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced data processing robustness with improved handling and early
error detection for large schemas.
  - Refined SQL query formatting for clearer logical conditions.

- **Tests**
  - Added a new validation for large schema deserialization.
  - Updated test definitions to improve structure and readability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- Make the thrift gen python executable, use `py_binary` to support
python generally

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Enhanced the build process for a key automation tool by streamlining
its execution and command handling, leading to improved overall build
reliability and performance.
- Transitioned the export mechanism of a Python script to a defined
executable binary target for better integration within the build system.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

Co-authored-by: Thomas Chow <[email protected]>
## Summary
- Release Notes:
https://spark.apache.org/releases/spark-release-3-5-4.html
- https://issues.apache.org/jira/browse/SPARK-49791 is a good one for
us.
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Upgraded underlying Apache Spark libraries to version 3.5.4,
delivering enhanced performance, stability, and compatibility. This
update improves processing efficiency and backend reliability, ensuring
smoother and more secure data operations. End-users may notice more
robust and responsive interactions as a result of these improvements,
further enhancing overall system performance.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Thomas Chow <[email protected]>
## Summary

- Even though I'm eager to get ahead here, let's not go too crazy and
accidentally shoot ourselves in the foot. Let's stay pinned to what our
clusters have (3.5.1) until those upgrade.



## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated core Spark libraries—impacting SQL, Hive, Streaming, and Avro
features—to version 3.5.1 to ensure enhanced stability and improved
integration across Spark-powered functionalities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Grant and I were chatting about the high number of hosts needed for the beacon top Flink jobs (24). This is because the topic parallelism is 96 and we squeeze 4 slots per TM (so 96 / 4 = 24 hosts). Given that folks often over-provision Kafka topics in terms of partitions, we're going with a default of scaling down by 1/4. Will look into wiring up Flink autoscaling as a follow-up so this isn't hardcoded.
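For intuition, a minimal sketch of the scale-down arithmetic described above (variable names and rounding are illustrative, not the PR's code):

```scala
// Sketch: derive Flink job parallelism from the Kafka partition count with a 1/4 scale-down.
val kafkaPartitions = 96
val scaleFactor = 0.25          // default scale-down of 1/4
val slotsPerTaskManager = 4

val parallelism = math.max(1, math.ceil(kafkaPartitions * scaleFactor).toInt)   // 24
val taskManagers = math.ceil(parallelism.toDouble / slotsPerTaskManager).toInt  // 6 hosts instead of 24
```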

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Optimized stream processing by refining the parallelism calculation.
The system now applies a scaling factor to better adjust the number of
active processing units, which may result in improved efficiency under
certain conditions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Documentation**
- Clarified command instructions and added informational notes to set
expectations during initial builds.

- **New Features**
- Introduced new build options for modular construction of components,
including dedicated commands for hub and cloud modules.
  - Added an automated script to streamline the frontend build process.

- **Chores**
- Updated container setup and startup processes to utilize revised
deployment artifacts.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
- trim down tableutils
- add iceberg runtime dependency to cloud_gcp
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
  - Added a runtime dependency to enhance Spark processing.
  - Introduced a consolidated method for computing partition ranges.

- **Refactor**
- Streamlined import sections and simplified join analysis by removing
redundant permission checks.
  
- **Bug Fixes**
- Removed methods related to table permission checks, impacting access
control functionality.

- **Tests**
  - Removed an outdated test for table permission verification.
  
- **Chores**
- Updated the project’s dependency configuration to include the new
Spark runtime artifact.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Changed the backend code to compute only 3 percentiles (p5, p50, p95) to return to the frontend.
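For illustration, a hedged sketch of how requested percentile labels could be mapped onto a full percentile breakdown; the helper names and the 5% step assumption are hypothetical, not the backend's actual code:

```scala
// Sketch: filter a full percentile array down to the requested labels (p5, p50, p95).
// Assumes the backend computes percentiles in 5% steps (p0, p5, ..., p100).
def percentileIndex(label: String): Int =
  label.stripPrefix("p").toInt / 5

def filterPercentiles(all: Array[Double],
                      requested: Seq[String] = Seq("p5", "p50", "p95")): Array[Double] =
  requested.map(label => all(percentileIndex(label))).toArray
```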

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Enhanced statistical data processing to consistently handle cases with
missing values by using a robust placeholder, ensuring clearer
downstream analytics.
- Adjusted the percentile chart configuration so that the 95th, 50th,
and 5th percentiles are accurately rendered, providing more reliable
insights for users.
- Relaxed the null ratio validation in summary data, allowing for a
broader acceptance of null values, which may affect drift metric
interpretations.

- **New Features**
- Introduced methods for converting percentile strings to index values
and filtering percentiles based on user-defined requests, improving data
handling and representation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Changes to support builds/tests with both Scala 2.12 and 2.13. By default we build against 2.12; pass the "--config scala_2.13" option to "bazel build/test" to override it.

ScalaFmt seems to be breaking for 2.13 with the bazel rules_scala package; the [fix](bazel-contrib/rules_scala#1631) is already deployed but a release with that change is not available yet, so ScalaFmt checks are temporarily disabled for 2.13 and will be enabled once the fix is released.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Enabled flexible Scala version selection (2.12 and 2.13) for smoother
builds and enhanced compatibility.
- Introduced a default Scala version constant and a repository rule for
improved version management.
- Added support for additional Scala 2.13 dependencies in the build
configuration.

- **Refactor and Improvements**
- Streamlined build and dependency management for increased stability
and performance.
- Consolidated collection conversion utilities to boost reliability in
tests and runtime processing.
- Enhanced type safety and clarity in collection handling across various
modules.
- Improved handling of Scala collections and maps throughout the
codebase for better type consistency and safety.
- Updated method implementations to ensure explicit type conversions,
enhancing clarity and preventing runtime errors.
- Modified method signatures and internal logic to utilize `Seq` for
improved type clarity and consistency.
- Enhanced the `maven_artifact` function to accept an optional version
parameter for better dependency management.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- #381 introduced the ability
to configure a partition column at the node-level. This PR simply fixes
a missed spot on the plumbing of the new StagingQuery attribute.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced the query builder to support specifying a partition column,
providing greater customization for query formation and partitioning.
- **Improvements**
- Improved handling of partition columns by introducing a fallback
mechanism to ensure valid values are used when necessary.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
To add CI checks for making sure we are able to build and test all
modules on both scala 2.12 and 2.13 versions.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Updated automated testing workflows to support Scala 2.12 and added
new workflows for Scala 2.13, ensuring consistent testing for both Spark
and non-Spark modules.

- **Documentation**
- Enhanced build instructions with updated commands for creating Uber
Jars and new automation shortcuts to streamline code formatting,
committing, and pushing changes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Added pinning support for both our maven and spark repositories so we don't have to resolve them during builds.

Going forward, whenever we make updates to the artifacts in either the maven or spark repositories, we need to re-pin the changed repos using the following commands and check in the updated JSON files.

```
REPIN=1 bazel run @maven//:pin
REPIN=1 bazel run @spark//:pin
```

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Integrated enhanced repository management for Maven and Spark,
providing improved dependency installation.
- Added support for JSON configuration files for Maven and Spark
installations.

- **Chores**
- Updated documentation to include instructions on pinning Maven
artifacts and managing dependency versions effectively.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
A VSCode plugin for feature authoring that detects errors and uses data sampling to speed up the iteration cycle. The goal is to reduce the amount of command memorization, typing / clicking, and waiting for clusters to spin up and jobs to finish.

In this example, we have a complex expression operating on nested data.
The eval button appears above Chronon types.

When you click on the Eval button, it samples your data, runs your code
and shows errors or transformed result within seconds.



![zipline_vscode_plugin](https://github.com/user-attachments/assets/5ac56764-f6e7-4998-b5aa-1f4cabde42f9)


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested (see above)
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new Visual Studio Code extension that enhances Python
development.
- The extension displays an evaluation button alongside specific
assignment statements in Python files, allowing users to trigger
evaluation commands directly in the terminal.
- Added a command to execute evaluation actions related to Zipline AI
configurations.
  
- **Documentation**
  - Added a new LICENSE file containing the MIT License text.
  
- **Configuration**
- Introduced new configuration files for TypeScript and Webpack to
support the extension's development and build processes.
  
- **Exclusions**
- Updated `.gitignore` and added `.vscodeignore` to streamline version
control and packaging processes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Moved scala dependencies to separate scala_2_12 and scala_2_13
repositories so we can load the right repo based on config instead of
loading both.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Summary by CodeRabbit

- **Chores**
- Upgraded Scala dependencies to newer versions with updated
verification, ensuring improved stability.
- Removed outdated package references to streamline dependency
management.
- Introduced new repository configurations for Scala 2.12 and 2.13 to
enhance dependency management.
- Added `.gitignore` entry to exclude `node_modules` in the
`authoring/vscode` path.
  - Created `LICENSE` file with MIT License text for the new extension.
  
- **New Features**
- Introduced a Visual Studio Code extension with a CodeLens provider for
Python files, allowing users to evaluate variables directly in the
editor.

- **Refactor**
- Updated dependency declarations to utilize a new method for handling
Scala artifacts, improving consistency across the project.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Nikhil Simha <[email protected]>
## Summary
Adds AWS build and push commands to the distribution script.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
  - Introduced an automated quickstart process for GCP deployments.
- Enhanced the build and upload tool with flexible command-line options,
supporting artifact creation for both AWS and GCP environments.
  - Added a new script for running the Zipline quickstart on GCP.

- **Refactor**
  - Updated the AWS quickstart process to ensure consistent execution.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…FilePath and replacing `/` to `.` in MetaData names (#398)

## Summary

^^^

Tested on the etsy laptop.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Improved error handling to explicitly report when configuration values
are missing.
- **New Features**
- Introduced standardized constants for various configuration types,
ensuring consistent key naming.
- **Refactor**
- Unified metadata processing by using direct metadata names instead of
file paths.
- Enhanced type safety in configuration options for clearer and more
reliable behavior.
- **Tests**
- Updated test cases and parameters to reflect the improved metadata and
configuration handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Reverts #373

Passing in options to push to only one customer is broken.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Streamlined the deployment process to automatically build and upload
artifacts exclusively to Google Cloud Platform.
- Removed configuration options and handling for an alternative cloud
provider, resulting in a simpler, more focused workflow.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
building join output schema should belong to metadata store - and also
reduces the size of fetcher.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced an optimized caching mechanism for data join operations,
resulting in improved performance and reliability.
- Added new methods to facilitate the creation and management of join
codecs.
  
- **Bug Fixes**
- Enhanced error handling for join codec operations, ensuring clearer
context for failures.
  
- **Documentation**
- Improved code readability and clarity through updated comments and
method signatures.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Add support to run the fetcher service in docker. Also add rails to
publish to docker hub as a private image -
[ziplineai/chronon-fetcher](https://hub.docker.com/repository/docker/ziplineai/chronon-fetcher)

I wasn't able to sort out logback / log4j2 logging as there are a lot of deps messing things up. Vert.x supports JUL configs and that is seemingly working, so starting with that for now.

Tested with:
```
docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json \
 -p 9000:9000 \
 -e "GCP_PROJECT_ID=canary-443022" \
 -e "GOOGLE_CLOUD_PROJECT=canary-443022" \
 -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance" \
 -e "STATSD_HOST=127.0.0.1" \
 -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json \
 ziplineai/chronon-fetcher
```

And then you can `curl http://localhost:9000/ping`

On the Etsy side, just swap out the project and BT instance ID, and then you can curl the actual join:
```
curl -X POST http://localhost:9000/v1/fetch/join/search.ranking.v1_web_zipline_cdc_and_beacon_external -H 'Content-Type: application/json' -d '[{"listing_id":"632126370","shop_id":"53908089","shipping_profile_id":"235561688531"}]'
{"results":[{"status":"Success","entityKeys":{"listing_id":"632126370","shop_id":"53908089","shipping_profile_id":"235561688531"},"features":{...
```

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added an automation script that streamlines the container image build
and publication process with improved error handling.
- Introduced a new container configuration that installs essential
dependencies, sets environment variables, and incorporates a health
check for enhanced reliability.
- Implemented a robust logging setup that standardizes console and file
outputs with log rotation.
- Provided a startup script for the service that verifies required
settings and applies platform-specific options for seamless execution.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Adds the ability to push artifacts to aws in addition to gcp. Also adds
ability to specify specific customer ids to push to.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new automation script that streamlines the process of
building artifacts and deploying them to both AWS and GCP with improved
error handling and user confirmation.

- **Chores**
- Removed a legacy artifact upload script that previously handled only
GCP deployments.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- Supporting StagingQueries for configurable compute engines. To support BigQuery, the simplest way is to just write BigQuery SQL and run it on BQ to create the final table. Let's first make the API change.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Summary by CodeRabbit

- **New Features**
- Added an option for users to specify the compute engine when
processing queries, offering choices such as Spark and BigQuery.
- Introduced validation to ensure that queries run only with the
designated engine.

- **Style**
  - Streamlined code organization for enhanced readability.
  - Consolidated and reordered import statements for improved clarity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Fetcher has grown over time into a large file with many large functions that are hard to work with. This refactoring doesn't change any functionality - just placement.

- Made some of the Scala code more idiomatic - e.g. `if (try.isFailed)` vs `try.recoverWith` (see the sketch below)
- Made Metadata methods more explicit
- Split FetcherBase into JoinPartFetcher + GroupByFetcher + GroupByResponseHandler
- Added a fetch context to replace 10 constructor params
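A tiny illustrative comparison of the two styles mentioned above (not code from the PR; the fetch call is a stand-in):

```scala
// Sketch: explicit failure check vs. recoverWith on scala.util.Try.
import scala.util.{Failure, Success, Try}

def fetchValue(): Try[String] = Try(sys.error("kv store unavailable"))

// explicit branch on failure
val checked = fetchValue()
if (checked.isFailure) println(s"fetch failed: ${checked.failed.get.getMessage}")

// recoverWith: chain a fallback computation instead of branching
val recovered: Try[String] = fetchValue().recoverWith { case _ => Success("default-value") }
```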


## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit


- **New Features**
- Introduced a unified configuration context that enhances data
fetching, including improved group-by and join operations with more
robust error handling.
- Added a new `FetchContext` class to manage fetching operations and
execution contexts.
- Implemented a new `GroupByFetcher` class for efficient group-by data
retrieval.
- **Refactor**
- Upgraded serialization and deserialization to use a more efficient,
compact protocol.
- Standardized API definitions and type declarations across modules to
improve clarity and maintainability.
- Enhanced error handling in various methods to provide more informative
messages.
- **Chores**
	- Removed outdated utilities and reorganized dependency imports.
	- Updated test suites to align with the refactored architecture.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

- Staging query should in theory already work for external tables
without additional code changes as long as we do some setup work to pin
up a view first.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
The existing aggregations configure the items sketch incorrectly. Split it into two: one that works purely with skewed data, and one that tries to best-effort collect the most frequent items.
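For context, a hedged sketch of the two flavours using the Apache DataSketches frequent-items API; the package path and parameters are assumptions and this is not the PR's aggregation code:

```scala
// Sketch: "heavy hitters" on skewed data vs. best-effort most-frequent items.
import org.apache.datasketches.frequencies.{ErrorType, ItemsSketch}

val sketch = new ItemsSketch[String](512) // map size must be a power of two
Seq("a", "a", "a", "b", "c").foreach(item => sketch.update(item))

// skewed data: only report items guaranteed to be frequent (no false positives)
val heavyHitters = sketch.getFrequentItems(ErrorType.NO_FALSE_POSITIVES)

// best-effort frequent items: never miss a frequent item, may include false positives
val bestEffort = sketch.getFrequentItems(ErrorType.NO_FALSE_NEGATIVES)
```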

## Checklist
- [x] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced new utility functions to streamline expression composition
and cleanup.
  - Enhanced aggregation descriptions for clearer operation choices.
  - Added new aggregation types for improved data analysis.

- **Refactor**
- Revamped frequency analysis logic with improved error handling and
optimized sizing.
- Replaced legacy histogram approaches with a more robust frequent item
detection mechanism.

- **Tests**
- Added tests to validate heavy hitter detection and skewed data
scenarios, while removing obsolete histogram tests.
  - Updated existing tests to reflect changes in aggregation parameters.

- **Chores**
  - Removed deprecated interactive modules for a leaner deployment.

- **Configuration**
- Adjusted default aggregation parameters for more consistent
processing, including changes to the `k` value in multiple
configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
chewy-zlai and others added 20 commits May 6, 2025 16:05
## Summary

Create workflow to trigger platform subtree pull reusable workflow.

Also deletes Push To Canary workflow as it will be triggered in the
platform repo.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Added a new workflow to automate triggering subtree updates in an
external platform repository when changes are pushed to the main branch.
- Removed the "Push To Canary" workflow, discontinuing automated
artifact builds, canary deployments, integration tests, and related
notifications for the main branch.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced advanced planning and orchestration capabilities for
offline data processing, including new planners for join and group-by
operations.
- Added utilities for metadata layering and enriched partition
specification handling.
- Introduced a structured approach to offline join planning with
detailed metadata and node composition.
- Added new traits and classes to support batch run contexts and node
execution.
- Added comprehensive table dependency generation based on joins,
group-bys, and sources.

- **Improvements**
- Expanded partitioning metadata in API definitions for richer temporal
semantics.
- Updated orchestration schemas with new node types and renamed entities
for clarity.
- Improved naming conventions by replacing "Keyword" suffixes with
"Folder" across configurations.
- Streamlined internal logic for table and job naming, dependency
resolution, and window operations.
  - Enhanced error handling and logging in table utilities.
- Adjusted snapshot accuracy logic in merge operations for event data
models.
  - Modified tile drift calculation to use a fixed timestamp offset.

- **Bug Fixes**
  - Corrected logic for snapshot accuracy handling in merge operations.

- **Refactor**
- Centralized utility methods for window arithmetic and partition
specification.
  - Consolidated job context parameters in join part jobs.
- Restricted visibility of label join methods for better encapsulation.
- Replaced generic bootstrap job classes with join-specific
implementations.
- Simplified import statements and method signatures for improved
clarity.
- Delegated left source table name computation to join offline planner.

- **Chores**
  - Updated `.gitignore` to exclude additional directories.
- Removed legacy configuration-to-node conversion code and associated
dependency resolver tests.

- **Documentation**
- Improved code comments and formatting for new and existing classes and
methods.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added optional fields for partition format and partition interval to
query definitions, allowing greater flexibility in specifying
partitioning behavior.

- **Refactor**
- Simplified partition specification usage across the platform by
consolidating partition column, format, and interval into a single
object.
- Updated multiple interfaces and methods to derive partition column and
related metadata from the unified partition specification, reducing
explicit parameter passing.
- Streamlined class and method signatures to improve consistency and
maintainability.
- Removed deprecated partition specs and adjusted related logic to use
the updated partition specification format.
- Enhanced SQL clause generation to internally use partition
specification details, removing the need to pass partition column
explicitly.
- Adjusted data generation and query construction logic to rely on the
updated partition specification model.
- Simplified construction and usage of partition specifications in data
processing and metadata components.
- Improved handling of partition specs in Spark-related utilities and
jobs for consistency.

- **Chores**
- Updated tests and internal utilities to align with the new partition
specification structure.
- Reduced test data volume in join tests to optimize test runtime and
resource usage.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Thomas Chow <[email protected]>
)

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Simplified test logic for handling partition dates, making tests rely
on the expected data's partition date.
	- Cleaned up and reordered import statements for improved clarity.
- **Tests**
- Updated test method signatures and calls to streamline date handling
in test comparisons.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Documentation**
- Updated the README with a concise overview and disclaimer about this
repository being a fork of Airbnb’s Chronon.
- Highlighted key differences including additional connectors, upgraded
libraries, performance improvements, and specialized runners.
  - Clarified deployment options and maintenance practices.
- Removed detailed usage instructions, examples, and conceptual
explanations.
- Noted that full documentation is forthcoming and invited users to
contact maintainers for early access.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
## Summary

Adding an additional partitions argument to table deps

Produces dependency that looks like this: `"customJson":
"{\"airflow_dependencies\": [{\"name\":
\"wf_sample_namespace_sample_table\", \"spec\":
\"sample_namespace.sample_table/ds={{ ds }}/_HR=23:00\"}]}",`

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added support for specifying additional partition information when
defining table dependencies, allowing for more flexible and detailed
dependency configurations.

- **Tests**
- Updated test cases to include examples with additional partition
specifications in table dependencies.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: ezvz <[email protected]>
## Summary

Add `https://` protocol to open link to
`https://github.com/airbnb/chronon` and not
`https://github.com/zipline-ai/chronon/blob/main/github.com/airbnb/chronon`

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Documentation**
- Updated the README to include the "https://" protocol in the GitHub
URL for Airbnb's Chronon repository.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Sean Lynch <[email protected]>
## Summary

My strategy of using a reusable workflow doesn't work anymore because a private workflow isn't accessible from a public repo. Instead of triggering the sync, this simply runs it from here.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Removed the previous canary release workflow, including automated
build, test, and artifact deployment steps for AWS and GCP.
- Introduced a new workflow to automate synchronization of code from the
chronon repository into the platform repository via subtree pull and
push operations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Add partition_format to python Query object

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added support for specifying a partition format when using the Query
function.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Add additional sub partitions to wait for in Query to ultimately compile
into GroupBy and Join

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added support for specifying additional sub-partitions when defining
data dependencies, allowing for more granular detection of data arrival.
- **Documentation**
- Updated function and parameter documentation to reflect new options
for specifying sub-partitions in queries.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
## Summary

As we will be releasing from chronon, this change brings back the canary
build and testing.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Introduced a new automated workflow for continuous integration and
deployment to canary environments on AWS and GCP.
- Added integration tests and artifact uploads for both platforms, with
Slack notifications for build or test failures.
- Enhanced artifact tracking with detailed metadata and automated
cleanup after deployment.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update


<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Style**
  - Reorganized import statements for improved readability.

- **Chores**
- Removed debugging print statements from partition insertion to clean
up console output.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: thomaschow <[email protected]>
## Summary

Run push_to_platform on pull request merge only. Also use default
message


## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated workflow to run only after a pull request is merged into the
main branch, instead of on every push.
- Adjusted the commit message behavior for subtree updates to use the
default message.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Removed the synthetic dataset generation script for browser and device
fingerprinting data.
- Removed related test configurations and documentation for AWS Zipline
and Plaid data processing.
- Updated AWS release workflow to exclude the "plaid" customer ID from
S3 uploads.
- Cleaned up commented-out AWS S3 and Glue deletion commands in
deployment scripts.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: thomaschow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Removed references to "etsy" as a customer ID from workflows, scripts,
and documentation.
- Deleted test and configuration files related to "etsy" and sample
teams.
- Updated Avro schema namespaces and default values from "com.etsy" to
"com.customer" and related URLs.
	- Improved indentation and formatting in sample configuration files.
- **Tests**
- Updated test arguments and removed obsolete test data related to
"etsy".

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…-passing-candidate to line up with publish_release (#760)

## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated storage paths for artifact uploads to cloud storage in
deployment workflows.

- **Documentation**
- Corrected a type annotation in the documentation for a query
parameter.

- **Tests**
  - Enhanced a test to include and verify a new query parameter.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

coderabbitai bot commented May 10, 2025

Walkthrough

The change refactors the scanDfBase method in TableUtils.scala to handle partition column type conversions explicitly. It now applies type-specific transformations or throws an error for unsupported types, instead of always using date_format, before coalescing the DataFrame.

Changes

| File | Change Summary |
| --- | --- |
| spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala | Refactored scanDfBase to inspect partition column type, apply type-specific formatting, and error on unsupported types. No public API changes. |
| spark/src/test/scala/ai/chronon/spark/test/TableUtilsTest.scala | Added tests verifying partition column handling for string and long types, including data insertion and validation with cleanup. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant TableUtils

    Caller->>TableUtils: scanDfBase(df, partitionCol, config)
    alt Partition column exists
        TableUtils->>TableUtils: Inspect partition column type
        alt DateType/StringType/TimestampType
            TableUtils->>TableUtils: Apply date_format
        else LongType
            TableUtils->>TableUtils: Apply from_unixtime
        else Unsupported type
            TableUtils->>Caller: Throw UnsupportedOperationException
        end
    else Partition column missing
        TableUtils->>TableUtils: Leave DataFrame unchanged
    end
    TableUtils->>TableUtils: Coalesce DataFrame
    TableUtils-->>Caller: Return DataFrame
```
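A condensed sketch of the behaviour the walkthrough and diagram describe; the real change lives in TableUtils.scanDfBase, and the function name, epoch-millisecond assumption, and output format here are illustrative, not the exact implementation:

```scala
// Sketch: normalize a partition column to a formatted date string based on its data type.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, date_format, from_unixtime}
import org.apache.spark.sql.types._

def normalizePartitionColumn(df: DataFrame,
                             partitionColumn: String,
                             format: String = "yyyy-MM-dd"): DataFrame = {
  val partitionFieldType = df.schema.find(_.name == partitionColumn).map(_.dataType)
  partitionFieldType match {
    case Some(StringType) | Some(DateType) | Some(TimestampType) =>
      df.withColumn(partitionColumn, date_format(col(partitionColumn), format))
    case Some(LongType) => // assume epoch milliseconds, as in the test's 1747068965000L
      df.withColumn(partitionColumn, from_unixtime(col(partitionColumn) / 1000, format))
    case Some(other) =>
      throw new UnsupportedOperationException(s"Unsupported partition column type: $other")
    case None => df // partition column missing: leave the DataFrame unchanged
  }
}
```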

Possibly related PRs

  • feat: read and write date types #192: Modifies repartitionAndWriteInternal to convert partition columns to date format before repartitioning, related to partition column date/time handling.

Suggested reviewers

  • nikhil-zlai
  • piyush-zlai

Poem

Partition types now checked with care,
No more silent errors lurking there.
Dates, longs, and strings—each gets its due,
Unsupported? We’ll shout at you!
Coalesced and tidy, the DataFrame flows—
Onward, TableUtils, as logic grows!
🗂️✨


```scala
}).coalesce(coalesceFactor * parallelism)
val partitionFieldType = df.schema.find(_.name == partitionColumn).map(_.dataType)

val adjustedDf = partitionFieldType match {
```
@tchow-zlai (Collaborator) commented May 10, 2025

Unfortunately, now that we use exports, the export will just map the timestamps to long type 😂. Technically the BQ connector will handle these schema mappings correctly so that we don't have to handle this ourselves. This came about because of the new export flow, I guess.

@nikhil-zlai (Contributor, Author) replied:

yeah, I realized that is what must be going on. bq be crazy

nikhil-zlai and others added 4 commits May 11, 2025 08:18
…mapping (#728)

## Summary

Updating the JoinSchemaResponse to include a mapping from feature -> listing key. This PR adds a value info case class to the JoinSchemaResponse with these details.
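A hypothetical shape of the value-info metadata described here, based only on the release notes below; the class and field names are assumptions, not the actual API:

```scala
// Sketch: per-value metadata attached to the join schema response.
case class ValueInfo(featureName: String,
                     groupName: String,
                     prefix: String,
                     leftKeys: Seq[String],
                     schemaString: String)

case class JoinSchemaResponse(joinName: String,
                              keySchema: String,
                              valueSchema: String,
                              valueInfos: Seq[ValueInfo])
```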

## Checklist
- [X] Added Unit Tests
- [X] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Summary by CodeRabbit

- **New Features**
- Added detailed metadata for join value fields, including feature
names, group names, prefixes, left keys, and schema descriptions, now
available in join schema responses.
- **Bug Fixes**
- Improved consistency and validation between join configuration keys
and value field metadata.
- **Tests**
- Enhanced and added tests to validate the presence and correctness of
value field metadata in join schema responses.
- Introduced new test suites covering fetcher failure scenarios and
metadata store functionality.
- Refactored existing fetcher tests to use external utility methods for
data generation.
- Added utility methods for generating deterministic, random, and
event-only test data configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
spark/src/test/scala/ai/chronon/spark/test/TableUtilsTest.scala (3)

648-696: Good implementation of long partition column testing.

Test successfully verifies long timestamps are converted to formatted date strings.

Consider adding more test cases for edge dates:

```diff
  List(
    Row(1L, 2, 1747068965000L)
+   // Add tests for edge cases like:
+   // Row(1L, 2, 0L),                 // Epoch start
+   // Row(1L, 2, 32503680000000L)     // Year 3000
  )
```

649-650: Use unique table names between tests.

Both tests use identical table names which could cause issues if cleanup fails.

```diff
-val tableName = "db.test_partition_column_types"
+val tableName = "db.test_partition_column_long_types"
```

665-666: Add comment explaining timestamp value.

Document why this specific timestamp was chosen and what date it represents.

```diff
        List(
-         Row(1L, 2, 1747068965000L)
+         Row(1L, 2, 1747068965000L) // 2025-05-12 in milliseconds since epoch
        )
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 9257a91 and c1db037.

📒 Files selected for processing (2)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/TableUtilsTest.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala
⏰ Context from checks skipped due to timeout of 90000ms (16)
  • GitHub Check: streaming_tests
  • GitHub Check: groupby_tests
  • GitHub Check: streaming_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: join_tests
  • GitHub Check: join_tests
  • GitHub Check: batch_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: groupby_tests
  • GitHub Check: spark_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: batch_tests
  • GitHub Check: spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (1)
spark/src/test/scala/ai/chronon/spark/test/TableUtilsTest.scala (1)

617-646: LGTM! Test verifies string partition columns work correctly.

Basic test ensures string partition columns function properly.
