feat: add support for long partition columns #761
base: main
Conversation
## Summary

Allow setting the partition column name in sources. The column is mapped to the default partition name upon read and during partition checking.

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **New Features**
  - Enabled configurable partition columns in query, join, and data generation operations for improved data partitioning.
- **Refactor**
  - Streamlined partition handling and consolidated import structures to enhance workflow efficiency.
- **Tests**
  - Added test cases for verifying partition column functionality and adjusted data generation volumes for better validation.
  - Introduced new tests specifically for different partition columns to ensure accurate handling of partitioned data.

These enhancements provide increased flexibility and accuracy in managing partitioned datasets during data processing and join operations.

---------

Co-authored-by: ezvz <[email protected]>
Co-authored-by: Nikhil Simha <[email protected]>
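For context, a minimal sketch of what "map the source's partition column to the default partition name upon read" could look like in Spark. The helper name `normalizePartitionColumn`, the column names, and the default `"ds"` are illustrative assumptions, not the repository's actual API:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: rename a source-specific partition column to the
// job-wide default so downstream partition checks see a consistent name.
def normalizePartitionColumn(df: DataFrame,
                             sourcePartitionColumn: String,
                             defaultPartitionColumn: String = "ds"): DataFrame =
  if (sourcePartitionColumn == defaultPartitionColumn) df
  else df.withColumnRenamed(sourcePartitionColumn, defaultPartitionColumn)
```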
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new query to retrieve purchase records with date range filtering. - Enhanced data retrieval by including additional contextual metadata for improved insights. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced dedicated testing workflows covering multiple system components to enhance overall reliability. - Added new test suites for various components to enhance testing granularity. - **Refactor** - Streamlined code organization with improved package structures and consolidated imports across test modules. - **Chores** - Upgraded automated testing configurations with optimized resource settings for improved performance and stability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Adjusted the test execution timeout setting from a longer duration to 900 seconds to ensure tests complete more promptly. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Thomas Chow <[email protected]>
…nd not pushed to remote (#385) ## Summary I've been seeing that it's difficult to track what changes went into artifacts we push to etsy and canary. Especially when it comes to tracking performance regressions for spark jobs one day to the next. Adding a check to not allow any pushes to any customer artifacts if the branch is dirty. All changes need to at least be pushed to remote. And adding a metadata tag of the commit and branch ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced consistency checks during the build and upload process to verify that local changes are committed and branches are in sync. - Enhanced artifact metadata now includes additional context about the code state at the time of upload. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
… top (#380) ## Summary While trying to read the updated beacon top topic we hit issues as the number of avro fields is greater than the Spark codegen limit default of 100. Thanks to this the wholestage codegen code is incorrect and we either end up with segfaults (unit tests) or garbled events (prod flink jobs). This PR bumps the limit to allow us to read beacon top (374 fields) as well as adds an assert in Catalyst util's whole stage code gen code to fail if we encounter this again in the future for a higher number of fields than our current bumped limit. ## Checklist - [X] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced data processing robustness with improved handling and early error detection for large schemas. - Refined SQL query formatting for clearer logical conditions. - **Tests** - Added a new validation for large schema deserialization. - Updated test definitions to improve structure and readability. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - Make the thrift gen python executable, use `py_binary` to support python generally ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Enhanced the build process for a key automation tool by streamlining its execution and command handling, leading to improved overall build reliability and performance. - Transitioned the export mechanism of a Python script to a defined executable binary target for better integration within the build system. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> Co-authored-by: Thomas Chow <[email protected]>
## Summary - Release Notes: https://spark.apache.org/releases/spark-release-3-5-4.html - https://issues.apache.org/jira/browse/SPARK-49791 is a good one for us. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Upgraded underlying Apache Spark libraries to version 3.5.4, delivering enhanced performance, stability, and compatibility. This update improves processing efficiency and backend reliability, ensuring smoother and more secure data operations. End-users may notice more robust and responsive interactions as a result of these improvements, further enhancing overall system performance. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Thomas Chow <[email protected]>
## Summary - Even though I'm eager to get ahead here, let's not go too crazy and accidentally shoot ourselves in the foot. Let's stay pinned to what our clusters have (3.5.1) until those upgrade. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Updated core Spark libraries—impacting SQL, Hive, Streaming, and Avro features—to version 3.5.1 to ensure enhanced stability and improved integration across Spark-powered functionalities. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Grant and I were chatting about the high number of hosts needed for the beacon top Flink jobs (24). This is because the topic parallelism is 96 and we squeeze 4 slots per TM (so 96 / 4 = 24 hosts). Given that folks often over provision Kafka topics in terms of partitions, going with a default of scaling down by 1/4th. Will look into wiring up Flink autoscaling as a follow up to not have this hardcoded. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Refactor** - Optimized stream processing by refining the parallelism calculation. The system now applies a scaling factor to better adjust the number of active processing units, which may result in improved efficiency under certain conditions. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Documentation** - Clarified command instructions and added informational notes to set expectations during initial builds. - **New Features** - Introduced new build options for modular construction of components, including dedicated commands for hub and cloud modules. - Added an automated script to streamline the frontend build process. - **Chores** - Updated container setup and startup processes to utilize revised deployment artifacts. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - trim down tableutils - add iceberg runtime dependency to cloud_gcp ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added a runtime dependency to enhance Spark processing. - Introduced a consolidated method for computing partition ranges. - **Refactor** - Streamlined import sections and simplified join analysis by removing redundant permission checks. - **Bug Fixes** - Removed methods related to table permission checks, impacting access control functionality. - **Tests** - Removed an outdated test for table permission verification. - **Chores** - Updated the project’s dependency configuration to include the new Spark runtime artifact. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary Changed the backend code to only compute 3 percentiles (p5, p50, p95) for returning to the frontend. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Enhanced statistical data processing to consistently handle cases with missing values by using a robust placeholder, ensuring clearer downstream analytics. - Adjusted the percentile chart configuration so that the 95th, 50th, and 5th percentiles are accurately rendered, providing more reliable insights for users. - Relaxed the null ratio validation in summary data, allowing for a broader acceptance of null values, which may affect drift metric interpretations. - **New Features** - Introduced methods for converting percentile strings to index values and filtering percentiles based on user-defined requests, improving data handling and representation. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Changes to support builds/tests with both scala 2.12 and 2.13 versions. By default we build against 2.12 version, pass "--config scala_2.13" option to "bazel build/test" to override it. ScalaFmt seems to be breaking for 2.13 using bazel rules_scala package, [fix](bazel-contrib/rules_scala#1631) is already deployed but a release with that change is not available yet, so temporarily disabled ScalaFmt checks for 2.13 will enable later once the fix is released. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enabled flexible Scala version selection (2.12 and 2.13) for smoother builds and enhanced compatibility. - Introduced a default Scala version constant and a repository rule for improved version management. - Added support for additional Scala 2.13 dependencies in the build configuration. - **Refactor and Improvements** - Streamlined build and dependency management for increased stability and performance. - Consolidated collection conversion utilities to boost reliability in tests and runtime processing. - Enhanced type safety and clarity in collection handling across various modules. - Improved handling of Scala collections and maps throughout the codebase for better type consistency and safety. - Updated method implementations to ensure explicit type conversions, enhancing clarity and preventing runtime errors. - Modified method signatures and internal logic to utilize `Seq` for improved type clarity and consistency. - Enhanced the `maven_artifact` function to accept an optional version parameter for better dependency management. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - #381 introduced the ability to configure a partition column at the node-level. This PR simply fixes a missed spot on the plumbing of the new StagingQuery attribute. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced the query builder to support specifying a partition column, providing greater customization for query formation and partitioning. - **Improvements** - Improved handling of partition columns by introducing a fallback mechanism to ensure valid values are used when necessary. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary To add CI checks for making sure we are able to build and test all modules on both scala 2.12 and 2.13 versions. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Updated automated testing workflows to support Scala 2.12 and added new workflows for Scala 2.13, ensuring consistent testing for both Spark and non-Spark modules. - **Documentation** - Enhanced build instructions with updated commands for creating Uber Jars and new automation shortcuts to streamline code formatting, committing, and pushing changes. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Added pinning support for both our maven and spark repositories so we don't have to resolve them during builds. Going forward whenever we make any updates to the artifacts in either maven or spark repositories, we would need to re-pin the changed repos using following commands and check-in the updated json files. ``` REPIN=1 bazel run @maven//:pin REPIN=1 bazel run @spark//:pin ``` ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Integrated enhanced repository management for Maven and Spark, providing improved dependency installation. - Added support for JSON configuration files for Maven and Spark installations. - **Chores** - Updated documentation to include instructions on pinning Maven artifacts and managing dependency versions effectively. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
A VSCode plugin for feature authoring that detects errors and uses data sampling in order to speed up the iteration cycle. The goal is to reduce the amount of memorizing commands, typing / clicking, waiting for clusters to be spun up, and jobs to finish. In this example, we have a complex expression operating on nested data. The eval button appears above Chronon types. When you click on the Eval button, it samples your data, runs your code and shows errors or transformed result within seconds.  ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [x] Integration tested (see above) - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new Visual Studio Code extension that enhances Python development. - The extension displays an evaluation button alongside specific assignment statements in Python files, allowing users to trigger evaluation commands directly in the terminal. - Added a command to execute evaluation actions related to Zipline AI configurations. - **Documentation** - Added a new LICENSE file containing the MIT License text. - **Configuration** - Introduced new configuration files for TypeScript and Webpack to support the extension's development and build processes. - **Exclusions** - Updated `.gitignore` and added `.vscodeignore` to streamline version control and packaging processes. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Moved scala dependencies to separate scala_2_12 and scala_2_13 repositories so we can load the right repo based on config instead of loading both. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Summary by CodeRabbit - **Chores** - Upgraded Scala dependencies to newer versions with updated verification, ensuring improved stability. - Removed outdated package references to streamline dependency management. - Introduced new repository configurations for Scala 2.12 and 2.13 to enhance dependency management. - Added `.gitignore` entry to exclude `node_modules` in the `authoring/vscode` path. - Created `LICENSE` file with MIT License text for the new extension. - **New Features** - Introduced a Visual Studio Code extension with a CodeLens provider for Python files, allowing users to evaluate variables directly in the editor. - **Refactor** - Updated dependency declarations to utilize a new method for handling Scala artifacts, improving consistency across the project. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Nikhil Simha <[email protected]>
## Summary Adds AWS build and push commands to the distribution script. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced an automated quickstart process for GCP deployments. - Enhanced the build and upload tool with flexible command-line options, supporting artifact creation for both AWS and GCP environments. - Added a new script for running the Zipline quickstart on GCP. - **Refactor** - Updated the AWS quickstart process to ensure consistent execution. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…FilePath and replacing `/` to `.` in MetaData names (#398) ## Summary ^^^ Tested on the etsy laptop. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Improved error handling to explicitly report when configuration values are missing. - **New Features** - Introduced standardized constants for various configuration types, ensuring consistent key naming. - **Refactor** - Unified metadata processing by using direct metadata names instead of file paths. - Enhanced type safety in configuration options for clearer and more reliable behavior. - **Tests** - Updated test cases and parameters to reflect the improved metadata and configuration handling. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Reverts #373 Passing in options to push to only one customer is broken. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Refactor** - Streamlined the deployment process to automatically build and upload artifacts exclusively to Google Cloud Platform. - Removed configuration options and handling for an alternative cloud provider, resulting in a simpler, more focused workflow. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary building join output schema should belong to metadata store - and also reduces the size of fetcher. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced an optimized caching mechanism for data join operations, resulting in improved performance and reliability. - Added new methods to facilitate the creation and management of join codecs. - **Bug Fixes** - Enhanced error handling for join codec operations, ensuring clearer context for failures. - **Documentation** - Improved code readability and clarity through updated comments and method signatures. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Add support to run the fetcher service in docker. Also add rails to publish to docker hub as a private image - [ziplineai/chronon-fetcher](https://hub.docker.com/repository/docker/ziplineai/chronon-fetcher) I wasn't able to sort out logback / log4j2 logging as there's a lot of deps messing things up - Vert.x supports JUL configs and that is seemingly working so starting with that for now. Tested with: ``` docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json \ -p 9000:9000 \ -e "GCP_PROJECT_ID=canary-443022" \ -e "GOOGLE_CLOUD_PROJECT=canary-443022" \ -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance" \ -e "STATSD_HOST=127.0.0.1" \ -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json \ ziplineai/chronon-fetcher ``` And then you can `curl http://localhost:9000/ping` On Etsy side just need to swap out the project and bt instance id and then can curl the actual join: ``` curl -X POST http://localhost:9000/v1/fetch/join/search.ranking.v1_web_zipline_cdc_and_beacon_external -H 'Content-Type: application/json' -d '[{"listing_id":"632126370","shop_id":"53908089","shipping_profile_id":"235561688531"}]' {"results":[{"status":"Success","entityKeys":{"listing_id":"632126370","shop_id":"53908089","shipping_profile_id":"235561688531"},"features":{... ``` ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [X] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added an automation script that streamlines the container image build and publication process with improved error handling. - Introduced a new container configuration that installs essential dependencies, sets environment variables, and incorporates a health check for enhanced reliability. - Implemented a robust logging setup that standardizes console and file outputs with log rotation. - Provided a startup script for the service that verifies required settings and applies platform-specific options for seamless execution. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Adds the ability to push artifacts to aws in addition to gcp. Also adds ability to specify specific customer ids to push to. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new automation script that streamlines the process of building artifacts and deploying them to both AWS and GCP with improved error handling and user confirmation. - **Chores** - Removed a legacy artifact upload script that previously handled only GCP deployments. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - Supporting StagingQueries for configurable compute engines. To support BigQuery, the simplest way is to just write bigquery sql and run it on bq to create the final table. Let's first make the API change. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Summary by CodeRabbit - **New Features** - Added an option for users to specify the compute engine when processing queries, offering choices such as Spark and BigQuery. - Introduced validation to ensure that queries run only with the designated engine. - **Style** - Streamlined code organization for enhanced readability. - Consolidated and reordered import statements for improved clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary fetcher has grown over time into a large file with many large functions that are hard to work with. This refactoring doesn't change any functionality - just placement. Made some of the scala code more idiomatic - if(try.isFailed) - vs try.recoverWith Made Metadata methods more explicit FetcherBase -> JoinPartFetcher + GroupByFetcher + GroupByResponseHandler Added fetch context - to replace 10 constructor params ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a unified configuration context that enhances data fetching, including improved group-by and join operations with more robust error handling. - Added a new `FetchContext` class to manage fetching operations and execution contexts. - Implemented a new `GroupByFetcher` class for efficient group-by data retrieval. - **Refactor** - Upgraded serialization and deserialization to use a more efficient, compact protocol. - Standardized API definitions and type declarations across modules to improve clarity and maintainability. - Enhanced error handling in various methods to provide more informative messages. - **Chores** - Removed outdated utilities and reorganized dependency imports. - Updated test suites to align with the refactored architecture. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - Staging query should in theory already work for external tables without additional code changes as long as we do some setup work to pin up a view first. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary The existing aggregations configure the items sketch incorrectly. Split it into two one that works purely with skewed data, and one that tries to best-effort collect most frequent items. ## Checklist - [x] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced new utility functions to streamline expression composition and cleanup. - Enhanced aggregation descriptions for clearer operation choices. - Added new aggregation types for improved data analysis. - **Refactor** - Revamped frequency analysis logic with improved error handling and optimized sizing. - Replaced legacy histogram approaches with a more robust frequent item detection mechanism. - **Tests** - Added tests to validate heavy hitter detection and skewed data scenarios, while removing obsolete histogram tests. - Updated existing tests to reflect changes in aggregation parameters. - **Chores** - Removed deprecated interactive modules for a leaner deployment. - **Configuration** - Adjusted default aggregation parameters for more consistent processing, including changes to the `k` value in multiple configurations. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Create workflow to trigger platform subtree pull reusable workflow. Also deletes Push To Canary workflow as it will be triggered in the platform repo. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Added a new workflow to automate triggering subtree updates in an external platform repository when changes are pushed to the main branch. - Removed the "Push To Canary" workflow, discontinuing automated artifact builds, canary deployments, integration tests, and related notifications for the main branch. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced advanced planning and orchestration capabilities for offline data processing, including new planners for join and group-by operations. - Added utilities for metadata layering and enriched partition specification handling. - Introduced a structured approach to offline join planning with detailed metadata and node composition. - Added new traits and classes to support batch run contexts and node execution. - Added comprehensive table dependency generation based on joins, group-bys, and sources. - **Improvements** - Expanded partitioning metadata in API definitions for richer temporal semantics. - Updated orchestration schemas with new node types and renamed entities for clarity. - Improved naming conventions by replacing "Keyword" suffixes with "Folder" across configurations. - Streamlined internal logic for table and job naming, dependency resolution, and window operations. - Enhanced error handling and logging in table utilities. - Adjusted snapshot accuracy logic in merge operations for event data models. - Modified tile drift calculation to use a fixed timestamp offset. - **Bug Fixes** - Corrected logic for snapshot accuracy handling in merge operations. - **Refactor** - Centralized utility methods for window arithmetic and partition specification. - Consolidated job context parameters in join part jobs. - Restricted visibility of label join methods for better encapsulation. - Replaced generic bootstrap job classes with join-specific implementations. - Simplified import statements and method signatures for improved clarity. - Delegated left source table name computation to join offline planner. - **Chores** - Updated `.gitignore` to exclude additional directories. - Removed legacy configuration-to-node conversion code and associated dependency resolver tests. - **Documentation** - Improved code comments and formatting for new and existing classes and methods. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added optional fields for partition format and partition interval to query definitions, allowing greater flexibility in specifying partitioning behavior. - **Refactor** - Simplified partition specification usage across the platform by consolidating partition column, format, and interval into a single object. - Updated multiple interfaces and methods to derive partition column and related metadata from the unified partition specification, reducing explicit parameter passing. - Streamlined class and method signatures to improve consistency and maintainability. - Removed deprecated partition specs and adjusted related logic to use the updated partition specification format. - Enhanced SQL clause generation to internally use partition specification details, removing the need to pass partition column explicitly. - Adjusted data generation and query construction logic to rely on the updated partition specification model. - Simplified construction and usage of partition specifications in data processing and metadata components. - Improved handling of partition specs in Spark-related utilities and jobs for consistency. - **Chores** - Updated tests and internal utilities to align with the new partition specification structure. - Reduced test data volume in join tests to optimize test runtime and resource usage. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Thomas Chow <[email protected]>
) ## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Refactor** - Simplified test logic for handling partition dates, making tests rely on the expected data's partition date. - Cleaned up and reordered import statements for improved clarity. - **Tests** - Updated test method signatures and calls to streamline date handling in test comparisons. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: thomaschow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Documentation** - Updated the README with a concise overview and disclaimer about this repository being a fork of Airbnb’s Chronon. - Highlighted key differences including additional connectors, upgraded libraries, performance improvements, and specialized runners. - Clarified deployment options and maintenance practices. - Removed detailed usage instructions, examples, and conceptual explanations. - Noted that full documentation is forthcoming and invited users to contact maintainers for early access. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]>
## Summary Adding an additional partitions argument to table deps Produces dependency that looks like this: `"customJson": "{\"airflow_dependencies\": [{\"name\": \"wf_sample_namespace_sample_table\", \"spec\": \"sample_namespace.sample_table/ds={{ ds }}/_HR=23:00\"}]}",` ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added support for specifying additional partition information when defining table dependencies, allowing for more flexible and detailed dependency configurations. - **Tests** - Updated test cases to include examples with additional partition specifications in table dependencies. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: ezvz <[email protected]>
## Summary Add `https://` protocol to open link to `https://github.com/airbnb/chronon` and not `https://github.com/zipline-ai/chronon/blob/main/github.com/airbnb/chronon` ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Documentation** - Updated the README to include the "https://" protocol in the GitHub URL for Airbnb's Chronon repository. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Sean Lynch <[email protected]>
## Summary My strategy to use a reuseble workflow doesn't work anymore because a private workflow isn't accessible from a public repo. Instead of triggering the sync, this simply runs it from here. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Removed the previous canary release workflow, including automated build, test, and artifact deployment steps for AWS and GCP. - Introduced a new workflow to automate synchronization of code from the chronon repository into the platform repository via subtree pull and push operations. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Add partition_format to python Query object ## Checklist - [x] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added support for specifying a partition format when using the Query function. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Add additional sub partitions to wait for in Query to ultimately compile into GroupBy and Join ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [x] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added support for specifying additional sub-partitions when defining data dependencies, allowing for more granular detection of data arrival. - **Documentation** - Updated function and parameter documentation to reflect new options for specifying sub-partitions in queries. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]>
## Summary As we will be releasing from chronon, this change brings back the canary build and testing. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Introduced a new automated workflow for continuous integration and deployment to canary environments on AWS and GCP. - Added integration tests and artifact uploads for both platforms, with Slack notifications for build or test failures. - Enhanced artifact tracking with detailed metadata and automated cleanup after deployment. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Style** - Reorganized import statements for improved readability. - **Chores** - Removed debugging print statements from partition insertion to clean up console output. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: thomaschow <[email protected]>
## Summary Run push_to_platform on pull request merge only. Also use default message ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Updated workflow to run only after a pull request is merged into the main branch, instead of on every push. - Adjusted the commit message behavior for subtree updates to use the default message. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Removed the synthetic dataset generation script for browser and device fingerprinting data. - Removed related test configurations and documentation for AWS Zipline and Plaid data processing. - Updated AWS release workflow to exclude the "plaid" customer ID from S3 uploads. - Cleaned up commented-out AWS S3 and Glue deletion commands in deployment scripts. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: thomaschow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Removed references to "etsy" as a customer ID from workflows, scripts, and documentation. - Deleted test and configuration files related to "etsy" and sample teams. - Updated Avro schema namespaces and default values from "com.etsy" to "com.customer" and related URLs. - Improved indentation and formatting in sample configuration files. - **Tests** - Updated test arguments and removed obsolete test data related to "etsy". <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…-passing-candidate to line up with publish_release (#760) ## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Updated storage paths for artifact uploads to cloud storage in deployment workflows. - **Documentation** - Corrected a type annotation in the documentation for a query parameter. - **Tests** - Enhanced a test to include and verify a new query parameter. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Walkthrough

The change refactors partition-column handling in `TableUtils.scanDfBase`: the partition column's data type is now inspected and normalized before scanning, as outlined in the sequence diagram below.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant TableUtils
    Caller->>TableUtils: scanDfBase(df, partitionCol, config)
    alt Partition column exists
        TableUtils->>TableUtils: Inspect partition column type
        alt DateType/StringType/TimestampType
            TableUtils->>TableUtils: Apply date_format
        else LongType
            TableUtils->>TableUtils: Apply from_unixtime
        else Unsupported type
            TableUtils->>Caller: Throw UnsupportedOperationException
        end
    else Partition column missing
        TableUtils->>TableUtils: Leave DataFrame unchanged
    end
    TableUtils->>TableUtils: Coalesce DataFrame
    TableUtils-->>Caller: Return DataFrame
```
```scala
}).coalesce(coalesceFactor * parallelism)

val partitionFieldType = df.schema.find(_.name == partitionColumn).map(_.dataType)

val adjustedDf = partitionFieldType match {
```
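For illustration, a self-contained sketch of how the rest of this match could look, based on the sequence diagram above. The `partitionFormat` pattern, the example DataFrame, and the millisecond-to-second conversion are assumptions for the sketch, not a statement of what the PR actually implements:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format, from_unixtime}
import org.apache.spark.sql.types.{DateType, LongType, StringType, TimestampType}

// Hypothetical inputs for the sketch (assumed names, not the PR's code).
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df: DataFrame = spark.range(1).selectExpr("1747068965000L as ds")
val partitionColumn = "ds"
val partitionFormat = "yyyy-MM-dd" // assumed partition date pattern

val partitionFieldType = df.schema.find(_.name == partitionColumn).map(_.dataType)

val adjustedDf = partitionFieldType match {
  case Some(DateType) | Some(StringType) | Some(TimestampType) =>
    df.withColumn(partitionColumn, date_format(col(partitionColumn), partitionFormat))
  case Some(LongType) =>
    // Assumes epoch milliseconds; from_unixtime expects seconds.
    df.withColumn(partitionColumn,
                  date_format(from_unixtime(col(partitionColumn) / 1000), partitionFormat))
  case Some(other) =>
    throw new UnsupportedOperationException(s"Unsupported partition column type: $other")
  case None =>
    df // no partition column: leave the DataFrame unchanged
}
```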
unfortunately now that we use exports the export will just map the timestamps to long type 😂 technically the BQ connector will handle these schema mappings correctly so that we don't have to handle this ourselves. This came about cuz of the new export flow I guess.
yeah, I realized that is what must be going on. bq be crazy
…mapping (#728) ## Summary Updating the JoinSchemaResponse to include a mapping from feature -> listing key. This PR updates our JoinSchemaResponse to include a value info case class with these details. ## Checklist - [X] Added Unit Tests - [X] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Summary by CodeRabbit - **New Features** - Added detailed metadata for join value fields, including feature names, group names, prefixes, left keys, and schema descriptions, now available in join schema responses. - **Bug Fixes** - Improved consistency and validation between join configuration keys and value field metadata. - **Tests** - Enhanced and added tests to validate the presence and correctness of value field metadata in join schema responses. - Introduced new test suites covering fetcher failure scenarios and metadata store functionality. - Refactored existing fetcher tests to use external utility methods for data generation. - Added utility methods for generating deterministic, random, and event-only test data configurations. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Actionable comments posted: 0
🧹 Nitpick comments (3)
spark/src/test/scala/ai/chronon/spark/test/TableUtilsTest.scala (3)
`648-696`: **Good implementation of long partition column testing.**

Test successfully verifies long timestamps are converted to formatted date strings.

Consider adding more test cases for edge dates:

```diff
 List(
   Row(1L, 2, 1747068965000L)
+  // Add tests for edge cases like:
+  // Row(1L, 2, 0L),              // Epoch start
+  // Row(1L, 2, 32503680000000L)  // Year 3000
 )
```

`649-650`: **Use unique table names between tests.**

Both tests use identical table names, which could cause issues if cleanup fails.

```diff
-val tableName = "db.test_partition_column_types"
+val tableName = "db.test_partition_column_long_types"
```

`665-666`: **Add comment explaining timestamp value.**

Document why this specific timestamp was chosen and what date it represents.

```diff
 List(
-  Row(1L, 2, 1747068965000L)
+  Row(1L, 2, 1747068965000L) // 2025-05-12 in milliseconds since epoch
 )
```
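Folding the suggested edge cases together, a rough, self-contained sketch of the expected partition strings might look like the following; the helper name and the UTC assumption are illustrative, not the repository's actual test utilities:

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

// Hypothetical helper: convert epoch-millis partition values to the expected
// "yyyy-MM-dd" partition string, assuming UTC.
def expectedPartition(epochMillis: Long): String =
  LocalDate.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC).toString

val cases = Seq(
  1747068965000L,  // 2025-05-12, the value used in the existing test
  0L,              // epoch start
  32503680000000L  // far-future date, roughly year 3000
)

cases.foreach { millis =>
  println(s"$millis -> ${expectedPartition(millis)}")
}
```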
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (2)
- `spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala` (3 hunks)
- `spark/src/test/scala/ai/chronon/spark/test/TableUtilsTest.scala` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala
🔇 Additional comments (1)
spark/src/test/scala/ai/chronon/spark/test/TableUtilsTest.scala (1)
`617-646`: **LGTM! Test verifies string partition columns work correctly.**

Basic test ensures string partition columns function properly.