feat: add support for long partition columns #761
base: main
Conversation
## Summary

Allow setting the partition column name in sources. The column is mapped to the default partition name upon read and during partition checking.

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **New Features**
  - Enabled configurable partition columns in query, join, and data generation operations for improved data partitioning.
- **Refactor**
  - Streamlined partition handling and consolidated import structures to enhance workflow efficiency.
- **Tests**
  - Added test cases for verifying partition column functionality and adjusted data generation volumes for better validation.
  - Introduced new tests specifically for different partition columns to ensure accurate handling of partitioned data.

These enhancements provide increased flexibility and accuracy in managing partitioned datasets during data processing and join operations.

---------

Co-authored-by: ezvz <[email protected]>
Co-authored-by: Nikhil Simha <[email protected]>
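For context, a minimal sketch of what "map the source's partition column to the default partition name upon read" could look like in Spark. The helper name `normalizePartitionColumn`, the column names, and the default `"ds"` are illustrative assumptions, not the repository's actual API:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: rename a source-specific partition column to the
// job-wide default so downstream partition checks see a consistent name.
def normalizePartitionColumn(df: DataFrame,
                             sourcePartitionColumn: String,
                             defaultPartitionColumn: String = "ds"): DataFrame =
  if (sourcePartitionColumn == defaultPartitionColumn) df
  else df.withColumnRenamed(sourcePartitionColumn, defaultPartitionColumn)
```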
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new query to retrieve purchase records with date range filtering. - Enhanced data retrieval by including additional contextual metadata for improved insights. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced dedicated testing workflows covering multiple system components to enhance overall reliability. - Added new test suites for various components to enhance testing granularity. - **Refactor** - Streamlined code organization with improved package structures and consolidated imports across test modules. - **Chores** - Upgraded automated testing configurations with optimized resource settings for improved performance and stability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Adjusted the test execution timeout setting from a longer duration to 900 seconds to ensure tests complete more promptly. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Thomas Chow <[email protected]>
…nd not pushed to remote (#385) ## Summary I've been seeing that it's difficult to track what changes went into artifacts we push to etsy and canary. Especially when it comes to tracking performance regressions for spark jobs one day to the next. Adding a check to not allow any pushes to any customer artifacts if the branch is dirty. All changes need to at least be pushed to remote. And adding a metadata tag of the commit and branch ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced consistency checks during the build and upload process to verify that local changes are committed and branches are in sync. - Enhanced artifact metadata now includes additional context about the code state at the time of upload. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
… top (#380) ## Summary While trying to read the updated beacon top topic we hit issues as the number of avro fields is greater than the Spark codegen limit default of 100. Thanks to this the wholestage codegen code is incorrect and we either end up with segfaults (unit tests) or garbled events (prod flink jobs). This PR bumps the limit to allow us to read beacon top (374 fields) as well as adds an assert in Catalyst util's whole stage code gen code to fail if we encounter this again in the future for a higher number of fields than our current bumped limit. ## Checklist - [X] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced data processing robustness with improved handling and early error detection for large schemas. - Refined SQL query formatting for clearer logical conditions. - **Tests** - Added a new validation for large schema deserialization. - Updated test definitions to improve structure and readability. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - Make the thrift gen python executable, use `py_binary` to support python generally ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Enhanced the build process for a key automation tool by streamlining its execution and command handling, leading to improved overall build reliability and performance. - Transitioned the export mechanism of a Python script to a defined executable binary target for better integration within the build system. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> Co-authored-by: Thomas Chow <[email protected]>
## Summary - Release Notes: https://spark.apache.org/releases/spark-release-3-5-4.html - https://issues.apache.org/jira/browse/SPARK-49791 is a good one for us. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Upgraded underlying Apache Spark libraries to version 3.5.4, delivering enhanced performance, stability, and compatibility. This update improves processing efficiency and backend reliability, ensuring smoother and more secure data operations. End-users may notice more robust and responsive interactions as a result of these improvements, further enhancing overall system performance. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Thomas Chow <[email protected]>
## Summary - Even though I'm eager to get ahead here, let's not go too crazy and accidentally shoot ourselves in the foot. Let's stay pinned to what our clusters have (3.5.1) until those upgrade. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Updated core Spark libraries—impacting SQL, Hive, Streaming, and Avro features—to version 3.5.1 to ensure enhanced stability and improved integration across Spark-powered functionalities. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Grant and I were chatting about the high number of hosts needed for the beacon top Flink jobs (24). This is because the topic parallelism is 96 and we squeeze 4 slots per TM (so 96 / 4 = 24 hosts). Given that folks often over provision Kafka topics in terms of partitions, going with a default of scaling down by 1/4th. Will look into wiring up Flink autoscaling as a follow up to not have this hardcoded. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Refactor** - Optimized stream processing by refining the parallelism calculation. The system now applies a scaling factor to better adjust the number of active processing units, which may result in improved efficiency under certain conditions. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Documentation** - Clarified command instructions and added informational notes to set expectations during initial builds. - **New Features** - Introduced new build options for modular construction of components, including dedicated commands for hub and cloud modules. - Added an automated script to streamline the frontend build process. - **Chores** - Updated container setup and startup processes to utilize revised deployment artifacts. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - trim down tableutils - add iceberg runtime dependency to cloud_gcp ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added a runtime dependency to enhance Spark processing. - Introduced a consolidated method for computing partition ranges. - **Refactor** - Streamlined import sections and simplified join analysis by removing redundant permission checks. - **Bug Fixes** - Removed methods related to table permission checks, impacting access control functionality. - **Tests** - Removed an outdated test for table permission verification. - **Chores** - Updated the project’s dependency configuration to include the new Spark runtime artifact. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary Changed the backend code to only compute 3 percentiles (p5, p50, p95) for returning to the frontend. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Enhanced statistical data processing to consistently handle cases with missing values by using a robust placeholder, ensuring clearer downstream analytics. - Adjusted the percentile chart configuration so that the 95th, 50th, and 5th percentiles are accurately rendered, providing more reliable insights for users. - Relaxed the null ratio validation in summary data, allowing for a broader acceptance of null values, which may affect drift metric interpretations. - **New Features** - Introduced methods for converting percentile strings to index values and filtering percentiles based on user-defined requests, improving data handling and representation. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Changes to support builds/tests with both scala 2.12 and 2.13 versions. By default we build against 2.12 version, pass "--config scala_2.13" option to "bazel build/test" to override it. ScalaFmt seems to be breaking for 2.13 using bazel rules_scala package, [fix](bazel-contrib/rules_scala#1631) is already deployed but a release with that change is not available yet, so temporarily disabled ScalaFmt checks for 2.13 will enable later once the fix is released. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enabled flexible Scala version selection (2.12 and 2.13) for smoother builds and enhanced compatibility. - Introduced a default Scala version constant and a repository rule for improved version management. - Added support for additional Scala 2.13 dependencies in the build configuration. - **Refactor and Improvements** - Streamlined build and dependency management for increased stability and performance. - Consolidated collection conversion utilities to boost reliability in tests and runtime processing. - Enhanced type safety and clarity in collection handling across various modules. - Improved handling of Scala collections and maps throughout the codebase for better type consistency and safety. - Updated method implementations to ensure explicit type conversions, enhancing clarity and preventing runtime errors. - Modified method signatures and internal logic to utilize `Seq` for improved type clarity and consistency. - Enhanced the `maven_artifact` function to accept an optional version parameter for better dependency management. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - #381 introduced the ability to configure a partition column at the node-level. This PR simply fixes a missed spot on the plumbing of the new StagingQuery attribute. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced the query builder to support specifying a partition column, providing greater customization for query formation and partitioning. - **Improvements** - Improved handling of partition columns by introducing a fallback mechanism to ensure valid values are used when necessary. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary To add CI checks for making sure we are able to build and test all modules on both scala 2.12 and 2.13 versions. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Updated automated testing workflows to support Scala 2.12 and added new workflows for Scala 2.13, ensuring consistent testing for both Spark and non-Spark modules. - **Documentation** - Enhanced build instructions with updated commands for creating Uber Jars and new automation shortcuts to streamline code formatting, committing, and pushing changes. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Added pinning support for both our maven and spark repositories so we don't have to resolve them during builds. Going forward whenever we make any updates to the artifacts in either maven or spark repositories, we would need to re-pin the changed repos using following commands and check-in the updated json files. ``` REPIN=1 bazel run @maven//:pin REPIN=1 bazel run @spark//:pin ``` ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Integrated enhanced repository management for Maven and Spark, providing improved dependency installation. - Added support for JSON configuration files for Maven and Spark installations. - **Chores** - Updated documentation to include instructions on pinning Maven artifacts and managing dependency versions effectively. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
A VSCode plugin for feature authoring that detects errors and uses data sampling in order to speed up the iteration cycle. The goal is to reduce the amount of memorizing commands, typing / clicking, waiting for clusters to be spun up, and jobs to finish. In this example, we have a complex expression operating on nested data. The eval button appears above Chronon types. When you click on the Eval button, it samples your data, runs your code and shows errors or transformed result within seconds.  ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [x] Integration tested (see above) - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new Visual Studio Code extension that enhances Python development. - The extension displays an evaluation button alongside specific assignment statements in Python files, allowing users to trigger evaluation commands directly in the terminal. - Added a command to execute evaluation actions related to Zipline AI configurations. - **Documentation** - Added a new LICENSE file containing the MIT License text. - **Configuration** - Introduced new configuration files for TypeScript and Webpack to support the extension's development and build processes. - **Exclusions** - Updated `.gitignore` and added `.vscodeignore` to streamline version control and packaging processes. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Moved scala dependencies to separate scala_2_12 and scala_2_13 repositories so we can load the right repo based on config instead of loading both. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Summary by CodeRabbit - **Chores** - Upgraded Scala dependencies to newer versions with updated verification, ensuring improved stability. - Removed outdated package references to streamline dependency management. - Introduced new repository configurations for Scala 2.12 and 2.13 to enhance dependency management. - Added `.gitignore` entry to exclude `node_modules` in the `authoring/vscode` path. - Created `LICENSE` file with MIT License text for the new extension. - **New Features** - Introduced a Visual Studio Code extension with a CodeLens provider for Python files, allowing users to evaluate variables directly in the editor. - **Refactor** - Updated dependency declarations to utilize a new method for handling Scala artifacts, improving consistency across the project. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Nikhil Simha <[email protected]>
## Summary Adds AWS build and push commands to the distribution script. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced an automated quickstart process for GCP deployments. - Enhanced the build and upload tool with flexible command-line options, supporting artifact creation for both AWS and GCP environments. - Added a new script for running the Zipline quickstart on GCP. - **Refactor** - Updated the AWS quickstart process to ensure consistent execution. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…FilePath and replacing `/` to `.` in MetaData names (#398) ## Summary ^^^ Tested on the etsy laptop. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Bug Fixes** - Improved error handling to explicitly report when configuration values are missing. - **New Features** - Introduced standardized constants for various configuration types, ensuring consistent key naming. - **Refactor** - Unified metadata processing by using direct metadata names instead of file paths. - Enhanced type safety in configuration options for clearer and more reliable behavior. - **Tests** - Updated test cases and parameters to reflect the improved metadata and configuration handling. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Reverts #373 Passing in options to push to only one customer is broken. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Refactor** - Streamlined the deployment process to automatically build and upload artifacts exclusively to Google Cloud Platform. - Removed configuration options and handling for an alternative cloud provider, resulting in a simpler, more focused workflow. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary building join output schema should belong to metadata store - and also reduces the size of fetcher. ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced an optimized caching mechanism for data join operations, resulting in improved performance and reliability. - Added new methods to facilitate the creation and management of join codecs. - **Bug Fixes** - Enhanced error handling for join codec operations, ensuring clearer context for failures. - **Documentation** - Improved code readability and clarity through updated comments and method signatures. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Add support to run the fetcher service in docker. Also add rails to publish to docker hub as a private image - [ziplineai/chronon-fetcher](https://hub.docker.com/repository/docker/ziplineai/chronon-fetcher) I wasn't able to sort out logback / log4j2 logging as there's a lot of deps messing things up - Vert.x supports JUL configs and that is seemingly working so starting with that for now. Tested with: ``` docker run -v ~/.config/gcloud/application_default_credentials.json:/gcp/credentials.json \ -p 9000:9000 \ -e "GCP_PROJECT_ID=canary-443022" \ -e "GOOGLE_CLOUD_PROJECT=canary-443022" \ -e "GCP_BIGTABLE_INSTANCE_ID=zipline-canary-instance" \ -e "STATSD_HOST=127.0.0.1" \ -e GOOGLE_APPLICATION_CREDENTIALS=/gcp/credentials.json \ ziplineai/chronon-fetcher ``` And then you can `curl http://localhost:9000/ping` On Etsy side just need to swap out the project and bt instance id and then can curl the actual join: ``` curl -X POST http://localhost:9000/v1/fetch/join/search.ranking.v1_web_zipline_cdc_and_beacon_external -H 'Content-Type: application/json' -d '[{"listing_id":"632126370","shop_id":"53908089","shipping_profile_id":"235561688531"}]' {"results":[{"status":"Success","entityKeys":{"listing_id":"632126370","shop_id":"53908089","shipping_profile_id":"235561688531"},"features":{... ``` ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [X] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added an automation script that streamlines the container image build and publication process with improved error handling. - Introduced a new container configuration that installs essential dependencies, sets environment variables, and incorporates a health check for enhanced reliability. - Implemented a robust logging setup that standardizes console and file outputs with log rotation. - Provided a startup script for the service that verifies required settings and applies platform-specific options for seamless execution. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Adds the ability to push artifacts to aws in addition to gcp. Also adds ability to specify specific customer ids to push to. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a new automation script that streamlines the process of building artifacts and deploying them to both AWS and GCP with improved error handling and user confirmation. - **Chores** - Removed a legacy artifact upload script that previously handled only GCP deployments. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - Supporting StagingQueries for configurable compute engines. To support BigQuery, the simplest way is to just write bigquery sql and run it on bq to create the final table. Let's first make the API change. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Summary by CodeRabbit - **New Features** - Added an option for users to specify the compute engine when processing queries, offering choices such as Spark and BigQuery. - Introduced validation to ensure that queries run only with the designated engine. - **Style** - Streamlined code organization for enhanced readability. - Consolidated and reordered import statements for improved clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary fetcher has grown over time into a large file with many large functions that are hard to work with. This refactoring doesn't change any functionality - just placement. Made some of the scala code more idiomatic - if(try.isFailed) - vs try.recoverWith Made Metadata methods more explicit FetcherBase -> JoinPartFetcher + GroupByFetcher + GroupByResponseHandler Added fetch context - to replace 10 constructor params ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced a unified configuration context that enhances data fetching, including improved group-by and join operations with more robust error handling. - Added a new `FetchContext` class to manage fetching operations and execution contexts. - Implemented a new `GroupByFetcher` class for efficient group-by data retrieval. - **Refactor** - Upgraded serialization and deserialization to use a more efficient, compact protocol. - Standardized API definitions and type declarations across modules to improve clarity and maintainability. - Enhanced error handling in various methods to provide more informative messages. - **Chores** - Removed outdated utilities and reorganized dependency imports. - Updated test suites to align with the refactored architecture. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - Staging query should in theory already work for external tables without additional code changes as long as we do some setup work to pin up a view first. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: Thomas Chow <[email protected]>
## Summary The existing aggregations configure the items sketch incorrectly. Split it into two one that works purely with skewed data, and one that tries to best-effort collect most frequent items. ## Checklist - [x] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced new utility functions to streamline expression composition and cleanup. - Enhanced aggregation descriptions for clearer operation choices. - Added new aggregation types for improved data analysis. - **Refactor** - Revamped frequency analysis logic with improved error handling and optimized sizing. - Replaced legacy histogram approaches with a more robust frequent item detection mechanism. - **Tests** - Added tests to validate heavy hitter detection and skewed data scenarios, while removing obsolete histogram tests. - Updated existing tests to reflect changes in aggregation parameters. - **Chores** - Removed deprecated interactive modules for a leaner deployment. - **Configuration** - Adjusted default aggregation parameters for more consistent processing, including changes to the `k` value in multiple configurations. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Create workflow to trigger platform subtree pull reusable workflow. Also deletes Push To Canary workflow as it will be triggered in the platform repo. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Added a new workflow to automate triggering subtree updates in an external platform repository when changes are pushed to the main branch. - Removed the "Push To Canary" workflow, discontinuing automated artifact builds, canary deployments, integration tests, and related notifications for the main branch. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Introduced advanced planning and orchestration capabilities for offline data processing, including new planners for join and group-by operations. - Added utilities for metadata layering and enriched partition specification handling. - Introduced a structured approach to offline join planning with detailed metadata and node composition. - Added new traits and classes to support batch run contexts and node execution. - Added comprehensive table dependency generation based on joins, group-bys, and sources. - **Improvements** - Expanded partitioning metadata in API definitions for richer temporal semantics. - Updated orchestration schemas with new node types and renamed entities for clarity. - Improved naming conventions by replacing "Keyword" suffixes with "Folder" across configurations. - Streamlined internal logic for table and job naming, dependency resolution, and window operations. - Enhanced error handling and logging in table utilities. - Adjusted snapshot accuracy logic in merge operations for event data models. - Modified tile drift calculation to use a fixed timestamp offset. - **Bug Fixes** - Corrected logic for snapshot accuracy handling in merge operations. - **Refactor** - Centralized utility methods for window arithmetic and partition specification. - Consolidated job context parameters in join part jobs. - Restricted visibility of label join methods for better encapsulation. - Replaced generic bootstrap job classes with join-specific implementations. - Simplified import statements and method signatures for improved clarity. - Delegated left source table name computation to join offline planner. - **Chores** - Updated `.gitignore` to exclude additional directories. - Removed legacy configuration-to-node conversion code and associated dependency resolver tests. - **Documentation** - Improved code comments and formatting for new and existing classes and methods. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added optional fields for partition format and partition interval to query definitions, allowing greater flexibility in specifying partitioning behavior. - **Refactor** - Simplified partition specification usage across the platform by consolidating partition column, format, and interval into a single object. - Updated multiple interfaces and methods to derive partition column and related metadata from the unified partition specification, reducing explicit parameter passing. - Streamlined class and method signatures to improve consistency and maintainability. - Removed deprecated partition specs and adjusted related logic to use the updated partition specification format. - Enhanced SQL clause generation to internally use partition specification details, removing the need to pass partition column explicitly. - Adjusted data generation and query construction logic to rely on the updated partition specification model. - Simplified construction and usage of partition specifications in data processing and metadata components. - Improved handling of partition specs in Spark-related utilities and jobs for consistency. - **Chores** - Updated tests and internal utilities to align with the new partition specification structure. - Reduced test data volume in join tests to optimize test runtime and resource usage. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Thomas Chow <[email protected]>
) ## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Refactor** - Simplified test logic for handling partition dates, making tests rely on the expected data's partition date. - Cleaned up and reordered import statements for improved clarity. - **Tests** - Updated test method signatures and calls to streamline date handling in test comparisons. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: thomaschow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [x] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Documentation** - Updated the README with a concise overview and disclaimer about this repository being a fork of Airbnb’s Chronon. - Highlighted key differences including additional connectors, upgraded libraries, performance improvements, and specialized runners. - Clarified deployment options and maintenance practices. - Removed detailed usage instructions, examples, and conceptual explanations. - Noted that full documentation is forthcoming and invited users to contact maintainers for early access. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]>
## Summary Adding an additional partitions argument to table deps Produces dependency that looks like this: `"customJson": "{\"airflow_dependencies\": [{\"name\": \"wf_sample_namespace_sample_table\", \"spec\": \"sample_namespace.sample_table/ds={{ ds }}/_HR=23:00\"}]}",` ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added support for specifying additional partition information when defining table dependencies, allowing for more flexible and detailed dependency configurations. - **Tests** - Updated test cases to include examples with additional partition specifications in table dependencies. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: ezvz <[email protected]>
## Summary Add `https://` protocol to open link to `https://github.com/airbnb/chronon` and not `https://github.com/zipline-ai/chronon/blob/main/github.com/airbnb/chronon` ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Documentation** - Updated the README to include the "https://" protocol in the GitHub URL for Airbnb's Chronon repository. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Sean Lynch <[email protected]>
## Summary My strategy to use a reuseble workflow doesn't work anymore because a private workflow isn't accessible from a public repo. Instead of triggering the sync, this simply runs it from here. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Removed the previous canary release workflow, including automated build, test, and artifact deployment steps for AWS and GCP. - Introduced a new workflow to automate synchronization of code from the chronon repository into the platform repository via subtree pull and push operations. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Add partition_format to python Query object ## Checklist - [x] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added support for specifying a partition format when using the Query function. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Add additional sub partitions to wait for in Query to ultimately compile into GroupBy and Join ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [x] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added support for specifying additional sub-partitions when defining data dependencies, allowing for more granular detection of data arrival. - **Documentation** - Updated function and parameter documentation to reflect new options for specifying sub-partitions in queries. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ezvz <[email protected]>
## Summary As we will be releasing from chronon, this change brings back the canary build and testing. ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Introduced a new automated workflow for continuous integration and deployment to canary environments on AWS and GCP. - Added integration tests and artifact uploads for both platforms, with Slack notifications for build or test failures. - Enhanced artifact tracking with detailed metadata and automated cleanup after deployment. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Style** - Reorganized import statements for improved readability. - **Chores** - Removed debugging print statements from partition insertion to clean up console output. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: thomaschow <[email protected]>
## Summary Run push_to_platform on pull request merge only. Also use default message ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Updated workflow to run only after a pull request is merged into the main branch, instead of on every push. - Adjusted the commit message behavior for subtree updates to use the default message. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Removed the synthetic dataset generation script for browser and device fingerprinting data. - Removed related test configurations and documentation for AWS Zipline and Plaid data processing. - Updated AWS release workflow to exclude the "plaid" customer ID from S3 uploads. - Cleaned up commented-out AWS S3 and Glue deletion commands in deployment scripts. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- av pr metadata This information is embedded by the av CLI when creating PRs to track the status of stacks when using Aviator. Please do not delete or edit this section of the PR. ``` {"parent":"main","parentHead":"","trunk":"main"} ``` --> --------- Co-authored-by: thomaschow <[email protected]>
## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Removed references to "etsy" as a customer ID from workflows, scripts, and documentation. - Deleted test and configuration files related to "etsy" and sample teams. - Updated Avro schema namespaces and default values from "com.etsy" to "com.customer" and related URLs. - Improved indentation and formatting in sample configuration files. - **Tests** - Updated test arguments and removed obsolete test data related to "etsy". <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…-passing-candidate to line up with publish_release (#760) ## Summary ## Checklist - [ ] Added Unit Tests - [ ] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **Chores** - Updated storage paths for artifact uploads to cloud storage in deployment workflows. - **Documentation** - Corrected a type annotation in the documentation for a query parameter. - **Tests** - Enhanced a test to include and verify a new query parameter. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Walkthrough

The change refactors partition-column handling in `TableUtils.scanDfBase`: the partition column's data type is now inspected and normalized before scanning, as outlined in the sequence diagram below.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant TableUtils
    Caller->>TableUtils: scanDfBase(df, partitionCol, config)
    alt Partition column exists
        TableUtils->>TableUtils: Inspect partition column type
        alt DateType/StringType/TimestampType
            TableUtils->>TableUtils: Apply date_format
        else LongType
            TableUtils->>TableUtils: Apply from_unixtime
        else Unsupported type
            TableUtils->>Caller: Throw UnsupportedOperationException
        end
    else Partition column missing
        TableUtils->>TableUtils: Leave DataFrame unchanged
    end
    TableUtils->>TableUtils: Coalesce DataFrame
    TableUtils-->>Caller: Return DataFrame
```
```scala
}).coalesce(coalesceFactor * parallelism)

val partitionFieldType = df.schema.find(_.name == partitionColumn).map(_.dataType)

val adjustedDf = partitionFieldType match {
```
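For illustration, a self-contained sketch of how the rest of this match could look, based on the sequence diagram above. The `partitionFormat` pattern, the example DataFrame, and the millisecond-to-second conversion are assumptions for the sketch, not a statement of what the PR actually implements:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format, from_unixtime}
import org.apache.spark.sql.types.{DateType, LongType, StringType, TimestampType}

// Hypothetical inputs for the sketch (assumed names, not the PR's code).
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df: DataFrame = spark.range(1).selectExpr("1747068965000L as ds")
val partitionColumn = "ds"
val partitionFormat = "yyyy-MM-dd" // assumed partition date pattern

val partitionFieldType = df.schema.find(_.name == partitionColumn).map(_.dataType)

val adjustedDf = partitionFieldType match {
  case Some(DateType) | Some(StringType) | Some(TimestampType) =>
    df.withColumn(partitionColumn, date_format(col(partitionColumn), partitionFormat))
  case Some(LongType) =>
    // Assumes epoch milliseconds; from_unixtime expects seconds.
    df.withColumn(partitionColumn,
                  date_format(from_unixtime(col(partitionColumn) / 1000), partitionFormat))
  case Some(other) =>
    throw new UnsupportedOperationException(s"Unsupported partition column type: $other")
  case None =>
    df // no partition column: leave the DataFrame unchanged
}
```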
unfortunately now that we use exports the export will just map the timestamps to long type 😂 technically the BQ connector will handle these schema mappings correctly so that we don't have to handle this ourselves. This came about cuz of the new export flow I guess.
yeah, I realized that is what must be going on. bq be crazy
…mapping (#728) ## Summary Updating the JoinSchemaResponse to include a mapping from feature -> listing key. This PR updates our JoinSchemaResponse to include a value info case class with these details. ## Checklist - [X] Added Unit Tests - [X] Covered by existing CI - [ ] Integration tested - [ ] Documentation update <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Summary by CodeRabbit - **New Features** - Added detailed metadata for join value fields, including feature names, group names, prefixes, left keys, and schema descriptions, now available in join schema responses. - **Bug Fixes** - Improved consistency and validation between join configuration keys and value field metadata. - **Tests** - Enhanced and added tests to validate the presence and correctness of value field metadata in join schema responses. - Introduced new test suites covering fetcher failure scenarios and metadata store functionality. - Refactored existing fetcher tests to use external utility methods for data generation. - Added utility methods for generating deterministic, random, and event-only test data configurations. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Actionable comments posted: 0
🧹 Nitpick comments (3)
spark/src/test/scala/ai/chronon/spark/test/TableUtilsTest.scala (3)
`648-696`: **Good implementation of long partition column testing.**

Test successfully verifies long timestamps are converted to formatted date strings.

Consider adding more test cases for edge dates:

```diff
 List(
   Row(1L, 2, 1747068965000L)
+  // Add tests for edge cases like:
+  // Row(1L, 2, 0L),              // Epoch start
+  // Row(1L, 2, 32503680000000L)  // Year 3000
 )
```

`649-650`: **Use unique table names between tests.**

Both tests use identical table names, which could cause issues if cleanup fails.

```diff
-val tableName = "db.test_partition_column_types"
+val tableName = "db.test_partition_column_long_types"
```

`665-666`: **Add comment explaining timestamp value.**

Document why this specific timestamp was chosen and what date it represents.

```diff
 List(
-  Row(1L, 2, 1747068965000L)
+  Row(1L, 2, 1747068965000L) // 2025-05-12 in milliseconds since epoch
 )
```
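Folding the suggested edge cases together, a rough, self-contained sketch of the expected partition strings might look like the following; the helper name and the UTC assumption are illustrative, not the repository's actual test utilities:

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

// Hypothetical helper: convert epoch-millis partition values to the expected
// "yyyy-MM-dd" partition string, assuming UTC.
def expectedPartition(epochMillis: Long): String =
  LocalDate.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC).toString

val cases = Seq(
  1747068965000L,  // 2025-05-12, the value used in the existing test
  0L,              // epoch start
  32503680000000L  // far-future date, roughly year 3000
)

cases.foreach { millis =>
  println(s"$millis -> ${expectedPartition(millis)}")
}
```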
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (2)
- `spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala` (3 hunks)
- `spark/src/test/scala/ai/chronon/spark/test/TableUtilsTest.scala` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala
🔇 Additional comments (1)
spark/src/test/scala/ai/chronon/spark/test/TableUtilsTest.scala (1)
`617-646`: **LGTM! Test verifies string partition columns work correctly.**

Basic test ensures string partition columns function properly.