
cherry pick oss 888 with column fix #267


Closed
wants to merge 332 commits

Conversation

varant-zlai (Collaborator) commented Jan 23, 2025

Summary

Cherry-picking PR: airbnb/chronon@main...earangol-stripe:chronon:earangol--prefix-column-names-select

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Added DataFrame extension methods for padding fields and renaming columns during join operations.
    • Introduced a new schema comparison utility for testing.
  • Improvements

    • Streamlined column renaming and join processes across multiple Spark-related classes.
    • Enhanced DataFrame manipulation capabilities.
  • Tests

    • Added comprehensive unit tests for new DataFrame extension methods.
    • Created a schema comparison utility for testing.
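For context, a minimal sketch of what the extension methods described above could look like; the names `padFields` and `prefixColumnNames` and their signatures are illustrative, not the exact API in `Extensions.scala`.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

object DataFrameOps {
  implicit class ExtendedDataFrame(df: DataFrame) {

    // Add any columns present in the target schema but missing from df, filled with typed nulls.
    def padFields(targetSchema: StructType): DataFrame = {
      val missing = targetSchema.fields.filterNot(f => df.columns.contains(f.name))
      val padded: Seq[Column] =
        df.columns.map(col).toSeq ++ missing.map(f => lit(null).cast(f.dataType).as(f.name))
      df.select(padded: _*)
    }

    // Prefix every column except the join keys, so joined outputs don't collide.
    def prefixColumnNames(prefix: String, keys: Seq[String]): DataFrame = {
      val projection = df.columns.map { c =>
        if (keys.contains(c)) col(c) else col(c).as(s"${prefix}_$c")
      }
      df.select(projection: _*)
    }
  }
}
```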

piyush-zlai and others added 30 commits September 24, 2024 15:11
Basic service scaffolding using Scala Play framework
# Conflicts:
#	build.sbt
#	hub/app/controllers/FrontendController.scala
#	hub/app/views/index.scala.html
#	hub/public/images/favicon.png
Focus on the fraud case for the demo. Save the parquet file as data.parquet
Adds a README and keeps the Docker container active so the parquet table can be accessed.
Create a Svelte project and build it with Play for deployment
kumar-zlai and others added 10 commits January 23, 2025 13:30
## Summary
We have a strict dependency on Java 11 for all the Dataproc stuff, so it's good to be consistent across our project. Currently only the service_commons package has a strict dependency on Java 17, so changes were made to be compatible with Java 11.

Was able to successfully build using both sbt and Bazel.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Configuration**
- Downgraded Java language and runtime versions from 17 to 11 in Bazel
configuration

- **Code Improvement**
- Updated type handling in `RouteHandlerWrapper` method signature for
enhanced type safety

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…fy query plans

* Define implicit tableUtils to fix test
## Summary
Set up a Flink job that can take beacon data as Avro (configured in GCS) and emit it at a configurable rate to Kafka. We can use this stream in our GB streaming jobs.
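A rough sketch of the throttle-and-emit portion of such a job; the Avro read from GCS is assumed to already produce a `DataStream[String]`, and the broker, topic, and rate parameters are placeholders rather than the actual job's configuration.

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}
import org.apache.flink.streaming.api.datastream.DataStream

// Sleeps between records so the stream emits at roughly `eventsPerSecond`.
class ThrottleFn(eventsPerSecond: Int) extends RichMapFunction[String, String] {
  override def map(value: String): String = {
    Thread.sleep(1000L / math.max(eventsPerSecond, 1))
    value
  }
}

object BeaconEmitter {
  def attachSink(events: DataStream[String], brokers: String, topic: String, eventsPerSecond: Int): Unit = {
    val sink = KafkaSink.builder[String]()
      .setBootstrapServers(brokers)
      .setRecordSerializer(
        KafkaRecordSerializationSchema.builder[String]()
          .setTopic(topic)
          .setValueSerializationSchema(new SimpleStringSchema())
          .build())
      .build()

    events.map(new ThrottleFn(eventsPerSecond)).sinkTo(sink)
  }
}
```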

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
Kicked off the job and you can see events flowing in topic
[test-beacon-main](https://console.cloud.google.com/managedkafka/us-central1/clusters/zipline-kafka-cluster/topics/test-beacon-main?hl=en&invt=Abnpeg&project=canary-443022)

- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

- **New Features**
  - Added Kafka data ingestion capabilities using Apache Flink.
- Introduced a new driver for streaming events from GCS to Kafka with
configurable delay.

- **Dependencies**
  - Added Apache Flink connectors for Kafka, Avro, and file integration.
- Integrated managed Kafka authentication handler for cloud
environments.

- **Infrastructure**
  - Created new project configuration for Kafka data ingestion.
  - Updated build settings to support advanced streaming workflows.
  - Updated cluster name configuration for Dataproc submitter.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
varant-zlai force-pushed the vz--cherry_pick_oss_888_withColumn_fix branch from fd1f7ea to e9dd863 on January 23, 2025 23:50

coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
spark/src/main/scala/ai/chronon/spark/Extensions.scala (1)

169-181: ⚠️ Potential issue

Add type validation to prevent mismatches
Existing columns are not checked for schema compatibility, risking runtime errors.

🧹 Nitpick comments (2)
spark/src/test/scala/ai/chronon/spark/test/TestUtils.scala (1)

341-361: Solid comparison logic. Consider nested structs.

spark/src/main/scala/ai/chronon/spark/Extensions.scala (1)

183-198: Consider column name collisions
Renaming might overwrite existing columns if the prefix or mapping conflicts.
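A hedged sketch of the kind of guards both comments ask for — a schema-compatibility check before padding and a collision check before renaming. The helper names are hypothetical, not the code under review.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

object ExtensionGuards {
  // Fail fast if a column that already exists in df has a different type than the target schema expects.
  def validateTypes(df: DataFrame, targetSchema: StructType): Unit =
    targetSchema.fields.foreach { field =>
      df.schema.fields.find(_.name == field.name).foreach { existing =>
        require(existing.dataType == field.dataType,
          s"Column ${field.name} has type ${existing.dataType}, expected ${field.dataType}")
      }
    }

  // Fail fast if applying the rename map would overwrite a column that is not itself being renamed.
  def validateNoCollisions(df: DataFrame, renames: Map[String, String]): Unit = {
    val untouched = df.columns.toSet -- renames.keySet
    val collisions = renames.values.toSet.intersect(untouched)
    require(collisions.isEmpty,
      s"Renaming would overwrite existing columns: ${collisions.mkString(", ")}")
  }
}
```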

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between fd1f7ea and d8550b5.

📒 Files selected for processing (6)
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/Join.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinBase.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/ExtensionsTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/TestUtils.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • spark/src/main/scala/ai/chronon/spark/LabelJoin.scala
  • spark/src/main/scala/ai/chronon/spark/Join.scala
  • spark/src/test/scala/ai/chronon/spark/test/ExtensionsTest.scala
  • spark/src/main/scala/ai/chronon/spark/JoinBase.scala
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: other_spark_tests
  • GitHub Check: join_spark_tests
  • GitHub Check: fetcher_spark_tests
  • GitHub Check: mutation_spark_tests
  • GitHub Check: table_utils_delta_format_spark_tests
  • GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (2)
spark/src/test/scala/ai/chronon/spark/test/TestUtils.scala (1)

30-30: Good alias usage.

spark/src/main/scala/ai/chronon/spark/Extensions.scala (1)

22-22: Import Looks Fine
This import seems necessary for joinPart usage.

tchow-zlai and others added 15 commits January 24, 2025 15:05
## Summary

- The way tables are created can differ depending on the underlying storage catalog that we are using. In the case of BigQuery, we cannot issue a Spark SQL statement to do this operation. Let's abstract that out into the `Format` layer for now, but eventually we will need a `Catalog` abstraction that supports this.
- Try to remove the dependency on a SparkSession in `Format`; invert the dependency with a HOF (higher-order function). A sketch of this shape follows below.
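A sketch of the shape this could take, assuming the `Format` trait mentioned above; the method names and parameters are illustrative. The SparkSession dependency is inverted by passing the SQL-executing function in as an argument, so formats that cannot use Spark SQL (like BigQuery) simply ignore it.

```scala
import org.apache.spark.sql.types.StructType

trait Format {
  // Create the table using whatever mechanism the underlying catalog supports.
  // `sqlExec` is supplied by the caller, so Format itself never holds a SparkSession.
  def createTable(tableName: String,
                  schema: StructType,
                  partitionColumns: Seq[String],
                  properties: Map[String, String])(sqlExec: String => Unit): Unit
}

// Hive-style formats keep issuing DDL through the provided function (simplified: partition
// columns are not split out of the column list here) ...
class HiveFormat extends Format {
  override def createTable(tableName: String,
                           schema: StructType,
                           partitionColumns: Seq[String],
                           properties: Map[String, String])(sqlExec: String => Unit): Unit = {
    val cols = schema.fields.map(f => s"`${f.name}` ${f.dataType.sql}").mkString(", ")
    val partitionClause =
      if (partitionColumns.isEmpty) "" else partitionColumns.mkString(" PARTITIONED BY (", ", ", ")")
    sqlExec(s"CREATE TABLE IF NOT EXISTS $tableName ($cols)$partitionClause")
  }
}
// ... while a BigQuery format would ignore `sqlExec` and call the BigQuery client directly.
```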

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
	- Enhanced BigQuery integration with new client parameter
	- Added support for more flexible table creation methods
	- Improved logging capabilities for table operations

- **Bug Fixes**
	- Standardized table creation process across different formats
	- Removed unsupported BigQuery table creation operations

- **Refactor**
	- Simplified table creation and partition insertion logic
- Updated method signatures to support more comprehensive table
management
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary


- https://app.asana.com/0/1208949807589885/1209206040434612/f 
- Support explicit BigQuery table creation (a rough sketch follows below).
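A minimal sketch of explicit table creation with the BigQuery Java client; the project, dataset, table, and column names are placeholders, and the partitioning choice is an assumption rather than the actual implementation.

```scala
import com.google.cloud.bigquery.{BigQueryOptions, Field, Schema, StandardSQLTypeName,
  StandardTableDefinition, TableId, TableInfo, TimePartitioning}

object CreateBigQueryTable {
  def main(args: Array[String]): Unit = {
    val bigquery = BigQueryOptions.getDefaultInstance.getService

    val schema = Schema.of(
      Field.of("user_id", StandardSQLTypeName.STRING),
      Field.of("amount", StandardSQLTypeName.FLOAT64),
      Field.of("ds", StandardSQLTypeName.DATE))

    // Day-partition on the ds column, mirroring the usual partition-column convention.
    val partitioning = TimePartitioning.newBuilder(TimePartitioning.Type.DAY).setField("ds").build()

    val definition = StandardTableDefinition.newBuilder()
      .setSchema(schema)
      .setTimePartitioning(partitioning)
      .build()

    bigquery.create(TableInfo.of(TableId.of("my-project", "my_dataset", "my_table"), definition))
  }
}
```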
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
---
- To see the specific tasks where the Asana app for GitHub is being
used, see below:
  - https://app.asana.com/0/0/1209206040434612
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced BigQuery table creation functionality with improved schema
and partitioning support.
	- Streamlined table creation process in Spark's TableUtils.

- **Refactor**
	- Simplified table existence checking logic.
	- Consolidated import statements for better readability.
	- Removed unused import in BigQuery catalog test.
- Updated import statement in GcpFormatProviderTest for better
integration with Spark BigQuery connector.

- **Bug Fixes**
	- Improved error handling for table creation scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
…obs (#274)

## Summary
Changes to avoid runtime errors while running Spark and Flink jobs on the cluster using Dataproc submit for the newly built Bazel assembly jars. Successfully ran the DataprocSubmitterTest jobs using the newly built Bazel jars for testing.

Testing Steps
1. Build the assembly jars
```
bazel build //cloud_gcp:lib_deploy.jar
bazel build //flink:assembly_deploy.jar
bazel build //flink:kafka-assembly_deploy.jar
```
2. Copy the jars to the GCS bucket used by our jobs
```
gsutil cp bazel-bin/cloud_gcp/lib_deploy.jar gs://zipline-jars/bazel-cloud-gcp.jar
gsutil cp bazel-bin/flink/assembly_deploy.jar gs://zipline-jars/bazel-flink.jar
gsutil cp bazel-bin/flink/kafka-assembly_deploy.jar gs://zipline-jars/bazel-flink-kafka.jar
```
3. Modify the jar paths in the DataprocSubmitterTest file and run the
tests

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

- **New Features**
  - Added Kafka and Avro support for Flink.
  - Introduced new Flink assembly and Kafka assembly binaries.

- **Dependency Management**
  - Updated Maven artifact dependencies, including Kafka and Avro.
- Excluded specific Hadoop and Spark-related artifacts to prevent
runtime conflicts.

- **Build Configuration**
  - Enhanced build rules for Flink and Spark environments.
- Improved dependency management to prevent class compatibility issues.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
The string formatting here was breaking zipline commands when I built a new wheel (before Python 3.12, an f-string cannot reuse its own quote character inside the braces):

```
  File "/Users/davidhan/zipline/chronon/dev_chronon/lib/python3.11/site-packages/ai/chronon/repo/hub_uploader.py", line 73
    print(f"\n\nUploading:\n {"\n".join(diffed_entities.keys())}")
                                ^
SyntaxError: unexpected character after line continuation character
```

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Style**
- Improved string formatting and error message construction for better
code readability
    - Enhanced log message clarity in upload process

Note: These changes are internal improvements that do not impact
end-user functionality.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
- With #263 we control table creation ourselves. We don't need to rely on indirect writes to do the table creation (and partitioning) for us; we simply use the storage API to write directly into the table we created. This should be much more performant and is preferred over indirect writes because we don't need to stage data and then load it as a temp BQ table, and it uses the BigQuery storage API directly (a rough sketch of the connector options follows below).
- Remove configs that are used only for indirect writes
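Roughly, with the spark-bigquery connector, the switch looks like the following; the `writeMethod` and `createDisposition` option names come from the connector's documented settings, and the table name is a placeholder.

```scala
import org.apache.spark.sql.DataFrame

object DirectWrite {
  def save(df: DataFrame, table: String): Unit =
    df.write
      .format("bigquery")
      .option("writeMethod", "direct")             // BigQuery Storage Write API, no GCS staging
      .option("createDisposition", "CREATE_NEVER") // we already created the table ourselves
      .option("table", table)
      .mode("append")
      .save()
}
```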

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

- **Improvements**
- Enhanced BigQuery data writing process with more precise configuration
options.
  - Simplified table creation and partition insertion logic.
- Improved handling of DataFrame column arrangements during data
operations.

- **Changes**
  - Updated BigQuery write method to use a direct writing approach.
- Introduced a new option to prevent table creation if it does not
exist.
  - Modified table creation process to be more format-aware.
  - Streamlined partition insertion mechanism.

These updates improve data management and writing efficiency in cloud
data processing workflows.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

- Pull alter table properties functionality into `Format` 
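Continuing the earlier `Format` sketch, the new hook could look roughly like this (signature illustrative):

```scala
trait FormatTableProperties {
  // Apply table properties in a format-specific way: Hive-style formats can emit
  // ALTER TABLE ... SET TBLPROPERTIES through the supplied SQL executor, while a
  // BigQuery format would keep this as a no-op or use its own client.
  def alterTableProperties(tableName: String,
                           properties: Map[String, String])(sqlExec: String => Unit): Unit =
    if (properties.nonEmpty) {
      val props = properties.map { case (k, v) => s"'$k'='$v'" }.mkString(", ")
      sqlExec(s"ALTER TABLE $tableName SET TBLPROPERTIES ($props)")
    }
}
```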

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added a placeholder method for altering table properties in BigQuery
format
- Introduced a new method to modify table properties across different
Spark formats
- Enhanced table creation utility to use format-specific property
alteration methods

- **Refactor**
- Improved table creation process by abstracting table property
modifications
- Standardized approach to handling table property changes across
different formats

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
  - Added configurable repartitioning option for DataFrame writes.
- Introduced a new configuration setting to control repartitioning
behavior.
  - Enhanced test suite with functionality to handle empty DataFrames.

- **Chores**
  - Improved code formatting and logging for DataFrame writing process.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: tchow-zlai <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>

## Summary

we cannot represent absent data in time series with null values because the Thrift serializer doesn't allow nulls in `list<double>`, so we create a magic double value (based on the string "chronon") to represent nulls
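A hedged sketch of the substitution; the constant's value is the one quoted in the follow-up fix later in this PR, and the helper name is illustrative.

```scala
import java.{lang => jl}

object NullSentinel {
  // Sentinel standing in for null, since Thrift's list<double> cannot carry nulls.
  val MagicNullDouble: jl.Double = -2.7980863399423856e16

  // Replace nulls before the series is handed to the Thrift struct.
  def sanitize(series: Seq[jl.Double]): Seq[jl.Double] =
    series.map(v => if (v == null) MagicNullDouble else v)
}
```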

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

- **New Features**
- Introduced a new constant `magicNullDouble` for handling null value
representations

- **Bug Fixes**
- Improved null value handling in serialization and data processing
workflows

- **Tests**
- Added new test case for `TileDriftSeries` serialization to validate
null value management

The changes enhance data processing consistency and provide a
standardized approach to managing null values across the system.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
This fixes an issue where Infinity/NaN values in drift calculations were
causing JSON parse errors in the frontend. These special values are now
converted to our standard magic null value (-2.7980863399423856E16)
before serialization.
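Conceptually the conversion is just a guard applied before serialization; a small sketch, reusing the sentinel value quoted above:

```scala
object DriftSanitizer {
  val MagicNullDouble: Double = -2.7980863399423856e16

  // Map non-finite values to the sentinel so the serialized JSON stays parseable downstream.
  def sanitize(value: Double): Double =
    if (value.isNaN || value.isInfinity) MagicNullDouble else value
}
```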

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Enhanced error handling for special double values (infinity and NaN)
in data processing
- Improved serialization test case to handle null and special values
more robustly

- **Tests**
- Updated test case for `TileDriftSeries` serialization to cover edge
cases with special double values

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Changed the return type to Seq (backed by an ArrayBuffer under the hood). Added unit tests.

## Checklist
- [x] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Updated SQL utility methods to return sequences of results instead of
optional single results.
- Modified result handling in SQL-related utility classes to support
multiple result rows.
- Enhanced logic to handle cases with no valid inputs in data processing
methods.

- **Tests**
	- Updated test cases to accommodate new result handling approach.
- Added new test for handling "explode" invocations in SQL select
clauses.
- Refined handling of null values and special floating-point values in
tests.

- **Bug Fixes**
- Improved error handling and result processing in SQL transformation
methods.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Fixed null handling in the drift test code.

## Checklist
- [ ] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Improved null handling in drift series processing to prevent potential
runtime exceptions
- Enhanced error resilience by safely checking for null values during
data aggregation

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

## Summary

- branch protection requires explicitly adding the GitHub Actions checks that need to succeed before a user can merge a PR. Some GitHub workflows do not qualify to run for a PR (due to path filtering, etc.), yet skipped workflows will still be required to "succeed" for a branch protection rule. This PR adds a blanket workflow that instead polls for the explicitly triggered workflows to succeed.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Added a new GitHub Actions workflow to enforce status checks on pull
requests.
- Reformatted test file for improved code readability without changing
functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->

---------

Co-authored-by: Thomas Chow <[email protected]>
## Summary
Plumbing through a setting that allows us to specify a checkpoint or savepoint location to resume a job from. When we plug into the CLI, we'll ask users to manually specify it if needed; when we set up the orchestrator, it can be triggered as part of the deployment process: trigger a savepoint, deploy the new build using the savepoint, and so on.
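On the Flink side, retaining checkpoints on cancellation uses the standard `CheckpointConfig` API; a small sketch with an arbitrary checkpoint interval (the savepoint URI itself is passed at submission time, which is not shown here):

```scala
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

object CheckpointSetup {
  def configure(env: StreamExecutionEnvironment): Unit = {
    env.enableCheckpointing(60 * 1000L) // checkpoint every minute
    // Keep externalized checkpoints around on cancel so a later run can resume from them.
    env.getCheckpointConfig
      .setExternalizedCheckpointCleanup(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
  }
}
```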

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested - Tested on the cluster and confirmed setting
is picked up
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
	- Enhanced Flink job submission with optional savepoint URI support.
	- Added ability to retain checkpoints when the job is canceled.

- **Improvements**
- Updated job submission configuration to support more flexible
checkpoint management.
- Introduced new constant for savepoint URI in the job submission
process.

- **Technical Updates**
- Modified checkpoint configuration to preserve checkpoints on job
cancellation.
	- Updated test cases to reflect new job submission parameters.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Manual cherry-pick from https://github.com/airbnb/chronon/pull/909/files

Should be rebased and tested off of
#276

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
- [x] Not yet tested, should test after merging 276



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added a builder pattern for `JavaFetcher` to enhance object creation
flexibility.
- Introduced an optional `ExecutionContext` parameter across multiple
classes to improve execution context management.

- **Improvements**
- Updated constructor signatures in Java and Scala classes to support
more configurable execution contexts.
- Enhanced instantiation process with more structured object creation
methods.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ezvz <[email protected]>
Co-authored-by: Nikhil Simha <[email protected]>
Co-authored-by: Thomas Chow <[email protected]>