Skip to content

Branch versioning logic #163

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jan 10, 2025
Merged

Branch versioning logic #163

merged 12 commits into from
Jan 10, 2025

Conversation

nikhil-zlai
Copy link
Contributor

@nikhil-zlai nikhil-zlai commented Dec 28, 2024

Summary

Writing down the cases & logic behind branching. Will continue writing about how these are internally represented.

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Corrected a typo in TableDependency struct, renaming forceComputae to forceCompute.
  • New Features

    • Added support for branch-based workflows in data orchestration.
    • Introduced RepoIndex for managing repository data and versions.
    • Created SequenceMap utility for managing unique value sequences.
    • Added TablePrinter for formatted data output.
    • Introduced VersionUpdate class for tracking version changes.
    • Added StringExtensions for MD5 hashing of strings.
  • Refactoring

    • Removed LineageIndex and LogicalSet classes.
    • Updated naming conventions for logical nodes.
    • Restructured repository parsing and version tracking mechanisms.
  • Documentation

    • Added comprehensive README for branch workflow management.
    • Expanded documentation for global planning and batch workload processing.
  • Chores

    • Updated logging dependencies.
    • Modified build configuration.
    • Added new test specifications.

Copy link

coderabbitai bot commented Dec 28, 2024

Walkthrough

This pull request introduces corrections and enhancements to the Chronon orchestration system. Key changes include fixing a typographical error in the orchestration.thrift file, the addition of documentation regarding branch support in the README.md, and the introduction of new classes such as RepoIndex, RepoNodes, and VersionUpdate. Additionally, several legacy classes were removed, and naming conventions for logical nodes were standardized.

Changes

File Change Summary
api/thrift/orchestration.thrift Fixed typo: Renamed forceComputae to forceCompute
orchestration/README.md Added documentation on branch support and pipeline management
orchestration/src/main/scala/ai/chronon/orchestration/LineageIndex.scala Removed LineageIndex class and related functionality
orchestration/src/main/scala/ai/chronon/orchestration/LogicalSet.scala Removed LogicalSet class
orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala New implementation for repository data management
orchestration/src/main/scala/ai/chronon/orchestration/RepoNodes.scala New class for managing repository nodes
orchestration/src/main/scala/ai/chronon/orchestration/logical/*NodeImpl.scala Updated name methods with namespace prefixes
orchestration/src/main/scala/ai/chronon/orchestration/RepoParser.scala Restructured methods for file handling and hash computation
orchestration/src/main/scala/ai/chronon/orchestration/VersionUpdate.scala New class for tracking version updates
orchestration/src/main/scala/ai/chronon/orchestration/utils/* Introduced utility classes for string extensions and sequence mapping
build.sbt Updated logging dependencies and resource configurations
orchestration/src/test/scala/ai/chronon/orchestration/test/*Spec.scala Added tests for RepoIndex and SequenceMap

Possibly related PRs

Suggested reviewers

  • piyush-zlai

Poem

🌟 Code's symphony, a refactored dance,
Nodes renamed with elegant stance,
Repositories bloom, versions take flight,
Orchestration's magic shines so bright! 🎵

Warning

Review ran into problems

🔥 Problems

GitHub Actions: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 9c58f34 and de72d1f.

📒 Files selected for processing (1)
  • build.sbt (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • build.sbt
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: join_spark_tests
  • GitHub Check: other_spark_tests
  • GitHub Check: table_utils_delta_format_spark_tests
  • GitHub Check: fetcher_spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: mutation_spark_tests
  • GitHub Check: no_spark_scala_tests

Finishing Touches

  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
orchestration/README.md (1)

14-16: Fix abbreviations in examples

Use proper abbreviations: "e.g.," instead of "eg.," and add period after "etc."

🧰 Tools
🪛 LanguageTool

[uncategorized] ~14-~14: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...tic updates - that change output data - eg., logic within where-s, selects, aggrega...

(E_G)


[style] ~14-~14: In American English, abbreviations like “etc.” require a period.
Context: ...c within where-s, selects, aggregations etc - make non-semantic updates - that don'...

(ETC_PERIOD)


[uncategorized] ~15-~15: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...dates - that don't change output data - eg., spark exec memory, # of pods / workers...

(E_G)


[style] ~15-~15: In American English, abbreviations like “etc.” require a period.
Context: ... spark exec memory, # of pods / workers etc - *note that everything that is store...

(ETC_PERIOD)


[uncategorized] ~16-~16: Did you mean the plural noun “APIs” instead of the possessive?
Context: ... is stored in metadata field within our API's are non-semantic fields* - add new node...

(APOS_ARE)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 6d3b6ca and 76d15ee.

📒 Files selected for processing (2)
  • api/thrift/orchestration.thrift (1 hunks)
  • orchestration/README.md (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • api/thrift/orchestration.thrift
🧰 Additional context used
🪛 LanguageTool
orchestration/README.md

[uncategorized] ~14-~14: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...tic updates - that change output data - eg., logic within where-s, selects, aggrega...

(E_G)


[style] ~14-~14: In American English, abbreviations like “etc.” require a period.
Context: ...c within where-s, selects, aggregations etc - make non-semantic updates - that don'...

(ETC_PERIOD)


[uncategorized] ~15-~15: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...dates - that don't change output data - eg., spark exec memory, # of pods / workers...

(E_G)


[style] ~15-~15: In American English, abbreviations like “etc.” require a period.
Context: ... spark exec memory, # of pods / workers etc - *note that everything that is store...

(ETC_PERIOD)


[uncategorized] ~16-~16: Did you mean the plural noun “APIs” instead of the possessive?
Context: ... is stored in metadata field within our API's are non-semantic fields* - add new node...

(APOS_ARE)


[style] ~19-~19: Consider shortening or rephrasing this to strengthen your wording.
Context: ...add new nodes - delete existing nodes - make changes to several compute nodes at once - decide ...

(MAKE_CHANGES)


[uncategorized] ~36-~36: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...anges to node notated using a plus "+". Eg., Join J3 becomes J3+ Non-Semantic ...

(E_G)


[grammar] ~85-~85: It seems that a verb or a comma is missing.
Context: ...ory of the user is behind remote, > we will a lot more changes than the user intend...

(MD_DT_JJ)


[style] ~87-~87: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... mitigate this is to, only make the CLI only pick up changes to files listed as edit...

(ADVERB_REPETITION_PREMIUM)


[uncategorized] ~90-~90: Possible missing article found.
Context: ... branch. Another approach is to force user to rebase on any change to the repo. Ho...

(AI_HYDRA_LEO_MISSING_THE)


[grammar] ~93-~93: This expression is usually spelled with a hyphen.
Context: ...e job is running is accounted for. ### Non Semantic updates Instead, if sq1 changes non-s...

(NON_ANTI_JJ)


[grammar] ~115-~115: The verb ‘compute’ does not usually follow articles like ‘the’. Check that ‘compute’ is spelled correctly; using ‘compute’ as a noun may be non-standard.
Context: ...s who haven't updated sq1 are causing the compute, we should use sq1 - if another branc...

(A_INFINITIVE)


[style] ~116-~116: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...ng sq1 non semantically to sq1** we need to use that instead. ### Adding new node...

(REP_NEED_TO_VB)


[uncategorized] ~167-~167: Possible missing article found.
Context: ...his is not a leaf node, absorbing it as same as a leaf node change would be the rig...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~191-~191: The adjective or adverb “straightforward” (= easy, precise) is spelled as one word.
Context: ... existing nodes Deleting leaf nodes is straight forward. We just need to program a cleanup mech...

(STRAIGHT_FORWARD_COMPOUND)


[formatting] ~234-~234: Consider inserting a comma after an introductory phrase for better readability.
Context: ... node - or the compilation would fail. In these cases it would be ideal to garbage collect up...

(IN_THAT_CASE_COMMA)


[uncategorized] ~234-~234: Possible missing article found.
Context: ...es it would be ideal to garbage collect upstream chain of the deleted node. ```mermaid ...

(AI_HYDRA_LEO_MISSING_THE)


[typographical] ~260-~260: Consider inserting a comma for improved readability.
Context: ...e development on a branch is in progress we need to create temporary data assets fo...

(MISSING_COMMAS)

🪛 Markdownlint (0.37.0)
orchestration/README.md

28-28: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (5)
orchestration/README.md (5)

84-92: Elaborate on synchronization solutions

The synchronization approaches need more detail:

  • How will the CLI detect file changes from git?
  • What happens if remote changes occur during job execution?

Would you like me to help draft a more detailed synchronization strategy?

🧰 Tools
🪛 LanguageTool

[grammar] ~85-~85: It seems that a verb or a comma is missing.
Context: ...ory of the user is behind remote, > we will a lot more changes than the user intend...

(MD_DT_JJ)


[style] ~87-~87: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... mitigate this is to, only make the CLI only pick up changes to files listed as edit...

(ADVERB_REPETITION_PREMIUM)


[uncategorized] ~90-~90: Possible missing article found.
Context: ... branch. Another approach is to force user to rebase on any change to the repo. Ho...

(AI_HYDRA_LEO_MISSING_THE)


113-117: Clarify version conflict resolution

Add details about:

  • Priority order when multiple branches modify the same node
  • Conflict resolution strategy
🧰 Tools
🪛 LanguageTool

[grammar] ~115-~115: The verb ‘compute’ does not usually follow articles like ‘the’. Check that ‘compute’ is spelled correctly; using ‘compute’ as a noun may be non-standard.
Context: ...s who haven't updated sq1 are causing the compute, we should use sq1 - if another branc...

(A_INFINITIVE)


[style] ~116-~116: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...ng sq1 non semantically to sq1** we need to use that instead. ### Adding new node...

(REP_NEED_TO_VB)


166-169: Expand on SQL migration scenario

Explain:

  • Why is this case treated differently?
  • What are the implications for data consistency?
🧰 Tools
🪛 LanguageTool

[uncategorized] ~167-~167: Possible missing article found.
Context: ...his is not a leaf node, absorbing it as same as a leaf node change would be the rig...

(AI_HYDRA_LEO_MISSING_THE)


191-193: Detail cleanup mechanism

Document:

  • Cleanup timing and triggers
  • Data retention policy
  • Pipeline removal process
🧰 Tools
🪛 LanguageTool

[uncategorized] ~191-~191: The adjective or adverb “straightforward” (= easy, precise) is spelled as one word.
Context: ... existing nodes Deleting leaf nodes is straight forward. We just need to program a cleanup mech...

(STRAIGHT_FORWARD_COMPOUND)


273-278: Expand merge documentation

Add critical details about:

  • Merge conflict resolution
  • Data migration process
  • Validation steps

Would you like me to help draft a comprehensive merge process guide?

@nikhil-zlai nikhil-zlai changed the title [wip] branch logic write up [WIP] Branch versioning logic Jan 4, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (9)
orchestration/src/main/scala/ai/chronon/orchestration/utils/SequenceMap.scala (3)

5-8: Doc is brief. Maybe clarify thread safety.


9-23: In-memory maps not thread-safe.


51-59: Potential index for missing keys could surprise callers.

orchestration/src/test/scala/ai/chronon/orchestration/test/SequenceMapSpec.scala (1)

7-134: Unit tests are thorough. Consider concurrency tests if needed.

orchestration/src/main/scala/ai/chronon/orchestration/RepoParser.scala (2)

20-25: Ensure resource closure on exception.
Consider a try-finally or using loanPattern for safer file reads.


43-50: Implementation is fine.
Could benefit from unit tests verifying partial/empty names.

orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1)

20-104: Method too large.
Consider splitting into smaller functions for clarity.

orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (1)

19-35: Processor logic is partial

Implement parse(...) or document why it's unimplemented.

orchestration/README.md (1)

1-290: Minor grammar & styling suggestions

• Use “e.g.” (with periods) and “etc.” in American English.
• “straightforward” is one word.
Consider small refinements to ensure clarity.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~14-~14: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...tic updates - that change output data - eg., logic within where-s, selects, aggrega...

(E_G)


[style] ~14-~14: In American English, abbreviations like “etc.” require a period.
Context: ...c within where-s, selects, aggregations etc - make non-semantic updates - that don'...

(ETC_PERIOD)


[uncategorized] ~15-~15: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...dates - that don't change output data - eg., spark exec memory, # of pods / workers...

(E_G)


[style] ~15-~15: In American English, abbreviations like “etc.” require a period.
Context: ... spark exec memory, # of pods / workers etc - *note that everything that is store...

(ETC_PERIOD)


[uncategorized] ~16-~16: Did you mean the plural noun “APIs” instead of the possessive?
Context: ... is stored in metadata field within our API's are non-semantic fields* - add new node...

(APOS_ARE)


[style] ~19-~19: Consider shortening or rephrasing this to strengthen your wording.
Context: ...add new nodes - delete existing nodes - make changes to several compute nodes at once - decide ...

(MAKE_CHANGES)


[uncategorized] ~36-~36: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...anges to node notated using a plus "+". Eg., Join J3 becomes J3+ Non-Semantic ...

(E_G)


[grammar] ~85-~85: It seems that a verb or a comma is missing.
Context: ...ory of the user is behind remote, > we will a lot more changes than the user intend...

(MD_DT_JJ)


[style] ~87-~87: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... mitigate this is to, only make the CLI only pick up changes to files listed as edit...

(ADVERB_REPETITION_PREMIUM)


[grammar] ~93-~93: This expression is usually spelled with a hyphen.
Context: ...e job is running is accounted for. ### Non Semantic updates Instead, if sq1 changes non-s...

(NON_ANTI_JJ)


[grammar] ~115-~115: The verb ‘compute’ does not usually follow articles like ‘the’. Check that ‘compute’ is spelled correctly; using ‘compute’ as a noun may be non-standard.
Context: ...s who haven't updated sq1 are causing the compute, we should use sq1 - if another branc...

(A_INFINITIVE)


[style] ~116-~116: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...ng sq1 non semantically to sq1** we need to use that instead. ### Adding new node...

(REP_NEED_TO_VB)


[uncategorized] ~141-~141: Possible missing article found.
Context: ...olor:black,stroke:#333 ``` But adding non-leaf node - as parent to existing node...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~142-~142: Possible missing article found.
Context: ...ut adding non-leaf node - as parent to existing node - would almost always cause semant...

(AI_HYDRA_LEO_MISSING_AN)


[uncategorized] ~167-~167: Possible missing article found.
Context: ...his is not a leaf node, absorbing it as same as a leaf node change would be the rig...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~191-~191: The adjective or adverb “straightforward” (= easy, precise) is spelled as one word.
Context: ... existing nodes Deleting leaf nodes is straight forward. We just need to program a cleanup mech...

(STRAIGHT_FORWARD_COMPOUND)


[formatting] ~234-~234: Consider inserting a comma after an introductory phrase for better readability.
Context: ... node - or the compilation would fail. In these cases it would be ideal to garbage collect up...

(IN_THAT_CASE_COMMA)


[typographical] ~260-~260: Consider inserting a comma for improved readability.
Context: ...e development on a branch is in progress we need to create temporary data assets fo...

(MISSING_COMMAS)

🪛 Markdownlint (0.37.0)

28-28: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


273-273: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 76d15ee and ad0c7c4.

📒 Files selected for processing (18)
  • orchestration/README.md (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/LineageIndex.scala (0 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/LogicalSet.scala (0 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/README.md (0 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoNodes.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoParser.scala (2 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/VersionUpdate.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/logical/GroupByNodeImpl.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/logical/JoinNodeImpl.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/logical/LogicalNodeImpl.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/logical/ModelNodeImpl.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/logical/StagingQueryNodeImpl.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/logical/TabularDataNodeImpl.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/utils/SequenceMap.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/utils/TablePrinter.scala (1 hunks)
  • orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (1 hunks)
  • orchestration/src/test/scala/ai/chronon/orchestration/test/SequenceMapSpec.scala (1 hunks)
💤 Files with no reviewable changes (3)
  • orchestration/src/main/scala/ai/chronon/orchestration/README.md
  • orchestration/src/main/scala/ai/chronon/orchestration/LineageIndex.scala
  • orchestration/src/main/scala/ai/chronon/orchestration/LogicalSet.scala
🧰 Additional context used
🪛 LanguageTool
orchestration/README.md

[uncategorized] ~14-~14: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...tic updates - that change output data - eg., logic within where-s, selects, aggrega...

(E_G)


[style] ~14-~14: In American English, abbreviations like “etc.” require a period.
Context: ...c within where-s, selects, aggregations etc - make non-semantic updates - that don'...

(ETC_PERIOD)


[uncategorized] ~15-~15: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...dates - that don't change output data - eg., spark exec memory, # of pods / workers...

(E_G)


[style] ~15-~15: In American English, abbreviations like “etc.” require a period.
Context: ... spark exec memory, # of pods / workers etc - *note that everything that is store...

(ETC_PERIOD)


[uncategorized] ~16-~16: Did you mean the plural noun “APIs” instead of the possessive?
Context: ... is stored in metadata field within our API's are non-semantic fields* - add new node...

(APOS_ARE)


[style] ~19-~19: Consider shortening or rephrasing this to strengthen your wording.
Context: ...add new nodes - delete existing nodes - make changes to several compute nodes at once - decide ...

(MAKE_CHANGES)


[uncategorized] ~36-~36: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...anges to node notated using a plus "+". Eg., Join J3 becomes J3+ Non-Semantic ...

(E_G)


[grammar] ~85-~85: It seems that a verb or a comma is missing.
Context: ...ory of the user is behind remote, > we will a lot more changes than the user intend...

(MD_DT_JJ)


[style] ~87-~87: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... mitigate this is to, only make the CLI only pick up changes to files listed as edit...

(ADVERB_REPETITION_PREMIUM)


[grammar] ~93-~93: This expression is usually spelled with a hyphen.
Context: ...e job is running is accounted for. ### Non Semantic updates Instead, if sq1 changes non-s...

(NON_ANTI_JJ)


[grammar] ~115-~115: The verb ‘compute’ does not usually follow articles like ‘the’. Check that ‘compute’ is spelled correctly; using ‘compute’ as a noun may be non-standard.
Context: ...s who haven't updated sq1 are causing the compute, we should use sq1 - if another branc...

(A_INFINITIVE)


[style] ~116-~116: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...ng sq1 non semantically to sq1** we need to use that instead. ### Adding new node...

(REP_NEED_TO_VB)


[uncategorized] ~141-~141: Possible missing article found.
Context: ...olor:black,stroke:#333 ``` But adding non-leaf node - as parent to existing node...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~142-~142: Possible missing article found.
Context: ...ut adding non-leaf node - as parent to existing node - would almost always cause semant...

(AI_HYDRA_LEO_MISSING_AN)


[uncategorized] ~167-~167: Possible missing article found.
Context: ...his is not a leaf node, absorbing it as same as a leaf node change would be the rig...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~191-~191: The adjective or adverb “straightforward” (= easy, precise) is spelled as one word.
Context: ... existing nodes Deleting leaf nodes is straight forward. We just need to program a cleanup mech...

(STRAIGHT_FORWARD_COMPOUND)


[formatting] ~234-~234: Consider inserting a comma after an introductory phrase for better readability.
Context: ... node - or the compilation would fail. In these cases it would be ideal to garbage collect up...

(IN_THAT_CASE_COMMA)


[typographical] ~260-~260: Consider inserting a comma for improved readability.
Context: ...e development on a branch is in progress we need to create temporary data assets fo...

(MISSING_COMMAS)

🪛 Markdownlint (0.37.0)
orchestration/README.md

28-28: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


273-273: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (36)
orchestration/src/main/scala/ai/chronon/orchestration/utils/SequenceMap.scala (4)

1-4: Looks good.


27-39: Insert logic is clear. Null checks are solid.


41-49: Containment check is straightforward.


61-70: Preconditions guard against invalid access well.

orchestration/src/test/scala/ai/chronon/orchestration/test/SequenceMapSpec.scala (1)

1-6: Test imports and setup look fine.

orchestration/src/main/scala/ai/chronon/orchestration/RepoParser.scala (4)

4-4: Looks good.


8-8: No issues.


27-40: Check error handling for invalid directories.
If compileRoot is missing or incorrect, consider handling it gracefully.


52-52: No changes needed.

orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (9)

3-9: Imports look fine.


10-10: New class definition is ok.


12-15: Evaluate concurrency.
If used in parallel, consider thread-safe structures.


16-19: Sequencer usage is straightforward.


106-122: Implementation is clear.
Check for potential performance issues with large incomingFileHashes.


124-129: Short and clean.


131-150: Good content-pruning logic.
Be mindful if data is reintroduced.


152-160: File-to-nodes mapping is neat.


164-184: RepoIndex object looks fine.
Helper methods are concise.

orchestration/src/main/scala/ai/chronon/orchestration/logical/LogicalNodeImpl.scala (1)

8-8: Consistent new requirement for naming.

orchestration/src/main/scala/ai/chronon/orchestration/logical/TabularDataNodeImpl.scala (1)

8-8: Prefix usage is consistent.

orchestration/src/main/scala/ai/chronon/orchestration/logical/ModelNodeImpl.scala (1)

12-12: Good prefix alignment with other nodes.

orchestration/src/main/scala/ai/chronon/orchestration/RepoNodes.scala (3)

7-10: Clean and clear data structure.


12-16: Simple composition.


18-24: Neat operator methods.

orchestration/src/main/scala/ai/chronon/orchestration/logical/StagingQueryNodeImpl.scala (1)

12-12: Consistent naming prefix.

orchestration/src/main/scala/ai/chronon/orchestration/logical/GroupByNodeImpl.scala (1)

12-12: Naming alignment looks good.

orchestration/src/main/scala/ai/chronon/orchestration/VersionUpdate.scala (3)

8-19: Concise representation.


23-36: Efficient set manipulation.


38-44: Clear printing logic.

orchestration/src/main/scala/ai/chronon/orchestration/logical/JoinNodeImpl.scala (1)

13-13: Consistent naming prefix is helpful

Works fine. Good consistent naming convention with the prefix "joins."

orchestration/src/main/scala/ai/chronon/orchestration/utils/TablePrinter.scala (3)

3-30: Looks solid & straightforward

Clear printing logic. Calculation of column widths and usage of format strings is concise.


32-35: Typed printer is handy

Encourages reuse with implicit mapping. Nicely done.


38-63: Sample usage

Nice demonstration of the table printer usage.

orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (3)

11-18: TestConf nicely structured

Your case class is simple and effective for test data.


39-52: Example driver is good

Demonstrates usage with minimal overhead.


54-116: Tests appear thorough

They comprehensively cover branch updates. Great coverage.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1)

50-78: Watch for deep recursion.
Consider cycle detection or limit depth.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between ad0c7c4 and 4beeee3.

📒 Files selected for processing (3)
  • build.sbt (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoParser.scala (1 hunks)
🔇 Additional comments (15)
orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (8)

11-34: Check concurrency.
Using mutable maps might cause issues if used in parallel. Otherwise looks fine.


36-45: Content retrieval logic is concise.
Looks good.


81-108: Versioning approach seems fine.
Integration with versionSequencer is clear.


110-125: Mismatch detection is straightforward.
Works as intended.


127-133: Branch pruning is fine.
Removes branch reference and triggers cleanup.


134-153: Content cleanup logic is neat.
Filters valid hashes across branches.


155-163: Integration of parse + addNodes is clean.
No obvious issues.


169-185: Helper classes and aliases are well-organized.
Implementation looks solid.

build.sbt (2)

402-403: Resource directory addition is correct.
No problem.


411-414: Dependency updates look fine.
Check that log4j config is safely managed.

orchestration/src/main/scala/ai/chronon/orchestration/RepoParser.scala (5)

3-3: Imports are fine.
No concerns.


10-10: Extending App is convenient.
Keeps it simple.


12-17: Resource closure is good.
Frees file resources properly.


19-32: Hash generation is clear.
Minimal overhead.


35-44: Maps names to contents well.
Reads files succinctly.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (6)
orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (1)

54-157: Add specific assertions for version updates

The test verifies basic functionality but lacks assertions for specific version update behaviors.

Add assertions to verify:

  • Version increments
  • Branch-specific version states
  • Revert behaviors
orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (5)

11-20: Consider thread safety implications

Multiple mutable collections without synchronization could lead to race conditions in concurrent scenarios.

Consider using synchronized collections or actor-based state management.


21-114: Split method into smaller functions

Method handles too many responsibilities. Consider extracting:

  • Node content validation
  • Hash computation
  • Version update logic

71-71: Consider using debug level for hash computation logs

Production logs should use appropriate log levels.

-      logger.info(s"codeString: $codeString")
+      logger.debug(s"codeString: $codeString")

117-132: Rename method for clarity

diff is too generic. Consider getMissingFiles or findUnindexedFiles.


134-160: Add branch existence validation

pruneBranch should verify branch exists before removal.

 def pruneBranch(branch: Branch): Unit = {
+  require(branchToFileHash.contains(branch), s"Branch $branch does not exist")
   branchToFileHash.remove(branch)
   pruneContents()
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 4beeee3 and 866e5c6.

📒 Files selected for processing (3)
  • orchestration/src/main/resources/log4j2.properties (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1 hunks)
  • orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • orchestration/src/main/resources/log4j2.properties
🔇 Additional comments (2)
orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (1)

68-82: LGTM!

Well-structured helper methods with good separation of concerns.

orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1)

174-194: Clean implementation of utility functions

Well-structured helper methods with clear type aliases.

)


override def parse(name: String, fileContent: FileContent): Seq[TestConf] = ???
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Implement parse method or mark as unsupported

The unimplemented parse method should either be implemented or throw UnsupportedOperationException with a clear message.

-    override def parse(name: String, fileContent: FileContent): Seq[TestConf] = ???
+    override def parse(name: String, fileContent: FileContent): Seq[TestConf] = 
+      throw new UnsupportedOperationException("Parse method not needed for tests")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
override def parse(name: String, fileContent: FileContent): Seq[TestConf] = ???
override def parse(name: String, fileContent: FileContent): Seq[TestConf] =
throw new UnsupportedOperationException("Parse method not needed for tests")

Comment on lines 162 to 170
def addFiles(fileHashes: mutable.Map[Name, FileHash], updatedFiles: Map[String, String], branch: Branch): Unit = {

val nodes: Seq[T] = updatedFiles.iterator.flatMap {
case (name, content) =>
proc.parse(name, FileContent(content))
}.distinct

addNodes(fileHashes, nodes, branch)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle parse failures explicitly

Add error handling for failed node parsing.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (8)
orchestration/src/main/scala/ai/chronon/orchestration/RepoTypes.scala (3)

15-16: LocalData structure.
Clear fields. Consider unit tests.


23-25: FileContent hash.
Consider cryptographic hash for collisions.


29-32: ConfProcessor trait.
Well-defined. Test coverage recommended.

api/py/ai/chronon/staging_query.py (5)

1-4: Add minimal docstring.

A concise docstring in this file would help clarify the module's intent and usage.


10-17: Prevent confusion on parameter ordering.

These assertions are good, but consider a more explicit error message to guide future maintainers if the constructor signature shifts.


18-24: Cross-check parameter overlap at runtime.

Raising a clear error on intersected params is crucial; consider also logging which param triggered it for more clarity.


26-35: Safeguard against large custom JSON.

Storing all leftover args in customJson might cause unwieldy payloads. Consider bounding or logging if it's large.


42-42: Use a new name to avoid re-shadowing.

Reassigning StagingQuery can be confusing if other modules expect the original. A new alias might be clearer.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 866e5c6 and 8f1145d.

📒 Files selected for processing (2)
  • api/py/ai/chronon/staging_query.py (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoTypes.scala (1 hunks)
🔇 Additional comments (11)
orchestration/src/main/scala/ai/chronon/orchestration/RepoTypes.scala (10)

1-2: Clear package definition.
Looks good.


3-4: Concise object declaration.
No issues found.


5-6: Simple wrapper class.
Works as intended.


7-8: Branch class for naming.
Seems fine.


9-10: Hash class usage.
Straightforward.


11-12: LocalHash approach.
Consistent with FileHash.


13-14: GlobalHash representation.
Looks correct.


17-19: Branch singleton 'main'.
Good default reference.


21-22: Version class.
No concerns.


27-28: Table class.
No issues.

api/py/ai/chronon/staging_query.py (1)

9-9: Neat pattern for dynamic dispatch.

The _metadata_shim approach is clever and flexible.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (2)

13-17: Add scaladoc for TestConf

Document the purpose and fields of this test configuration class.

+/**
+ * Test configuration case class representing a node in the repository.
+ *
+ * @param name Node identifier
+ * @param queryVersion Version of the query
+ * @param params Configuration parameters
+ * @param parents Parent node identifiers
+ * @param outputs Output node identifiers
+ */
 case class TestConf(name: String,

36-139: Break down the test case for better maintainability

Large test case should be split into smaller, focused test cases for each scenario:

  1. Initial setup
  2. Branch creation
  3. Version updates
  4. Reverts
  5. Node deletions
  6. New node additions
+"RepoIndex" should {
+  // Setup code here...
+
+  it should "handle initial node setup" in {
+    // Test initial setup
+  }
+
+  it should "handle version updates in branch" in {
+    // Test version updates
+  }
+
+  // More test cases...
+}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 8f1145d and 2fa86a4.

📒 Files selected for processing (1)
  • orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (1 hunks)
🔇 Additional comments (2)
orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (2)

1-11: LGTM!

Clean imports and proper test class setup.


33-33: Implement parse method or mark as unsupported

The unimplemented parse method needs attention.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1)

34-81: Optimize hash computation

Long method with nested logic. Consider extracting parent hash computation.

- def computeGlobalHash(name: Name): GlobalHash = {
+ private def computeParentHashes(parents: Seq[Name]): String = {
+   parents.map { parent =>
+     val parentHash = computeGlobalHash(parent).hash
+     s"${parent.name}:$parentHash"
+   }.mkString(",")
+ }
+
+ def computeGlobalHash(name: Name): GlobalHash = {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 2fa86a4 and a77ee24.

📒 Files selected for processing (6)
  • api/py/ai/chronon/utils.py (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoTypes.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/VersionUpdate.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/utils/StringExtensions.scala (1 hunks)
  • orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • api/py/ai/chronon/utils.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala
  • orchestration/src/main/scala/ai/chronon/orchestration/VersionUpdate.scala
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoTypes.scala
🔇 Additional comments (3)
orchestration/src/main/scala/ai/chronon/orchestration/utils/StringExtensions.scala (1)

14-21: 8-char truncation significantly reduces collision resistance

Taking only first 8 chars of hash increases collision probability.

Run this to check for potential collisions:

✅ Verification successful

Truncated MD5 usage is appropriate for current implementation

The 8-char MD5 truncation is used for caching and identification purposes, not for security-critical operations. Current usage patterns in the codebase show this is an acceptable trade-off between uniqueness and brevity.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Find potential hash collisions in codebase
rg -l 'md5' | xargs cat | grep -o '"[^"]*"' | sort | uniq -d

Length of output: 415


Script:

#!/bin/bash
# Find actual usage of md5 function
ast-grep --pattern 'md5'

# Also search for direct string references
rg -B 2 -A 2 '\.md5'

Length of output: 7922

orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (2)

14-20: Document mutable state management

Multiple mutable maps without thread safety controls. Consider documenting thread-safety requirements.


166-174: Handle parse failures explicitly

Previous review comment still applies. Add error handling for failed node parsing.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1)

284-284: 🛠️ Refactor suggestion

Avoid using null

Replace null with Option[T] for better type safety.

Apply this diff:

- val artifactContent = NodeContent[T](artifactData, null)
+ val artifactContent = NodeContent[T](artifactData, None)
🧹 Nitpick comments (3)
orchestration/src/main/scala/ai/chronon/orchestration/VersionUpdate.scala (1)

21-33: Consider filtering unchanged entries

In the join method, you might filter out VersionUpdates where no changes occurred to focus on relevant updates.

orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (2)

133-133: Typo in log message

Avoid repeated 'into' in the log message.

Apply this diff:

- logger.info("Not a dry run! Inserting new nodes into the index into branch: " + branch.name)
+ logger.info("Not a dry run! Inserting new nodes into the index for branch: " + branch.name)

164-172: Use Name type for consistency

In addFiles, change updatedFiles keys to Name for consistency.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between a77ee24 and 1f14c98.

📒 Files selected for processing (4)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/VersionUpdate.scala (1 hunks)
  • orchestration/src/main/scala/ai/chronon/orchestration/utils/SequenceMap.scala (1 hunks)
  • orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala
🔇 Additional comments (1)
orchestration/src/main/scala/ai/chronon/orchestration/utils/SequenceMap.scala (1)

14-79: Looks good

The SequenceMap implementation is clear and effective.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
build.sbt (1)

404-405: Consolidate redundant main class configurations.

Both configurations point to the same class. Consider using a variable:

+  lazy val mainClassName = "ai.chronon.orchestration.RepoParser"
-  assembly / mainClass := Some("ai.chronon.orchestration.RepoParser"),
-  Compile / run / mainClass := Some("ai.chronon.orchestration.RepoParser"),
+  assembly / mainClass := Some(mainClassName),
+  Compile / run / mainClass := Some(mainClassName),
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 1f14c98 and 35fe7ea.

📒 Files selected for processing (1)
  • build.sbt (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: other_spark_tests
  • GitHub Check: mutation_spark_tests
  • GitHub Check: join_spark_tests
  • GitHub Check: fetcher_spark_tests
  • GitHub Check: no_spark_scala_tests
🔇 Additional comments (2)
build.sbt (2)

406-406: LGTM!

Standard configuration for including resources.


409-409: Verify log4j-api-scala version.

Version "13.1.0" seems unusually high. The latest version is 2.20.0.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1)

61-164: Consider breaking down addNodes for better maintainability

The method is complex with nested functions. Consider extracting computeGlobalHash and version computation logic into separate methods.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 35fe7ea and ce411b4.

📒 Files selected for processing (2)
  • orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (1 hunks)
  • orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: other_spark_tests
  • GitHub Check: mutation_spark_tests
  • GitHub Check: join_spark_tests
  • GitHub Check: table_utils_delta_format_spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: fetcher_spark_tests
  • GitHub Check: no_spark_scala_tests
🔇 Additional comments (10)
orchestration/src/test/scala/ai/chronon/orchestration/test/RepoIndexSpec.scala (6)

40-40: Implement parse method or mark as unsupported

-    override def parse(name: String, fileContent: FileContent): Seq[TestConf] = ???
+    override def parse(name: String, fileContent: FileContent): Seq[TestConf] = 
+      throw new UnsupportedOperationException("Parse method not needed for tests")

58-85: LGTM! Well-structured helper method


87-111: LGTM! Comprehensive test setup


113-143: LGTM! Thorough test coverage for branch updates


144-154: LGTM! Good coverage of deletion scenarios


156-180: LGTM! Comprehensive test coverage

orchestration/src/main/scala/ai/chronon/orchestration/RepoIndex.scala (4)

12-52: LGTM! Well-documented class structure


287-290: Avoid null in Scala

-        val artifactContent = NodeContent[T](artifactData, null)
+        val artifactContent = NodeContent[T](artifactData, None)

185-190: Prevent exception when fileHashMap is None

-          lazy val fileHashAbsentForName = !fileHashMap.get.contains(incomingHash)
+          lazy val fileHashAbsentForName = !fileHashMap.exists(_.contains(incomingHash))

200-231: LGTM! Clean implementation of pruning logic

@nikhil-zlai nikhil-zlai changed the title [WIP] Branch versioning logic Branch versioning logic Jan 9, 2025
@nikhil-zlai nikhil-zlai enabled auto-merge (squash) January 9, 2025 05:22
Copy link
Contributor

@piyush-zlai piyush-zlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largely looks good to me. Dropped a few questions to clarify stuff and some comments. As Varant was also going to look as he's directly integrating immediately, I'll leave to him to stamp once he's gone through.

}

/**
* Node content represents the actual data that is stored in the index.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment seems off right? Should be for the NodeContent case class below

@@ -262,6 +262,7 @@ def join_part_output_table_name(join, jp, full_name: bool = False):
def partOutputTable(jp: JoinPart): String = (Seq(join.metaData.outputTable) ++ Option(jp.prefix) :+
jp.groupBy.metaData.cleanName).mkString("_")
"""
print(join)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yank?

libraryDependencies ++= Seq(
"org.apache.logging.log4j" %% "log4j-api-scala" % "13.1.0",
"org.apache.logging.log4j" % "log4j-core" % "2.20.0",
// "org.slf4j" % "slf4j-api" % slf4jApiVersion,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?


## Scenarios within experimentation

While developing on a branch - users could
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A q - wdyt about supporting environments (e.g. qa / preprod / ..). In my exp at Stripe and other places you might have a branch say named master but the datasets and topics in 'qa' would differ from 'prod'. Would we model envs differently or map them to branches?
(Can discuss this async as well - thought of it so wanted to flag)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am modeling this more along the lines of sqlmesh - one prod data environment in which branches operate.

ml needs full scale parallel end-points so i was treating that as prio. Env's can be added on later also.

One approach to mitigate this is to, only make the CLI only pick up changes to files listed as edited by
commits to the git branch.

Another approach is to force user to rebase on any change to the repo. However, this does not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this might be the better / safer option if we're trying to reuse etc. A super outdated user repo and prod might lead us to funky / hard to debug issues I fear..

Copy link
Contributor Author

@nikhil-zlai nikhil-zlai Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versioning should prevent that.

I am afraid that forcing a rebase would force progress to be lost - expensive backfills, model training runs etc


### Merging changes into `main`

- Deletes should actually trigger asset and pipeline clean up.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of these we might want to consider 'soft' deletes (mark for hard delete n days down the line) to guard against users causing prod outages (this is not strictly our problem but it's a nice guard rail)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree!

*
* NOTE: This class is not thread safe. Only one thread should access this index at a time.
*/
class RepoIndex[T >: Null](proc: ConfProcessor[T]) extends Logging {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious is the plan to also persist addFiles invocations to the KV store or will we add a different api to sync the contents of the repo index to the kv store when we've added all the relevant files to a given branch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am hoping temporal will magically make persistence work. I might be wrong.

*/
private val versionSequencer: SequenceMap[Name, GlobalHash] = new SequenceMap[Name, GlobalHash]

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing desc of method?

case (name, incomingHash) =>
val fileHashMap = fileHashToContent.get(name)

lazy val nameAbsentInIndex = fileHashMap.isEmpty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these don't need to be lazy right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true will revert


import java.io.File
import scala.util.Failure
import scala.util.Success

// parses a folder of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incomplete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change.

@nikhil-zlai nikhil-zlai merged commit bcc7d65 into main Jan 10, 2025
11 checks passed
@nikhil-zlai nikhil-zlai deleted the orch-2 branch January 10, 2025 03:30
kumar-zlai pushed a commit that referenced this pull request Apr 25, 2025
## Summary

Writing down the cases & logic behind branching. Will continue writing
about how these are internally represented.

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [x] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

- **Bug Fixes**
- Corrected a typo in `TableDependency` struct, renaming `forceComputae`
to `forceCompute`.

- **New Features**
    - Added support for branch-based workflows in data orchestration.
    - Introduced `RepoIndex` for managing repository data and versions.
    - Created `SequenceMap` utility for managing unique value sequences.
    - Added `TablePrinter` for formatted data output.
    - Introduced `VersionUpdate` class for tracking version changes.
    - Added `StringExtensions` for MD5 hashing of strings.

- **Refactoring**
    - Removed `LineageIndex` and `LogicalSet` classes.
    - Updated naming conventions for logical nodes.
    - Restructured repository parsing and version tracking mechanisms.

- **Documentation**
    - Added comprehensive README for branch workflow management.
- Expanded documentation for global planning and batch workload
processing.

- **Chores**
    - Updated logging dependencies.
    - Modified build configuration.
    - Added new test specifications.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kumar-zlai pushed a commit that referenced this pull request Apr 29, 2025
## Summary

Writing down the cases & logic behind branching. Will continue writing
about how these are internally represented.

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [x] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

- **Bug Fixes**
- Corrected a typo in `TableDependency` struct, renaming `forceComputae`
to `forceCompute`.

- **New Features**
    - Added support for branch-based workflows in data orchestration.
    - Introduced `RepoIndex` for managing repository data and versions.
    - Created `SequenceMap` utility for managing unique value sequences.
    - Added `TablePrinter` for formatted data output.
    - Introduced `VersionUpdate` class for tracking version changes.
    - Added `StringExtensions` for MD5 hashing of strings.

- **Refactoring**
    - Removed `LineageIndex` and `LogicalSet` classes.
    - Updated naming conventions for logical nodes.
    - Restructured repository parsing and version tracking mechanisms.

- **Documentation**
    - Added comprehensive README for branch workflow management.
- Expanded documentation for global planning and batch workload
processing.

- **Chores**
    - Updated logging dependencies.
    - Modified build configuration.
    - Added new test specifications.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
## Summary

Writing down the cases & logic behind branching. Will continue writing
about how these are internally represented.

## Checklist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [x] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

- **Bug Fixes**
- Corrected a typo in `TableDependency` struct, renaming `forceComputae`
to `forceCompute`.

- **New Features**
    - Added support for branch-based workflows in data orchestration.
    - Introduced `RepoIndex` for managing repository data and versions.
    - Created `SequenceMap` utility for managing unique value sequences.
    - Added `TablePrinter` for formatted data output.
    - Introduced `VersionUpdate` class for tracking version changes.
    - Added `StringExtensions` for MD5 hashing of strings.

- **Refactoring**
    - Removed `LineageIndex` and `LogicalSet` classes.
    - Updated naming conventions for logical nodes.
    - Restructured repository parsing and version tracking mechanisms.

- **Documentation**
    - Added comprehensive README for branch workflow management.
- Expanded documentation for global planning and batch workload
processing.

- **Chores**
    - Updated logging dependencies.
    - Modified build configuration.
    - Added new test specifications.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
chewy-zlai pushed a commit that referenced this pull request May 16, 2025
## Summary

Writing down the cases & logic behind branching. Will continue writing
about how these are internally represented.

## Cheour clientslist
- [x] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [x] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

- **Bug Fixes**
- Corrected a typo in `TableDependency` struct, renaming `forceComputae`
to `forceCompute`.

- **New Features**
    - Added support for branch-based workflows in data orchestration.
    - Introduced `RepoIndex` for managing repository data and versions.
    - Created `SequenceMap` utility for managing unique value sequences.
    - Added `TablePrinter` for formatted data output.
    - Introduced `VersionUpdate` class for traour clientsing version changes.
    - Added `StringExtensions` for MD5 hashing of strings.

- **Refactoring**
    - Removed `LineageIndex` and `LogicalSet` classes.
    - Updated naming conventions for logical nodes.
    - Restructured repository parsing and version traour clientsing mechanisms.

- **Documentation**
    - Added comprehensive README for branch workflow management.
- Expanded documentation for global planning and batch workload
processing.

- **Chores**
    - Updated logging dependencies.
    - Modified build configuration.
    - Added new test specifications.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants