Connect drift metrics computation in Spark with Hub for serving to frontend #95
Conversation
Walkthrough

The changes in this pull request primarily focus on enhancing dependency management and restructuring various components within the application. Key modifications include the introduction of a new variable for Circe libraries.

Changes

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant JoinController
    participant MonitoringModelStore
    User->>JoinController: GET /api/v1/joins
    JoinController->>MonitoringModelStore: list(offset, limit)
    MonitoringModelStore-->>JoinController: List of joins
    JoinController-->>User: JSON response with joins
```
Possibly related PRs
Suggested reviewers
Poem
Actionable comments posted: 22
🧹 Outside diff range and nitpick comments (56)
hub/app/controllers/Paginate.scala (1)
8-10: Consider adding input validation and documentation.

The method could benefit from:
- Input validation for negative offset/limit values
- Documentation for the type parameter and parameters
- Handling of edge cases when offset exceeds the sequence length
Consider this improvement:
```diff
+ /** Paginates a sequence of elements
+   * @tparam T the type of elements in the sequence
+   * @param results the sequence to paginate
+   * @param offset the starting position (must be non-negative)
+   * @param limit the maximum number of elements to return (must be positive)
+   * @return a subsequence of elements based on the pagination parameters
+   */
  def paginateResults[T](results: Seq[T], offset: Int, limit: Int): Seq[T] = {
+   require(offset >= 0, "offset must be non-negative")
+   require(limit > 0, "limit must be positive")
+   require(limit <= maxLimit, s"limit must not exceed $maxLimit")
    results.slice(offset, offset + limit)
  }
```

hub/app/module/ModelStoreModule.scala (3)
13-13: Update variable name to reflect current architecture.

The variable name `dynamoDBMonitoringStore` is outdated since we're moving away from DynamoDB to an in-memory store.

```diff
- val dynamoDBMonitoringStore = new MonitoringModelStore(awsApiImpl)
+ val monitoringStore = new MonitoringModelStore(awsApiImpl)
```
9-16: Architectural concerns with AWS dependency.

According to the PR objectives, we're moving away from DynamoDB to an InMemoryKV store. However, this module still creates an AWS dependency. This seems inconsistent with the stated architectural direction.
Consider:
- Creating a separate module for in-memory store configuration
- Using an interface/trait to abstract the storage implementation
- Providing a configuration flag to switch between storage implementations
Example approach:
```scala
trait StorageBackend
class InMemoryStorage extends StorageBackend
class AwsStorage(awsApi: AwsApiImpl) extends StorageBackend

class ModelStoreModule(environment: Environment, configuration: Configuration) extends AbstractModule {
  override def configure(): Unit = {
    val storageType = configuration.get[String]("storage.type")
    val backend = storageType match {
      case "memory" => new InMemoryStorage()
      case "aws"    => new AwsStorage(new AwsApiImpl(configuration.get[Configuration]("aws").underlying))
      case _        => throw new IllegalArgumentException(s"Unsupported storage type: $storageType")
    }
    bind(classOf[StorageBackend]).toInstance(backend)
    bind(classOf[MonitoringModelStore]).toInstance(new MonitoringModelStore(backend))
  }
}
```
9-9: Unused constructor parameter.

The `environment` parameter is not used in the module. Consider removing it if it's not needed.

```diff
- class ModelStoreModule(environment: Environment, configuration: Configuration)
+ class ModelStoreModule(configuration: Configuration)
```

hub/app/module/DriftStoreModule.scala (3)
10-16: Document lifecycle implications of eager singleton.

While using `asEagerSingleton()` is appropriate for in-memory storage, please add documentation explaining:
- Why eager initialization is necessary
- Memory implications
- Startup time considerations
Consider adding a comment like:
```diff
 override def configure(): Unit = {
   // TODO swap to concrete api impl in a follow up
+  // Using eager singleton to ensure immediate initialization of the in-memory store
+  // and prevent concurrent initialization issues during requests
   bind(classOf[DriftStore]).toProvider(classOf[DriftStoreProvider]).asEagerSingleton()
 }
```
13-13: Create a tracking issue for the TODO.

The TODO comment indicates a pending task to implement a concrete API.

Would you like me to create a GitHub issue to track this TODO for the concrete API implementation?
18-22: Enhance provider robustness with validation and logging.

Consider adding validation and logging to make the provider more robust and debuggable.

```diff
 class DriftStoreProvider @Inject() (kvStore: KVStore) extends Provider[DriftStore] {
+  private val logger = play.api.Logger(getClass)
+
   override def get(): DriftStore = {
+    logger.info("Initializing DriftStore with KVStore")
+    require(kvStore != null, "KVStore cannot be null")
     new DriftStore(kvStore)
   }
 }
```

docker-init/demo/README.md (1)
1-9: Add prerequisites and environment setup details.

While the Docker launch instructions are clear, the documentation would benefit from additional context:
- Required prerequisites (Docker, minimum versions)
- Expected initial setup (repository clone, environment variables if any)
- System requirements (minimum memory, disk space)
Consider adding a prerequisites section:
```diff
+# Prerequisites
+
+- Docker Engine 20.10.0 or higher
+- Docker Compose v2.0.0 or higher
+- Minimum 16GB RAM recommended
+- At least 10GB free disk space
+
 To populate the observability demo data:
```

hub/conf/application.conf (1)
32-32: Consider adding configuration for in-memory store limits and persistence strategy.

While using an in-memory KV store addresses the immediate performance concerns, consider:
- Memory consumption limits to prevent OOM issues
- Data persistence strategy across service restarts
- Scalability implications in a production environment
Consider adding configuration parameters for:
```diff
+# In-memory store configuration
+inmemory.store {
+  # Maximum memory allocation (e.g., 75% of available heap)
+  max.memory = "75%"
+  # Persistence strategy (none, file, redis)
+  persistence.strategy = "none"
+  # Backup interval if persistence enabled
+  backup.interval = "15m"
+}
```

hub/app/controllers/ModelController.scala (1)
7-7: Architectural change: transition to in-memory storage.

The switch to `MonitoringModelStore` aligns with the PR objective to use in-memory storage for better performance. However, we should consider the data persistence implications. Consider:
- Implementing a persistence strategy for the in-memory store
- Adding monitoring for memory usage (see the sketch after this list)
- Documenting the maximum data size limitations
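A minimal sketch of the memory-monitoring point (the helper name and 80% threshold are hypothetical, not part of this PR):

```scala
import org.slf4j.LoggerFactory

// Hypothetical helper: logs JVM heap usage so operators can spot when the in-memory
// store is approaching its ceiling (the 80% warning threshold is illustrative).
object HeapUsageMonitor {
  private val logger = LoggerFactory.getLogger(getClass)

  def logHeapUsage(warnThreshold: Double = 0.8): Unit = {
    val runtime   = Runtime.getRuntime
    val usedBytes = runtime.totalMemory() - runtime.freeMemory()
    val maxBytes  = runtime.maxMemory()
    val ratio     = usedBytes.toDouble / maxBytes
    if (ratio > warnThreshold)
      logger.warn(f"In-memory store heap usage at ${ratio * 100}%.1f%% ($usedBytes of $maxBytes bytes)")
    else
      logger.info(f"Heap usage: ${ratio * 100}%.1f%%")
  }
}
```

A check like this could run on a schedule or after each bulk write; it only surfaces the problem, so documenting a hard data-size limit is still worthwhile.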
hub/app/controllers/JoinController.scala (3)
3-8: Consider adding error handling and logging imports.

The controller might benefit from explicit error handling and logging capabilities for better observability and debugging.

Consider adding:

```diff
+import play.api.Logger
+import scala.util.{Try, Success, Failure}
```
11-13: Enhance documentation for Join entities.

The documentation could be more descriptive about what Join entities represent and their significance in the context of drift metrics.

Consider expanding the documentation:

```diff
- * Controller for the Zipline Join entities
+ * Controller for managing Zipline Join entities, which represent the relationships
+ * between different data sources used in drift metric computations.
+ *
+ * @see Related documentation about drift metrics and data joining strategy
```
24-40: Consider implementing caching for performance.

Given that this endpoint serves drift metrics data, which might be computationally expensive, consider implementing caching.
You could:
- Use Play's built-in caching mechanism
- Implement HTTP caching headers
- Consider implementing ETags for efficient caching
Example implementation:
```scala
def list(offset: Option[Int], limit: Option[Int]): Action[AnyContent] = {
  Action.async { implicit request =>
    // Generate cache key based on pagination parameters
    val cacheKey = s"joins_${offset}_${limit}"
    cache.getOrElseUpdate(cacheKey, duration = 5.minutes) {
      // Existing implementation wrapped in Future
      Future.successful {
        // ... existing code ...
      }
    }
  }
}
```

hub/app/controllers/SearchController.scala (2)
20-21: Enhance API documentation for clarity.

While the documentation updates correctly reflect the transition to joins, consider adding:
- Search behavior details (e.g., case sensitivity, partial matches)
- Example search terms
- Response format description
```diff
 /**
- * Powers the /api/v1/search endpoint. Returns a list of joins
- * @param term - Search term to search for (currently we only support searching join names)
+ * Powers the /api/v1/search endpoint. Returns a list of joins matching the search criteria
+ * @param term - Search term to search for (case-sensitive partial match on join names)
+ * @param offset - For pagination. We skip over offset entries before returning results (default: 0)
+ * @param limit - Number of elements to return (default: 10, max: configured limit)
+ * @return JSON response containing matched joins with format: {"offset": number, "joins": Join[]}
  */
```
43-46: Consider performance and reliability improvements.

While the current implementation is functional and aligns with the learning note about future enhancements, consider:

- Performance: The current string matching (`contains`) might not scale well with large datasets
- Reliability: Add error handling for potential store failures
```diff
 private def searchRegistry(term: String): Seq[Join] = {
-  val joins = monitoringStore.getJoins
-  joins.filter(j => j.name.contains(term))
+  try {
+    val joins = monitoringStore.getJoins
+    // TODO: Consider adding index-based search for better performance
+    joins.filter(j => j.name.toLowerCase.contains(term.toLowerCase))
+  } catch {
+    case e: Exception =>
+      logger.error(s"Failed to search joins: ${e.getMessage}", e)
+      Seq.empty
+  }
 }
```

docker-init/Dockerfile (4)
46-46: Consider optimizing Spark binary copying.

Instead of copying all Spark binaries, consider copying only the essential ones needed for your specific use case (e.g., `spark-submit`, `spark-shell`). This would reduce the image size.

```diff
-COPY --from=spark-source /opt/bitnami/spark/bin /opt/spark/bin
+COPY --from=spark-source /opt/bitnami/spark/bin/spark-submit /opt/spark/bin/
+COPY --from=spark-source /opt/bitnami/spark/bin/spark-shell /opt/spark/bin/
```
Line range hint 1-2: Pin the base image version for reproducibility.

Using the `latest` tag for the base image can lead to inconsistent builds over time. Consider pinning to a specific version.

```diff
-FROM base_image:latest AS build_env
+FROM base_image:1.x.x AS build_env  # Replace with specific version
```
Line range hint 1-58: Consider adding Spark resource configuration.

Given that this Dockerfile supports drift metrics computation in Spark and bulk HTTP calls, consider adding environment variables for Spark resource configuration (memory, cores) to optimize performance for your specific workload.
Example configuration to add:
```diff
+# Set default Spark configuration
+ENV SPARK_DRIVER_MEMORY=4g
+ENV SPARK_EXECUTOR_MEMORY=4g
+ENV SPARK_WORKER_CORES=2
```
Update NumPy and Python versions for enhanced security
Based on the verification results:
- NumPy 1.22.0 is no longer the most secure version, with 2.1.3 being the latest stable release
- All known security vulnerabilities are patched in versions newer than 1.22.0
- Python 3.8 is not required for newer NumPy versions and is approaching end-of-life
Recommended changes:
- Upgrade to Python 3.11+ which offers better security features and longer support
- Update NumPy to the latest stable version (2.1.3)
- Update the corresponding dependencies in requirements.txt
🔗 Analysis chain
Line range hint 11-15: Verify if Python/NumPy versions are still optimal for security.

While Python 3.8 was chosen for numpy 1.22.0 compatibility due to security concerns, please verify if:
- There are newer Python versions that support secure numpy versions
- numpy 1.22.0 is still the recommended version for security
Let me verify Python version compatibility with newer NumPy versions to provide a complete assessment.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check for latest numpy versions and their Python compatibility
curl -s https://pypi.org/pypi/numpy/json | jq -r '.releases | keys[]' | sort -V | tail -n 5

# Check for any security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "numpy") {
    nodes {
      advisory { summary severity publishedAt }
      vulnerableVersionRange
      firstPatchedVersion { identifier }
    }
  }
}'
```

Length of output: 1457
Script:
```bash
#!/bin/bash
# Check Python version in Dockerfile
rg -U "FROM python:|RUN.*python" docker-init/Dockerfile

# Check numpy version in requirements
fd requirements.txt --exec cat {} \; || fd setup.py --exec cat {} \;
```

Length of output: 565
online/src/main/scala/ai/chronon/online/HTTPKVStore.scala (2)
10-11: Remove or clarify the "hacky" comment.

The comment suggests this is a temporary solution, but the code is being integrated into the main workflow. Either remove the comment if this is intended to be a permanent solution, or document the planned replacement approach.
52-56: Consider using a JSON library for safer serialization.

Manual JSON string construction is error-prone and potentially unsafe. Consider using a lightweight JSON library that's compatible with Spark serialization.
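A minimal sketch of that approach, assuming circe (which this PR already pulls in elsewhere); the payload shape and names below are hypothetical, not the store's actual request type:

```scala
import io.circe.generic.auto._
import io.circe.syntax._

// Hypothetical payload shape; the real HTTPKVStore request type may differ.
case class BulkPutPayload(dataset: String, keyBase64: String, valueBase64: String, tsMillis: Long)

object JsonSerializationSketch {
  // circe handles quoting and escaping, so no JSON is assembled by hand
  def toJson(rows: Seq[BulkPutPayload]): String = rows.asJson.noSpaces

  def main(args: Array[String]): Unit = {
    val rows = Seq(BulkPutPayload("drift_summaries", "a2V5", "dmFsdWU=", 1700000000000L))
    println(toJson(rows))
  }
}
```

Whichever library is used, the encoder instances should be serializable (or created lazily on the executors) so the store still works when shipped inside Spark tasks.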
hub/app/model/Model.scala (3)
8-13: Convert implementation notes into proper documentation.

These numbered comments appear to outline a planned feature pipeline for metadata and tile processing. Consider converting these implementation notes into proper ScalaDoc documentation that explains the overall workflow, each step's purpose, and their relationships.
Example structure:
```scala
/**
 * Metadata Processing Pipeline
 *
 * The system processes metadata and tiles through the following stages:
 * 1. Metadata Upload: Processes join to mapping of group-by columns
 * 2. Configuration Fetching: Retrieves join configuration and column listings
 * ...
 */
```
59-59: Update or remove outdated comment.

The comment "Currently search only covers joins" seems outdated given the introduction of both Search and List responses for joins.
59-62: Consider consolidating duplicate response models.

`SearchJoinResponse` and `ListJoinResponse` have identical structures. Consider:

- Using a single response type if they serve the same purpose
- Adding documentation to clarify the distinction if they serve different purposes
- Using a type parameter for consistency: `PagedResponse[T](offset: Int, items: Seq[T])`
Example refactor:
```scala
case class PagedResponse[T](offset: Int, items: Seq[T])

type ListModelResponse  = PagedResponse[Model]
type ListJoinResponse   = PagedResponse[Join]
type SearchJoinResponse = PagedResponse[Join]
```

hub/test/controllers/JoinControllerSpec.scala (5)
1-18: Consider organizing imports for better maintainability.

The test class setup looks good, but consider grouping related imports together (e.g., Play Framework, Circe, testing utilities) for better readability.
```diff
 package controllers

-import controllers.MockJoinService.mockJoinRegistry
-import io.circe._
-import io.circe.generic.auto._
-import io.circe.parser._
-import model.ListJoinResponse
-import org.mockito.Mockito._
-import org.scalatest.EitherValues
-import org.scalatestplus.play._
-import play.api.http.Status.BAD_REQUEST
-import play.api.http.Status.OK
-import play.api.mvc._
-import play.api.test.Helpers._
-import play.api.test._
-import store.MonitoringModelStore
+// Play Framework
+import play.api.mvc._
+import play.api.test._
+import play.api.test.Helpers._
+import play.api.http.Status.{BAD_REQUEST, OK}
+
+// Circe
+import io.circe._
+import io.circe.generic.auto._
+import io.circe.parser._
+
+// Testing
+import org.scalatest.EitherValues
+import org.scalatestplus.play._
+import org.mockito.Mockito._
+
+// Application
+import controllers.MockJoinService.mockJoinRegistry
+import model.ListJoinResponse
+import store.MonitoringModelStore
```
20-26: Update comment to reflect current implementation.

The comment mentions "DynDB store" but the class is now called `MonitoringModelStore`. Update the comment to maintain consistency with the current implementation.

```diff
- // Create a mocked DynDB store
+ // Create a mocked monitoring model store
  val mockedStore: MonitoringModelStore = mock(classOf[MonitoringModelStore])
```
27-64: Enhance test coverage with additional scenarios.

While the current tests cover basic functionality, consider adding:
- Edge cases:
- Empty result set
- Maximum limit scenarios
- Boundary conditions for offset/limit
- Verification of mock interactions
- Validation of total count in pagination response
Example implementation:
"handle empty results correctly" in { when(mockedStore.getJoins).thenReturn(Seq.empty) val result = controller.list(None, None).apply(FakeRequest()) status(result) mustBe OK val response = decode[ListJoinResponse](contentAsString(result)).right.value response.items mustBe empty } "respect maximum limit" in { when(mockedStore.getJoins).thenReturn(mockJoinRegistry) val result = controller.list(None, Some(1000)).apply(FakeRequest()) status(result) mustBe OK val response = decode[ListJoinResponse](contentAsString(result)).right.value response.items.length mustBe controller.maxLimit // Assuming maxLimit is defined }
54-56: Extract magic numbers into named constants.

The test uses magic numbers (25, 20) for offset and limit. Consider extracting these into named constants for better maintainability and clarity.
```diff
+ private val TestOffset = 25
+ private val TestLimit = 20
+
  "send results in a paginated fashion correctly" in {
    when(mockedStore.getJoins).thenReturn(mockJoinRegistry)
-   val startOffset = 25
-   val number = 20
+   val startOffset = TestOffset
+   val number = TestLimit
```
40-41: Verify mock interactions.

The tests set up mock expectations but don't verify the number of times the mock was called. Add verification to ensure the mock is used as expected.
"send valid results on a correctly formed request" in { when(mockedStore.getJoins).thenReturn(mockJoinRegistry) val result = controller.list(None, None).apply(FakeRequest()) + verify(mockedStore, times(1)).getJoins
Also applies to: 52-53
hub/test/controllers/SearchControllerSpec.scala (4)
43-49: Consider making test data verification more maintainable.

The test correctly verifies the response type and content, but relies on hardcoded values for verification. Consider using values derived from `mockJoinRegistry` to make the test more maintainable.

```diff
- items.map(_.name.toInt).toSet mustBe Set(1, 10, 11, 12, 13, 14, 15, 16, 17, 18)
+ val expectedItems = mockJoinRegistry.map(_.name.toInt).filter(_.toString.contains("1")).take(controller.defaultLimit).toSet
+ items.map(_.name.toInt).toSet mustBe expectedItems
```
Line range hint 54-69: Document the pagination test logic.

The pagination test case could benefit from clearer documentation explaining the expected results and the test data structure.

```diff
  "send results in a paginated fashion correctly" in {
    when(mockedStore.getJoins).thenReturn(mockJoinRegistry)
+   // Test pagination with:
+   // - mockJoinRegistry containing joins with IDs 0-99
+   // - filtering for IDs containing "1"
+   // - starting from offset 3 (4th item)
+   // - limiting to 6 items
    val startOffset = 3
    val number = 6
    val result = controller.search("1", Some(startOffset), Some(number)).apply(FakeRequest())
-   // we have names: 0, 1, 2, .. 99
-   // our result should give us: 1, 10, 11, 12, .. 19, 21, 31, .. 91
```
73-80: Enhance mock data generation for better test coverage.

The mock join generation could be improved to:
- Support different types of group by configurations
- Vary the number of features
- Include edge cases
This would help ensure the tests cover a wider range of scenarios.
```diff
 object MockJoinService {
   def generateMockJoin(id: String): Join = {
-    val groupBys = Seq(GroupBy("my_groupBy", Seq("g1", "g2")))
-    Join(id, Seq("ext_f1", "ext_f2", "d_1", "d2"), groupBys)
+    // Vary the number of group bys and their values based on the id
+    val numGroupBys = (id.toInt % 3) + 1 // 1 to 3 group bys
+    val groupBys = (1 to numGroupBys).map(i =>
+      GroupBy(s"groupBy_$i", Seq(s"g${i}_1", s"g${i}_2"))
+    )
+
+    // Vary the number of features based on the id
+    val numFeatures = (id.toInt % 5) + 2 // 2 to 6 features
+    val features = (1 to numFeatures).map(i => s"feature_$i")
+
+    Join(id, features, groupBys)
   }
```
Inconsistency found in store dependency
While the test cases correctly use `getJoins` and join-based functionality, the `MonitoringModelStore` type is still being used for mocking. This should be updated to use the join-specific store type for consistency with the join-based transition.

```scala
val mockedStore: MonitoringModelStore = mock(classOf[MonitoringModelStore])
```

should be updated to use the join-specific store type.

🔗 Analysis chain
Line range hint 3-26: Verify complete transition from models to joins.

The import changes and class setup correctly reflect the transition to join-based functionality. However, let's verify that all model-related code has been properly migrated.
Let me gather more information to verify if there are any remaining model-related dependencies in the SearchControllerSpec file specifically.
Let me check one final aspect to see if there are any actual usages of the MonitoringModelStore in the test cases.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check for any remaining model-related references that might have been missed

# Search for any remaining model-related imports or usages
rg -i "mockdataservice|mockmodelregistry" --type scala

# Search for any remaining model-related classes or methods
ast-grep --pattern 'class $_ extends $_ { $$$ getModels($$$) { $$$ } $$$ }'
```

Length of output: 653
Script:
```bash
#!/bin/bash
# Check for any model-related code specifically in SearchControllerSpec
rg -l "Model" hub/test/controllers/SearchControllerSpec.scala

# Check for any references to monitoring models or stores
rg "MonitoringModel|getModels" hub/test/controllers/SearchControllerSpec.scala

# Look for any test cases that might still be using model-based approach
ast-grep --pattern 'describe("$_") { $$$ "should $$$" in { $$$ getModels($$$) $$$ } $$$ }'
```

Length of output: 415
Script:
```bash
#!/bin/bash
# Get the full content of SearchControllerSpec to analyze the test cases
rg -A 50 "class SearchControllerSpec" hub/test/controllers/SearchControllerSpec.scala

# Check for any method calls on mockedStore
rg "mockedStore\." hub/test/controllers/SearchControllerSpec.scala
```

Length of output: 2315
hub/test/store/MonitoringModelStoreTest.scala (2)
44-47: Consider updating variable name for consistency.

The variable name `dynamoDBMonitoringStore` should be updated to match the new implementation:

```diff
- val dynamoDBMonitoringStore = new MonitoringModelStore(api)
- when(kvStore.list(any())).thenReturn(generateListResponse())
-
- validateLoadedConfigs(dynamoDBMonitoringStore)
+ val monitoringStore = new MonitoringModelStore(api)
+ when(kvStore.list(any())).thenReturn(generateListResponse())
+
+ validateLoadedConfigs(monitoringStore)
```
50-50: Update parameter name for consistency.

The parameter name should reflect the new type:

```diff
- private def validateLoadedConfigs(dynamoDBMonitoringStore: MonitoringModelStore): Unit = {
+ private def validateLoadedConfigs(monitoringStore: MonitoringModelStore): Unit = {
```

hub/conf/routes (3)
16-16: Consider documenting query parameters.

The endpoint has numerous query parameters that would benefit from documentation: `metricType`, `metrics`, `granularity`, `offset`, `algorithm`.
This documentation would help API consumers understand:
- Valid values for each parameter
- Default values
- Parameter dependencies or constraints
18-19: Address concerns about temporary in-memory KV store.

While this endpoint addresses the immediate DynamoDB ingestion issues mentioned in the PR objectives, several concerns need attention:
- Data persistence strategy in case of service restart
- Memory management and potential OOM issues
- Request size limits for bulk operations
- Timeline for replacing this temporary solution
Consider:
- Adding request size limits
- Implementing data persistence
- Planning for a more permanent solution
Remove or document commented API endpoints
The commented-out endpoints appear to be unused across the codebase, as our searches found no references to these endpoints outside of the routes file. Since these endpoints are part of a transition from a model-based to join-based approach, they should either be:
- Removed if they're no longer needed in the new architecture
- Documented with a clear migration timeline if they're planned for future implementation
hub/conf/routes: Remove the commented endpoints or add a documentation block explaining the migration plan and timeline.

🔗 Analysis chain
Line range hint 22-31: Document the API transition plan.

There's a significant block of commented-out endpoints with a TODO note about moving to fine-grained endpoints. This raises several concerns:
- The timeline for implementing these fine-grained endpoints
- Whether these endpoints are still needed given the new join-based approach
- Potential impact on API consumers who might be expecting these endpoints
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check if any of these endpoints are referenced in the codebase
rg "GET.*?/api/v1/feature/[^/]+/(percentiles|values|drift|skew|compare)" --type ts --type js
```

Length of output: 94
Script:
```bash
#!/bin/bash
# Let's try a broader search to see if these endpoints are referenced anywhere in the codebase
rg -l "/api/v1/feature.*?/(percentiles|values|drift|skew|compare)"

# Also check for any documentation or migration plans
fd -e md -e txt | xargs rg -l "api/v1/feature.*?/(percentiles|values|drift|skew)"

# Check the TimeSeriesController implementation
ast-grep --pattern 'class TimeSeriesController { $$$ fetchFeature($$$) { $$$ } $$$ }'
```

Length of output: 268
Script:
```bash
#!/bin/bash
# Let's check the git history of these routes to understand when they were commented out
git log -p hub/conf/routes | grep -A 5 "^[-+].*api/v1/feature.*/(percentiles|values|drift|skew|compare)"

# Check for any related configuration or feature flags
rg -l "feature.*?(percentiles|values|drift|skew|compare)" --type yaml --type conf --type properties

# Look for any TODO comments related to API migration
rg "TODO.*api" --type scala --type conf
```

Length of output: 201
spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemo.scala (2)
105-106: Consider adding detailed success logging.

While the completion message is clear, consider adding more detailed logging about the number of summaries uploaded and their status. This would help with monitoring and debugging in production.

```diff
- println("Done uploading summaries! \uD83E\uDD73".green)
+ val summaryCount = tableUtils.loadTable(packedTable).count()
+ logger.info(s"Successfully uploaded $summaryCount summaries! \uD83E\uDD73".green)
```
Line range hint 1-110: Document the script's role in the new architecture.

As this script is now part of the critical path for connecting Spark-based drift metrics computation with Hub, consider adding documentation that explains:
- The script's role in the overall data flow
- The interaction pattern with Hub via HTTPKVStore
- Expected data volumes and performance characteristics
- Recovery procedures in case of failures
.github/workflows/test_scala_and_python.yaml (2)
185-188: Consider moving hub tests to a dedicated job.

Given that the hub component now handles bulk operations and drift metrics computation, it would be better to:
- Move these tests to a dedicated job for better isolation
- Configure appropriate resource limits
- Enable parallel execution with other tests
Example structure:
```yaml
hub_tests:
  runs-on: ubuntu-latest
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  steps:
    - uses: actions/checkout@v4
    - name: Run hub tests
      run: |
        export SBT_OPTS="-Xmx8G -Xms2G"
        sbt "++ 2.12.18 hub/test"
```
🪛 yamllint (1.35.1)
[error] 188-188: no new line character at the end of file
(new-line-at-end-of-file)
188-188: Add newline at end of file.

Add a newline character at the end of the file to follow YAML best practices and resolve the yamllint warning.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 188-188: no new line character at the end of file
(new-line-at-end-of-file)
hub/app/store/MonitoringModelStore.scala (1)
Line range hint 39-45: Review TTLCache configuration with InMemoryKV store.

Since we're transitioning to an InMemoryKV store, consider reviewing the TTLCache configuration (10 seconds timeout) to ensure it aligns with the new storage backend's characteristics. You might want to:
- Adjust the timeout based on the InMemoryKV store's performance characteristics (see the configuration sketch after this list)
- Consider adding metrics for cache hits/misses to monitor effectiveness
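A minimal sketch of making that timeout configurable rather than hard-coded, assuming a Play `Configuration` instance is available; the config key below is hypothetical:

```scala
import play.api.Configuration
import scala.concurrent.duration._

// Hypothetical settings holder: reads the cache TTL from application.conf so it can be
// tuned per storage backend without a code change. The key name is an assumption.
class MonitoringCacheSettings(config: Configuration) {
  val configCacheTtl: FiniteDuration =
    config.getOptional[FiniteDuration]("monitoring.store.cache.ttl").getOrElse(10.seconds)
}
```

The resolved duration could then be threaded into wherever the store currently hard-codes its 10-second TTL, with hit/miss counters emitted alongside it.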
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1)
202-202: Track and clarify the TODO comment.

The TODO suggests two different approaches but lacks context about the tradeoffs and implementation plan.
Would you like me to:
- Create a GitHub issue to track this technical debt?
- Help evaluate the tradeoffs between using a well-known list vs. exposing as part of summaries?
The issue should capture:
- Current implementation limitations
- Pros and cons of each approach
- Impact on the drift metrics computation pipeline
- Dependencies and prerequisites
build.sbt (1)
136-139: Consider managing the STTP client version consistently.

While the STTP client is marked for removal, consider defining its version as a variable at the top of the file, consistent with other dependency version management practices.
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)
Line range hint 1-1000: Consider implementing a consistent logging strategy.

The codebase shows a pattern of replacing structured logging with println statements. For a production-grade Spark application, it's crucial to:
- Use a consistent logging framework throughout
- Include appropriate log levels (INFO, ERROR, etc.)
- Maintain structured logging for better observability
- Consider implementing a logging facade for better flexibility (see the sketch after this list)
This will improve:
- Production monitoring capabilities
- Debugging efficiency
- Log aggregation and analysis
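A minimal sketch of the facade idea, using the SLF4J API that Spark applications typically already have on the classpath; the object name and messages are illustrative, not the PR's actual code:

```scala
import org.slf4j.LoggerFactory

// Illustrative replacement for ad-hoc println calls: leveled logging that
// log aggregators can filter by severity.
object TableUtilsLoggingSketch {
  private val logger = LoggerFactory.getLogger(getClass)

  def reportInsert(tableName: String, rowCount: Long): Unit =
    logger.info(s"Inserted $rowCount rows into table $tableName")

  def reportFailure(tableName: String, error: Throwable): Unit =
    logger.error(s"Failed writing to table $tableName", error)
}
```

Routing everything through one small object like this also makes it easy to swap the backend (e.g. Log4j vs. Logback) without touching call sites.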
hub/app/controllers/InMemKVStoreController.scala (3)
43-44: Avoid exposing internal error messages to clients.

Returning the raw error message from `decode` can expose internal details to the client, which is not a best practice for security reasons.

Consider logging the detailed error message and returning a generic message to the client:

```diff
  case Left(error) =>
+   logger.error("Failed to decode PutRequests", error)
-   Future.successful(BadRequest(error.getMessage))
+   Future.successful(BadRequest("Invalid request payload"))
```
29-30: Simplify request parsing using `parse.json`.

The current implementation uses `parse.raw` and manually handles byte conversion and JSON decoding, which can be simplified.

Consider using `Action.async(parse.json)` to automatically parse the JSON body into a `JsValue`, reducing manual parsing and potential errors:

```diff
-import play.api.mvc.RawBuffer
+import play.api.libs.json.JsValue
+import play.api.libs.json.Json
 ...
-def bulkPut(): mvc.Action[RawBuffer] =
-  Action(parse.raw).async { request =>
-    request.body.asBytes() match {
-      case Some(bytes) =>
-        decode[Array[PutRequest]](bytes.utf8String) match {
+def bulkPut(): Action[JsValue] =
+  Action.async(parse.json) { request =>
+    request.body.validate[Array[PutRequest]] match {
+      case JsSuccess(putRequests, _) =>
         // Handle success
-      case Left(error) => Future.successful(BadRequest(error.getMessage))
+      case JsError(errors) =>
+        Future.successful(BadRequest("Invalid JSON format"))
```

This approach leverages Play's JSON utilities for parsing and validating the request body.
50-60: Validate Base64 encoding and decoding.

The custom encoders and decoders for `Array[Byte]` using Base64 do not handle potential decoding errors.

Ensure that any invalid Base64 strings are properly handled to prevent runtime exceptions. Consider adding error handling or input validation when decoding:

```scala
implicit val byteArrayDecoder: Decoder[Array[Byte]] = Decoder.decodeString.emap { str =>
  try {
    Right(Base64.getDecoder.decode(str))
  } catch {
    case _: IllegalArgumentException => Left("Invalid Base64 string")
  }
}
```
87-112: Simplify nested `Future` and `Try` handling in `doFetchJoinDrift`.

The current implementation matches over a `Try[Future[...]]`, which introduces nested `Future` and `Try` constructs. This can complicate error handling and readability. Consider using `Future.fromTry` to flatten the `Try[Future[T]]` into a `Future[T]`, simplifying the subsequent code.

Apply this refactor to simplify the handling:

```diff
- val maybeDriftSeries = driftStore.getDriftSeries(joinPath, driftMetric, window, startTs, endTs)
- maybeDriftSeries match {
-   case Failure(exception) =>
-     Future.successful(InternalServerError(s"Error computing join drift - ${exception.getMessage}"))
-   case Success(driftSeriesFuture) =>
-     driftSeriesFuture.map { driftSeries =>
-       // processing logic
-     }
- }
+ val driftSeriesFuture = Future.fromTry(maybeDriftSeries).flatMap(identity)
+ driftSeriesFuture.map { driftSeries =>
+   // processing logic
+ }.recover {
+   case exception =>
+     InternalServerError(s"Error computing join drift - ${exception.getMessage}")
+ }
```
140-196: Simplify asynchronous error handling in `doFetchFeatureDrift`.

In the `doFetchFeatureDrift` method, the nested matching over `Try[Future[T]]` can be refactored to improve clarity. Utilize `Future.fromTry` and combine the futures appropriately to handle errors more gracefully.

Refactor suggestion:

```diff
- val maybeDriftSeries = driftStore.getDriftSeries(...)
- maybeDriftSeries match {
-   case Failure(exception) =>
-     Future.successful(InternalServerError(...))
-   case Success(driftSeriesFuture) =>
-     driftSeriesFuture.map { driftSeries =>
-       // processing logic
-     }
- }
+ val driftSeriesFuture = Future.fromTry(maybeDriftSeries).flatMap(identity)
+ driftSeriesFuture.map { driftSeries =>
+   // processing logic
+ }.recover {
+   case exception =>
+     InternalServerError(s"Error computing feature drift - ${exception.getMessage}")
+ }
```

Similarly, when dealing with multiple futures, consider using `Future.sequence` and handling errors with combinators like `recover` or `recoverWith`
.
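A minimal self-contained sketch of that combination (the method and type names are illustrative, not the controller's actual code):

```scala
import scala.concurrent.{ExecutionContext, Future}

object FutureCombinatorsSketch {
  // Illustrative only: run several drift-series lookups concurrently, flatten the results,
  // and degrade gracefully if any lookup fails, instead of nesting Try/Future matches.
  def combineSeries[T](parts: Seq[Future[Seq[T]]])(implicit ec: ExecutionContext): Future[Seq[T]] =
    Future
      .sequence(parts)
      .map(_.flatten)
      .recover { case _: Exception => Seq.empty[T] }
}
```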
265-269: Handle unsupported algorithms more gracefully in `parseAlgorithm`.

In the `parseAlgorithm` method, any algorithm not matching the predefined options results in `None`. Consider providing a default algorithm or a more informative error message to guide users.

Suggestion:

```diff
- case _ => None
+ case Some(unknown) => Some(DriftMetric.DEFAULT)
+ // or return an error message indicating the unsupported algorithm
```
277-279: Remove commented-out code or annotate with TODO if planned for future use.

The cases for `"skew"` and `"ooc"` are commented out in `parseMetricChoice`. If these metrics are not intended to be supported, consider removing the commented code to keep the codebase clean. If they are planned for future implementation, add a TODO comment for clarity.
131-136: Improve error messages for invalid parameters.

The error responses for invalid `metricChoice`, `metricRollup`, and `granularity` provide basic feedback. Enhance these messages to include the invalid value received, aiding debugging and user experience.

Apply this diff to improve error messages:

```diff
- Future.successful(BadRequest("Invalid metric choice. Expect drift"))
+ Future.successful(BadRequest(s"Invalid metric choice '$metricType'. Expect 'drift'"))
```

hub/test/controllers/TimeSeriesControllerSpec.scala (2)
38-38: Use a dedicated `ExecutionContext` for test isolation.

At line 38, `ExecutionContext.global` is used as the implicit `ExecutionContext`. Using a global execution context in tests can lead to unpredictable behavior due to shared resources. Consider using a dedicated or stubbed `ExecutionContext` for better test isolation.

Apply this diff to use the execution context from `stubControllerComponents()`:

```diff
- implicit val ec: ExecutionContext = ExecutionContext.global
+ implicit val ec: ExecutionContext = stubCC.executionContext
```
269-275: Optimize data generation for test efficiency.

Generating a large number of timestamps and data points between
startTs
andendTs
with hourly intervals can slow down tests if the time range is extensive.Consider limiting the number of data points for tests or adjusting the interval to ensure tests run efficiently without sacrificing coverage.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (35)
- .github/workflows/test_scala_and_python.yaml (1 hunks)
- build.sbt (3 hunks)
- docker-init/Dockerfile (1 hunks)
- docker-init/demo/Dockerfile (0 hunks)
- docker-init/demo/README.md (1 hunks)
- docker-init/demo/build.sh (0 hunks)
- docker-init/demo/load_summaries.sh (1 hunks)
- docker-init/demo/run.sh (0 hunks)
- docker-init/start.sh (0 hunks)
- hub/app/controllers/InMemKVStoreController.scala (1 hunks)
- hub/app/controllers/JoinController.scala (1 hunks)
- hub/app/controllers/ModelController.scala (1 hunks)
- hub/app/controllers/Paginate.scala (1 hunks)
- hub/app/controllers/SearchController.scala (2 hunks)
- hub/app/controllers/TimeSeriesController.scala (3 hunks)
- hub/app/model/Model.scala (2 hunks)
- hub/app/module/DriftStoreModule.scala (1 hunks)
- hub/app/module/DynamoDBModule.scala (0 hunks)
- hub/app/module/InMemoryKVStoreModule.scala (1 hunks)
- hub/app/module/ModelStoreModule.scala (1 hunks)
- hub/app/store/MonitoringModelStore.scala (4 hunks)
- hub/conf/application.conf (1 hunks)
- hub/conf/routes (1 hunks)
- hub/test/controllers/JoinControllerSpec.scala (1 hunks)
- hub/test/controllers/ModelControllerSpec.scala (2 hunks)
- hub/test/controllers/SearchControllerSpec.scala (4 hunks)
- hub/test/controllers/TimeSeriesControllerSpec.scala (4 hunks)
- hub/test/store/MonitoringModelStoreTest.scala (2 hunks)
- online/src/main/scala/ai/chronon/online/HTTPKVStore.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/MetadataEndPoint.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/Driver.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemo.scala (3 hunks)
💤 Files with no reviewable changes (5)
- docker-init/demo/Dockerfile
- docker-init/demo/build.sh
- docker-init/demo/run.sh
- docker-init/start.sh
- hub/app/module/DynamoDBModule.scala
✅ Files skipped from review due to trivial changes (2)
- online/src/main/scala/ai/chronon/online/MetadataEndPoint.scala
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
🧰 Additional context used
📓 Learnings (3)
hub/app/controllers/SearchController.scala (1)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#22
File: hub/app/controllers/SearchController.scala:55-58
Timestamp: 2024-11-12T09:38:33.532Z
Learning: Enhancements to search functionality in `SearchController` are planned for future updates; the current implementation is acceptable for now.
hub/test/controllers/SearchControllerSpec.scala (1)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#44
File: hub/app/controllers/ModelController.scala:15-18
Timestamp: 2024-11-12T09:38:33.532Z
Learning: References to `MockDataService` in `hub/test/controllers/SearchControllerSpec.scala` and `hub/test/controllers/ModelControllerSpec.scala` are needed for tests and should not be removed.
hub/test/store/MonitoringModelStoreTest.scala (1)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#44
File: hub/test/store/DynamoDBMonitoringStoreTest.scala:69-86
Timestamp: 2024-11-12T09:38:33.532Z
Learning: In `hub/test/store/DynamoDBMonitoringStoreTest.scala`, the current implementation of the `generateListResponse` method is acceptable as-is, and changes for resource handling and error management are not necessary at this time.
🪛 yamllint (1.35.1)
.github/workflows/test_scala_and_python.yaml
[error] 188-188: no new line character at the end of file
(new-line-at-end-of-file)
🪛 Shellcheck (0.10.0)
docker-init/demo/load_summaries.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
🔇 Additional comments (36)
hub/app/controllers/Paginate.scala (1)
8-8: LGTM! Good generalization to support multiple types.

The change from `Seq[Model]` to `Seq[T]` improves reusability and type safety, allowing the pagination functionality to work with any type while maintaining compile-time type checking.
docker-init/demo/load_summaries.sh (1)
3-12
:
Verify resource allocation and add error handling
The Spark configuration needs attention in several areas:
- The memory allocation (8GB) and result size (6GB) are quite high for local testing
- Network binding to 0.0.0.0 might expose the service unnecessarily
- Missing error handling for container state and job execution
Run this script to verify container memory limits:
Apply this diff to add error handling and container validation:
#!/bin/bash
+
+set -euo pipefail
+
+# Verify container is running
+if ! docker-compose -f docker-init/compose.yaml ps app | grep -q "Up"; then
+ echo "Error: app container is not running"
+ echo "Please start containers first using docker-compose up -d"
+ exit 1
+fi
+
+# Configure memory based on available resources
+DRIVER_MEMORY="8g"
+RESULT_SIZE="6g"
+
+# Validate memory configuration
+if ! docker-compose -f docker-init/compose.yaml exec app cat /sys/fs/cgroup/memory/memory.limit_in_bytes > /dev/null 2>&1; then
+ echo "Warning: Unable to verify container memory limits"
+ echo "Proceeding with default memory configuration"
+fi
docker-compose -f docker-init/compose.yaml exec app /opt/spark/bin/spark-submit \
--master "local[*]" \
- --driver-memory 8g \
- --conf "spark.driver.maxResultSize=6g" \
- --conf "spark.driver.memory=8g" \
+ --driver-memory "${DRIVER_MEMORY}" \
+ --conf "spark.driver.maxResultSize=${RESULT_SIZE}" \
+ --conf "spark.driver.memory=${DRIVER_MEMORY}" \
--driver-class-path "/opt/spark/jars/*:/app/cli/*" \
--conf "spark.driver.host=localhost" \
- --conf "spark.driver.bindAddress=0.0.0.0" \
+ --conf "spark.driver.bindAddress=127.0.0.1" \
--class ai.chronon.spark.scripts.ObservabilityDemo \
- /app/cli/spark.jar
+ /app/cli/spark.jar || {
+ echo "Error: Spark job failed"
+ exit 1
+ }
hub/app/module/ModelStoreModule.scala (2)
12-12
:
Empty AWS configuration is potentially problematic
The AWS API is being initialized with an empty configuration map. This seems incorrect and could lead to runtime issues.
Consider utilizing the injected configuration
parameter to properly configure AWS:
- val awsApiImpl = new AwsApiImpl(Map.empty)
+ val awsConfig = configuration.get[Configuration]("aws").underlying
+ val awsApiImpl = new AwsApiImpl(awsConfig)
Likely invalid or redundant comment.
14-14
: Verify singleton binding implications
The MonitoringModelStore is bound as a singleton instance. Please verify that this is the intended behavior and won't cause issues with concurrent access to the in-memory store.
✅ Verification successful
Let me gather more information about the MonitoringModelStore implementation and its usage.
Singleton binding is appropriate for this implementation
The MonitoringModelStore implementation is thread-safe and designed for concurrent access:
- It uses
Future
-based async operations for data retrieval - The state is managed through a
TTLCache
which is designed for concurrent access - The underlying
KVStore
provides its ownExecutionContext
for handling concurrent operations - All mutable state is properly encapsulated in thread-safe constructs
The singleton binding is the correct approach here as the store acts as a shared cache with its own concurrency handling.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for concurrent access patterns to MonitoringModelStore
rg -A 5 "MonitoringModelStore" | grep -i "concurrent|async|future|promise"
# Look for other singleton bindings for comparison
ast-grep --pattern 'bind($_).toInstance($_)'
Length of output: 339
Script:
#!/bin/bash
# Look for MonitoringModelStore implementation and usage
rg -A 10 "class MonitoringModelStore"
# Look for DynamoDBMonitoringStore implementation
rg -A 10 "class DynamoDBMonitoringStore"
# Search for any concurrent/async operations in files related to monitoring store
rg -l "MonitoringModelStore|DynamoDBMonitoringStore" | xargs rg -A 5 "concurrent|async|Future|Promise|synchronized"
Length of output: 5414
hub/app/module/DriftStoreModule.scala (1)
1-9
: LGTM! Clean and well-organized imports.
The imports are logically grouped and all are being used in the implementation.
hub/app/module/InMemoryKVStoreModule.scala (2)
8-9
: LGTM! Clear class purpose documentation.
The comment clearly explains the module's purpose for Docker testing scenarios.
16-16
: Verify thread-safety of the KVStore binding.
Since this is bound as a singleton, verify that InMemoryKvStore
is thread-safe for concurrent access in the testing environment.
✅ Verification successful
Let me gather more information about the thread safety implementation details.
Thread-safety is properly implemented in InMemoryKVStore
The implementation is thread-safe through the use of ConcurrentHashMap
for both the main database and individual tables:
- The class uses
ConcurrentHashMap[DataSet, Table]
for the main database - Each table is implemented as
ConcurrentHashMap[Key, VersionedData]
- The singleton store management is also thread-safe using
ConcurrentHashMap[String, InMemoryKvStore]
These concurrent data structures provide the necessary thread-safety guarantees for the testing environment without requiring additional synchronization mechanisms.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for concurrent usage patterns
rg -A 5 "InMemoryKvStore"
# Look for any synchronization mechanisms
ast-grep --pattern 'class InMemoryKvStore {
$$$
synchronized $$$
$$$
}'
Length of output: 12488
Script:
#!/bin/bash
# Look for concurrent data structures in InMemoryKvStore
rg -A 10 "ConcurrentHashMap|AtomicReference|Concurrent|volatile|synchronized" spark/src/main/scala/ai/chronon/spark/utils/InMemoryKvStore.scala
# Check the implementation of key operations
ast-grep --pattern 'def put($$$) {
$$$
}'
ast-grep --pattern 'def get($$$) {
$$$
}'
ast-grep --pattern 'def create($$$) {
$$$
}'
Length of output: 2247
docker-init/demo/README.md (1)
1-16
: Verify the documented workflow
Let's verify if the documented workflow aligns with the actual implementation and PR objectives.
✅ Verification successful
Let me check the content of devnotes.md
to see if the references to old scripts are just documentation or if they indicate potential inconsistencies.
Documentation accurately reflects the setup and workflow
The verification confirms that:
- The
load_summaries.sh
script exists and is executable - The
docker-compose.yaml
exists and correctly exposes port 9000 as documented - The reference to
build.sh
indevnotes.md
is unrelated to the demo setup, as it's part of the cloud release workflow
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence and permissions of the load_summaries.sh script
# and check for any remaining references to old scripts
# Check if load_summaries.sh exists and is executable
echo "Checking load_summaries.sh..."
if [ -x "docker-init/demo/load_summaries.sh" ]; then
echo "✓ load_summaries.sh exists and is executable"
else
echo "✗ load_summaries.sh not found or not executable"
fi
# Check for any remaining references to old scripts
echo -e "\nChecking for references to removed scripts..."
rg -l "build\.sh|run\.sh" --type md
# Verify docker-compose file exists
echo -e "\nChecking docker-compose.yaml..."
if [ -f "docker-init/compose.yaml" ]; then
echo "✓ compose.yaml exists"
else
echo "✗ compose.yaml not found"
fi
Length of output: 542
Script:
#!/bin/bash
# Check the context of build.sh/run.sh references in devnotes.md
echo "Content of devnotes.md with context around build.sh/run.sh:"
rg "build\.sh|run\.sh" -C 3 devnotes.md
# Also verify the docker-compose.yaml content to ensure it matches the documented command
echo -e "\nVerifying docker-compose.yaml path in compose.yaml:"
rg "9000" docker-init/compose.yaml
Length of output: 542
hub/conf/application.conf (1)
31-33
: LGTM! Module changes align with PR objectives.
The replacement of DynamoDBModule with these new modules aligns well with the PR's goal of addressing DynamoDB performance issues and implementing in-memory storage for data serving.
hub/app/controllers/ModelController.scala (2)
Line range hint 22-38
: LGTM: Pagination logic remains unchanged
The pagination and error handling logic is unaffected by the store change, which is good. The controller maintains its responsibility for input validation and response formatting.
15-15
: Verify MonitoringModelStore compatibility
The constructor parameter type change from DynamoDBMonitoringStore
to MonitoringModelStore
needs verification to ensure interface compatibility.
hub/app/controllers/JoinController.scala (1)
35-35
: Verify MonitoringModelStore implementation
Need to ensure that monitoringStore.getJoins
is properly connected to the InMemoryKV store mentioned in the PR objectives.
✅ Verification successful
Let me gather more information about the MonitoringModelStore implementation and its connection to InMemoryKV.
Let me check one more thing to verify the getJoins implementation and its connection to the KVStore.
Implementation properly connects to InMemoryKV store
The verification confirms that:
MonitoringModelStore
retrieves joins from a cached configuration registry that's backed by the KVStoreInMemoryKVStoreModule
correctly sets up and injects the InMemoryKV store implementation- The
getJoins
method properly transforms the stored thrift join configurations into the required format
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for MonitoringModelStore implementation
ast-grep --pattern 'class MonitoringModelStore {
$$$
getJoins($$$) {
$$$
}
$$$
}'
# Check for InMemoryKV store usage
rg -l "InMemoryKV"
Length of output: 795
Script:
#!/bin/bash
# Search for MonitoringModelStore class definition and implementation
rg -A 10 "class MonitoringModelStore"
# Check InMemoryKVStoreModule implementation
cat hub/app/module/InMemoryKVStoreModule.scala
# Search for MonitoringModelStore instantiation
rg "new MonitoringModelStore"
Length of output: 2687
Script:
#!/bin/bash
# Search for getJoins implementation in MonitoringModelStore
rg -A 10 "def getJoins" hub/app/store/MonitoringModelStore.scala
# Search for any join-related operations in KVStore
rg "def.*[Jj]oin" --type scala
Length of output: 17084
hub/app/controllers/SearchController.scala (3)
5-8
: LGTM: Import changes align with architectural transition
The import changes correctly reflect the transition from model-based to join-based architecture.
38-38
: LGTM: Response handling properly updated
The response handling correctly uses the new SearchJoinResponse
type while maintaining the existing pagination logic.
15-15
: Verify consistent store implementation usage across controllers
The change from DynamoDBMonitoringStore
to MonitoringModelStore
is a good abstraction. Let's verify this pattern is consistently applied.
✅ Verification successful
Let me gather more information to verify if there are other controllers that should be using MonitoringModelStore.
Based on the results, I can now generate the final response as we have enough information about the store usage patterns across controllers.
Store implementation change is consistently applied
The migration from DynamoDBMonitoringStore
to MonitoringModelStore
is consistently implemented across all relevant controllers:
- JoinController
- ModelController
- SearchController
Other controllers (TimeSeriesController, InMemKVStoreController) use different store types (DriftStore
, KVStore
) for their specific purposes, which is correct as they serve different functionality.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining DynamoDBMonitoringStore usage in controllers
# and verify MonitoringModelStore is used consistently
echo "Checking for remaining DynamoDBMonitoringStore usage in controllers:"
rg "DynamoDBMonitoringStore" "hub/app/controllers/"
echo -e "\nVerifying MonitoringModelStore usage pattern:"
ast-grep --pattern 'class $_ @Inject() ($$$, monitoringStore: MonitoringModelStore)'
Length of output: 2919
Script:
#!/bin/bash
# Find all controller files to analyze the dependency pattern
echo "Finding all controllers:"
fd "Controller.scala$" hub/app/controllers/
# Check for any monitoring store related patterns in controllers
echo -e "\nChecking monitoring store patterns in controllers:"
rg -A 2 "Store" hub/app/controllers/
Length of output: 5959
hub/test/controllers/ModelControllerSpec.scala (2)
18-18
: LGTM! Store type changes align with architectural shift
The transition from DynamoDBMonitoringStore
to MonitoringModelStore
aligns with the PR's objective of moving away from DynamoDB to an InMemoryKV store. The test cases remain valid as they focus on controller behavior rather than store implementation.
Also applies to: 27-27
27-27
: Verify interface compatibility between store implementations
Let's ensure the new MonitoringModelStore
maintains the same interface as the previous DynamoDBMonitoringStore
to prevent any breaking changes.
hub/test/store/MonitoringModelStoreTest.scala (2)
24-24
: LGTM! Class rename aligns with architectural changes.
The rename from DynamoDBMonitoringStoreTest
to MonitoringModelStoreTest
appropriately reflects the shift from DynamoDB to a more generic store implementation.
Line range hint 24-89
: Verify the test coverage for new store implementation.
The test appears to focus on configuration loading and model validation. Given the architectural shift from DynamoDB to InMemoryKV store, we should verify that all critical paths are tested.
Let's check for any missing test coverage:
✅ Verification successful
Let me gather more information about the implementation and test coverage.
Let me check for more details about the public interface and configuration registry.
Based on the gathered information, I can now provide the final response:
Test coverage appears adequate for the current implementation
The MonitoringModelStore has a minimal public interface with just two main methods:
getConfigRegistry
: Tested through multiple assertions on models, groupBys, joins, and stagingQueriesgetModels
: Explicitly tested with detailed validation of model properties
The implementation uses a TTL cache for configuration and handles the DynamoDB to InMemoryKV store transition internally. The test covers both the loading of configurations and validation of the loaded data, which are the critical paths for this store implementation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check test coverage for the new MonitoringModelStore implementation
# Find the main store implementation
echo "Looking for MonitoringModelStore implementation..."
rg -l "class MonitoringModelStore"
# Check for additional test files
echo "Looking for related test files..."
fd -e scala -e java . | rg -i "test.*monitoring.*store"
# Look for untested public methods
echo "Looking for potentially untested public methods..."
ast-grep --pattern 'class MonitoringModelStore {
$$$
def $name($$$): $ret {
$$$
}
$$$
}'
Length of output: 629
Script:
#!/bin/bash
# Get the implementation details of MonitoringModelStore
echo "=== MonitoringModelStore Implementation ==="
rg -A 5 "class MonitoringModelStore" hub/app/store/MonitoringModelStore.scala
# Find all public methods in the implementation
echo -e "\n=== Public Methods in MonitoringModelStore ==="
rg "def \w+\(" hub/app/store/MonitoringModelStore.scala
# Find test method calls to these public methods
echo -e "\n=== Test Coverage in MonitoringModelStoreTest ==="
rg "dynamoDBMonitoringStore\.\w+" hub/test/store/MonitoringModelStoreTest.scala
Length of output: 1321
Script:
#!/bin/bash
# Get more context around the MonitoringModelStore implementation
echo "=== Full MonitoringModelStore Class ==="
rg -A 20 "class MonitoringModelStore" hub/app/store/MonitoringModelStore.scala
# Look for ConfigRegistry related code
echo -e "\n=== ConfigRegistry Implementation ==="
rg "ConfigRegistry" -A 5 hub/app/store/MonitoringModelStore.scala
# Check for any initialization or setup methods
echo -e "\n=== Initialization Methods ==="
rg "def (init|load|setup|refresh)" hub/app/store/MonitoringModelStore.scala
Length of output: 1355
hub/conf/routes (2)
7-8: Clarify the transition plan for model timeseries endpoint.
The endpoint is marked as TBD, but there's no clear indication of:
- The replacement approach for model predictions and drift
- Timeline for implementing the new approach
- Impact on existing consumers of this API
Would you like me to help create a GitHub issue to track this transition?
4-4: LGTM! Verify JoinController implementation.
The new joins endpoint follows RESTful conventions and maintains consistency with existing list endpoints by supporting pagination.
✅ Verification successful
JoinController implementation verified and properly handles pagination
The controller exists and is properly implemented with:
- Input validation for offset and limit parameters
- Appropriate error responses for invalid inputs
- Default values and max limit handling
- Unit tests are also present in JoinControllerSpec.scala
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify the JoinController implementation exists
ast-grep --pattern 'class JoinController {
$$$
def list(offset: Option[Int], limit: Option[Int]) {
$$$
}
}'
Length of output: 119
Script:
#!/bin/bash
# Let's try a broader search for JoinController using ripgrep
rg "class JoinController" -A 10
# Also search for the file itself
fd "JoinController.scala"
Length of output: 1956
Script:
#!/bin/bash
# Let's examine the full implementation of the list method
rg "def list" -A 10 hub/app/controllers/JoinController.scala
Length of output: 597
spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemo.scala (1)
5-5: Verify complete removal of drift-related imports
The addition of HTTPKVStore import aligns with the PR's objective of connecting with Hub. However, let's ensure all drift-related imports have been properly cleaned up.
✅ Verification successful
Let me verify if there are any remaining drift-related imports in the ObservabilityDemo.scala file specifically.
Drift-related imports have been properly cleaned up
The verification confirms that there are no remaining drift-related imports or references in ObservabilityDemo.scala. The import changes are clean and align with the PR's objectives of removing drift-related functionality.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any remaining drift-related imports that might have been missed
rg "import.*Drift" --type scala
Length of output: 2622
Script:
#!/bin/bash
# Search for any remaining drift-related imports or usages in ObservabilityDemo.scala
rg "Drift" spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemo.scala
Length of output: 80
.github/workflows/test_scala_and_python.yaml (1)
183-184: LGTM! Integration follows established patterns
The new hub tests step:
- Maintains consistency with other test configurations
- Uses the same SBT version (2.12.18)
- Follows the same memory configuration pattern
Also applies to: 185-188
hub/app/store/MonitoringModelStore.scala (3)
34-37: LGTM! Good architectural change.
The renaming from DynamoDBMonitoringStore to MonitoringModelStore and the corresponding variable changes make the implementation more generic and not tied to a specific storage backend. This aligns well with the PR objectives of transitioning to an InMemoryKV store.
62-62: Verify the impact of output columns change.
The change to use getOrElse("derivations", Array.empty) instead of flattening all values might exclude output columns from other groups. Please verify this is intentional and doesn't break existing functionality.
95-95: Verify bulk operation handling in InMemoryKV store.
Given the PR objectives mention previous performance issues with DynamoDB ingestion, please verify that the new InMemoryKV store properly handles bulk operations without rejected execution exceptions.
✅ Verification successful
Let me gather more information about the InMemoryKV store implementation.
Let me try a different approach to locate the InMemoryKvStore implementation.
Let me check the specific list implementation in the InMemoryKvStore.
Based on the implementation details of InMemoryKvStore, I can now provide the final response:
InMemoryKV store is safe for bulk operations
The InMemoryKvStore implementation is well-suited for bulk operations because:
- It uses ConcurrentHashMap for thread-safe operations
- The bulkPut method is specifically designed for batch uploads
- There are no throttling or rate limits that could cause rejected execution exceptions
- Operations are performed in-memory without the network/IO constraints that affected DynamoDB
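As a quick way to exercise this path, a rough bulk-ingestion smoke test could look like the sketch below. It leans on the KVStore.PutRequest shape and the InMemoryKvStore.build signature shown elsewhere in this review; the dataset name, item count, and timeout are made-up values, not anything from this PR.
import ai.chronon.online.KVStore.PutRequest
import ai.chronon.spark.utils.InMemoryKvStore
import scala.concurrent.Await
import scala.concurrent.duration._

object BulkPutSmokeTest {
  def main(args: Array[String]): Unit = {
    // Build a throwaway store; the null TableUtils provider mirrors the hub module's current usage.
    val store = InMemoryKvStore.build("bulk-smoke-test", () => null)
    store.create("TEST_SUMMARY_DATASET")

    // Generate a reasonably large batch to mimic the summary upload volume.
    val puts = (1 to 10000).map { i =>
      PutRequest(s"key-$i".getBytes, s"value-$i".getBytes, "TEST_SUMMARY_DATASET", Some(System.currentTimeMillis()))
    }

    // multiPut returns a Future[Seq[Boolean]]; every entry should come back true.
    val results = Await.result(store.multiPut(puts), 30.seconds)
    assert(results.forall(identity), "bulk put should succeed without rejected executions")
  }
}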
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for InMemoryKV store implementation details
rg -A 10 "class.*InMemoryKV.*Store"
# Look for any bulk operation handling or rate limiting
rg -l "bulk|batch|throttle|limit" --type scala
Length of output: 11975
Script:
#!/bin/bash
# Search for InMemoryKvStore implementation details
ast-grep --pattern 'class InMemoryKvStore {
$$$
}'
# Look for list method implementation in InMemoryKvStore
rg -A 10 "def list.*InMemoryKvStore"
# Check for any rate limiting or bulk operation handling in InMemoryKvStore
rg -A 10 "throttle|limit|bulk|batch" --type scala "InMemoryKvStore"
Length of output: 217
Script:
#!/bin/bash
# Search for InMemoryKvStore implementation with correct casing
rg -A 20 "class.*InMemoryKvStore" --type scala
# Look for the list method implementation
rg -A 10 "def list.*KVStore" --type scala
# Search in utils directory since previous output showed InMemoryKvStore in spark/utils
fd -t f "InMemoryKvStore.scala"
Length of output: 2521
Script:
#!/bin/bash
# Get the full content of InMemoryKvStore.scala to examine list implementation
cat spark/src/main/scala/ai/chronon/spark/utils/InMemoryKvStore.scala
# Look for any test files that verify bulk operations
rg -l "InMemoryKvStore.*list" --type scala
Length of output: 6241
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1)
201-223: Verify integration points with Spark and frontend.
Since this code connects drift metrics computation between Spark and the frontend, please ensure:
- These percentile values match the ones used in Spark computations
- The frontend is prepared to handle this fixed set of percentiles
- Performance impact of processing all these percentiles has been evaluated
Let's verify the consistency across the codebase:
build.sbt (2)
81-85: LGTM! Well-structured circe dependency definition.
The grouping of circe dependencies with a shared version variable follows good practices for dependency management.
251-251: Verify completeness of hub dependencies for new features.
The addition of spark and circe dependencies aligns with the PR objectives for drift metrics computation and JSON handling. However, let's verify that all required dependencies for the new InMemoryKV store and HTTP endpoints are included.
Also applies to: 262-262
✅ Verification successful
All required dependencies are present for new features
Based on the imports analysis:
- Spark dependencies are correctly included, used by InMemoryKvStore in InMemoryKVStoreModule
- Circe dependencies are present and used across controllers for JSON handling
- Play Framework dependencies are properly imported for HTTP endpoints
- AWS dependencies are available for DynamoDB integration
- All necessary testing dependencies (mockito, scalatest) are included
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all required dependencies are declared for new features
# Check for InMemoryKV and HTTP endpoint related imports in hub
echo "Checking for imports in hub module..."
rg -t scala "^import" hub/app/
# Check for any potential missing dependencies in hub tests
echo "Checking for imports in hub tests..."
rg -t scala "^import" hub/test/
Length of output: 14769
spark/src/main/scala/ai/chronon/spark/Driver.scala (2)
21-21: LGTM!
The import statement is correctly placed and necessary for the MetadataDataset constant usage.
569-569: Great improvement using MetadataDataset constant!
Replacing the string literal "ZIPLINE_METADATA" with the MetadataDataset constant is a good practice as it:
- Centralizes the dataset name definition
- Reduces the risk of typos
- Makes maintenance easier
hub/app/controllers/InMemKVStoreController.scala (1)
32-33: Ensure consistent character encoding
Using bytes.utf8String assumes that the request body is encoded in UTF-8, which might not always be the case.
Verify that all clients sending requests to this endpoint use UTF-8 encoding. If not, consider specifying the character encoding explicitly or handling different encodings as needed.
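A minimal sketch of honoring the request's declared charset instead of hard-coding UTF-8 is shown below; it relies on Play's RequestHeader.charset and Akka's ByteString.decodeString, and the helper name is illustrative rather than part of the controller today.
import java.nio.charset.{Charset, StandardCharsets}
import play.api.mvc.{RawBuffer, Request}

def bodyAsString(request: Request[RawBuffer]): Option[String] =
  request.body.asBytes().map { bytes =>
    // Fall back to UTF-8 only when the client did not declare a charset in Content-Type.
    val charset = request.charset.map(Charset.forName).getOrElse(StandardCharsets.UTF_8)
    bytes.decodeString(charset)
  }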
hub/app/controllers/TimeSeriesController.scala (3)
149-149: Double-check the restriction on raw granularity for drift metrics
The message "We don't support Raw granularity for drift metric types" may limit future functionality. Confirm whether this restriction is necessary, and if not, consider implementing support for raw granularity.
Would you like assistance in evaluating support for raw granularity?
53-54: Update all calls to fetchFeature to include the new join parameter
The fetchFeature method now requires a join parameter. Ensure that all existing calls to this method have been updated to include the new parameter to prevent compilation errors or unexpected behavior.
Run the following script to identify method calls to fetchFeature that may need updating:
✅ Verification successful
All calls to fetchFeature already include the new join parameter
Based on the search results, all invocations of fetchFeature in the codebase are in the test file TimeSeriesControllerSpec.scala, and they all correctly include the join parameter as the first argument with the value "my_join". No updates are needed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find all invocations of 'fetchFeature' that may not include the new 'join' parameter.
# Search for invocations of 'fetchFeature' in Scala code
rg --type scala 'fetchFeature\s*\('
Length of output: 2053
26-28: Ensure proper injection of DriftStore and update instantiations accordingly
The TimeSeriesController constructor now includes DriftStore and an implicit ExecutionContext. Verify that all instances where TimeSeriesController is instantiated, including in tests and configuration files, have been updated to include these new parameters to prevent runtime errors.
Run the following script to identify instantiations of TimeSeriesController that may need updating:
✅ Verification successful
Let me check the test file to verify if the instantiation is properly handling the implicit ExecutionContext.
The instantiation of TimeSeriesController is properly handled
The test file shows that the TimeSeriesController is correctly instantiated with all required parameters:
- The ControllerComponents is provided via stubCC
- The DriftStore is provided via mockedStore
- The ExecutionContext is properly defined as an implicit value
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find all instantiations of TimeSeriesController that may not include the new parameters.
# Search for instantiations of TimeSeriesController in Scala files
rg --type scala 'new\s+TimeSeriesController\s*\('
Length of output: 168
Script:
#!/bin/bash
# Check the content of the test file to verify the ExecutionContext handling
cat hub/test/controllers/TimeSeriesControllerSpec.scala
Length of output: 14201
hub/test/controllers/TimeSeriesControllerSpec.scala (1)
310-310: Verify the structure of histogramMap
At line 310, when constructing histogramMap, ensure that it aligns with the expected structure required by tileSummarySeries.setHistogram. Misalignment may lead to unexpected behavior in tests.
Please confirm that the map keys and values match the expected types and formats.
# Kick off the ObsDemo spark job in the app container
🛠️ Refactor suggestion
Add shell script essentials
The script is missing crucial shell script components:
- Shebang directive
- Execution permissions
- Usage instructions and requirements
Apply this diff to improve the script header:
+#!/bin/bash
+
+# Script: load_summaries.sh
+# Purpose: Initiates ObservabilityDemo Spark job in the app container
+#
+# Requirements:
+# - Docker and docker-compose must be running
+# - Containers must be up (see docker-init/demo/README.md)
+#
+# Usage: ./load_summaries.sh
+
# Kick off the ObsDemo spark job in the app container
🧰 Tools
🪛 Shellcheck (0.10.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
package module

import ai.chronon.online.KVStore
import ai.chronon.online.stats.DriftStore
import com.google.inject.AbstractModule

import javax.inject.Inject
import javax.inject.Provider

class DriftStoreModule extends AbstractModule {

  override def configure(): Unit = {
    // TODO swap to concrete api impl in a follow up
    bind(classOf[DriftStore]).toProvider(classOf[DriftStoreProvider]).asEagerSingleton()
  }
}

class DriftStoreProvider @Inject() (kvStore: KVStore) extends Provider[DriftStore] {
  override def get(): DriftStore = {
    new DriftStore(kvStore)
  }
}
💡 Codebase verification
DynamoDB dependencies still present in ModelStoreModule
The transition from DynamoDB to InMemoryKV is incomplete. While DriftStoreModule has been updated, ModelStoreModule still uses DynamoDB:
- hub/app/module/ModelStoreModule.scala is still creating and binding AwsApiImpl
- DynamoDB related constants and metrics are present in MonitoringModelStore.scala
- Test files still reference DynamoDB implementation
🔗 Analysis chain
Verify DynamoDB module replacement.
Since this module replaces the DynamoDB implementation, we should verify the configuration changes.
I'll search for any remaining DynamoDB or AWS related configurations to ensure complete transition.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if DynamoDB module is properly disabled and replaced
rg -g "*.conf" "DynamoDBModule|DriftStoreModule" hub/conf/
Length of output: 136
Script:
#!/bin/bash
# Let's check for any remaining DynamoDB related configurations or modules
rg -g '*.{conf,scala}' -i 'dynamodb|aws' hub/
Length of output: 1753
inMemoryKVStore.create(Constants.MetadataDataset)
inMemoryKVStore.create(Constants.TiledSummaryDataset)
🛠️ Refactor suggestion
Add error handling for dataset creation.
The dataset creation calls could fail silently. Consider:
- Adding error handling
- Logging the creation status
- Ensuring idempotent creation
- inMemoryKVStore.create(Constants.MetadataDataset)
- inMemoryKVStore.create(Constants.TiledSummaryDataset)
+ try {
+ inMemoryKVStore.create(Constants.MetadataDataset)
+ inMemoryKVStore.create(Constants.TiledSummaryDataset)
+ logger.info("Successfully created KV store datasets")
+ } catch {
+ case e: Exception =>
+ logger.error("Failed to create KV store datasets", e)
+ throw e
+ }
Committable suggestion skipped: line range outside the PR's diff.
class InMemoryKVStoreModule extends AbstractModule {

  override def configure(): Unit = {
    val inMemoryKVStore = InMemoryKvStore.build("hub", () => null)
💡 Codebase verification
Passing null as TableUtils provider is unsafe and needs to be fixed
The tableUtils parameter is used in the bulkPut method where it's directly invoked, which would lead to a NullPointerException. While other methods might work without TableUtils, providing null is dangerous because:
- The bulkPut method explicitly calls tableUtils() and uses its return value
- The code comment indicates TableUtils contains SparkContext which is needed for executors
- Other test files properly initialize it with a SparkSession (e.g., in OnlineUtils.scala)
Locations that need attention:
- hub/app/module/InMemoryKVStoreModule.scala: Replace () => null with proper TableUtils initialization
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala: Also using null unsafely
🔗 Analysis chain
Review the null function provider.
The () => null function provider could lead to runtime issues if the underlying implementation attempts to use this provider. Consider providing a meaningful default or documenting why null is safe here.
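If the hub module ever needs bulkPut to work, one option is to build the provider from a local SparkSession instead of passing null. This is only a sketch under the assumption that TableUtils can be constructed from a SparkSession the way the Spark-side tests do; it is not the existing hub wiring.
import org.apache.spark.sql.SparkSession
import ai.chronon.spark.TableUtils

val inMemoryKVStore = InMemoryKvStore.build(
  "hub",
  () => {
    // Lazily create (or reuse) a local SparkSession so bulkPut's tableUtils() call has something real.
    val spark = SparkSession.builder().master("local[*]").appName("hub-kv-store").getOrCreate()
    TableUtils(spark)
  }
)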
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for usages of the value provider in InMemoryKvStore
ast-grep --pattern 'class InMemoryKvStore {
$$$
build($_, $provider) {
$$$
}
$$$
}'
Length of output: 97
Script:
#!/bin/bash
# Let's try to find the InMemoryKvStore implementation using ripgrep
rg -A 10 "class InMemoryKvStore"
# Also search for any references to InMemoryKvStore.build
rg "InMemoryKvStore.build"
# Search for files containing InMemoryKvStore to ensure we're not missing anything
fd "InMemoryKvStore"
Length of output: 1878
Script:
#!/bin/bash
# Let's examine the InMemoryKvStore implementation more closely
cat spark/src/main/scala/ai/chronon/spark/utils/InMemoryKvStore.scala
# Also check the companion object where build method should be defined
rg -A 10 "object InMemoryKvStore"
Length of output: 7580
class InMemoryKVStoreModule extends AbstractModule {

  override def configure(): Unit = {
    val inMemoryKVStore = InMemoryKvStore.build("hub", () => null)
    // create relevant datasets in kv store
    inMemoryKVStore.create(Constants.MetadataDataset)
    inMemoryKVStore.create(Constants.TiledSummaryDataset)
    bind(classOf[KVStore]).toInstance(inMemoryKVStore)
  }
}
💡 Codebase verification
The lifecycle management concern is valid and needs attention
The verification reveals that:
- InMemoryKvStore uses ConcurrentHashMap for storage but lacks cleanup mechanisms
- The store is singleton-like with a static stores map that keeps references to all instances
- Neither the KVStore trait nor the implementation provides cleanup/shutdown methods
- The store is used across multiple test files without clear cleanup patterns
Key issues:
- Memory leaks: Static stores map in InMemoryKvStore object retains all instances
- No cleanup API: Missing methods to clear data between tests
- Resource management: No way to release resources when store is no longer needed
🔗 Analysis chain
Consider adding lifecycle management.
The in-memory store lacks explicit cleanup mechanisms. This could lead to:
- Memory leaks in long-running tests
- Data bleeding between tests
- Resource management issues
Consider implementing a cleanup mechanism or documenting the lifecycle management approach.
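One possible shape for such a cleanup hook on the test side is sketched below; the truncate method does not exist today and is purely hypothetical, as is the trait name.
import org.scalatest.{BeforeAndAfterEach, Suite}

trait KVStoreCleanup extends BeforeAndAfterEach { this: Suite =>
  // The store under test; provided by the concrete spec.
  def kvStore: InMemoryKvStore

  override def afterEach(): Unit = {
    try {
      // Hypothetical API: clear per-dataset state so tests don't bleed into each other.
      Seq(Constants.MetadataDataset, Constants.TiledSummaryDataset).foreach(kvStore.truncate)
    } finally {
      super.afterEach()
    }
  }
}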
Let me gather more information about the KVStore implementation and its usage in tests.
Let me check one more aspect - the KVStore interface and its implementation details.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for potential test files using this module
rg -l "InMemoryKVStoreModule"
# Check if there's any existing cleanup mechanism
ast-grep --pattern 'class InMemoryKvStore {
$$$
(cleanup|close|dispose|shutdown)($$$) {
$$$
}
$$$
}'
Length of output: 213
Script:
#!/bin/bash
# Check the InMemoryKvStore implementation
rg -A 5 "class InMemoryKvStore"
# Look for test files that might use KVStore
fd -e scala test
rg -l "KVStore.*test"
# Check if there's any cleanup in test configurations
rg -A 5 "beforeEach|afterEach|beforeAll|afterAll.*KVStore"
Length of output: 9522
Script:
#!/bin/bash
# Check the KVStore interface definition
rg -A 10 "trait KVStore"
# Look for any cleanup/shutdown patterns in the actual implementation
rg -A 10 "object InMemoryKvStore"
# Check if there are any memory management related imports or patterns
rg "import.*concurrent.*Map|ConcurrentHashMap|WeakHashMap" spark/src/main/scala/ai/chronon/spark/utils/InMemoryKvStore.scala
Length of output: 3804
  def bulkPut(): mvc.Action[RawBuffer] =
    Action(parse.raw).async { request =>
      request.body.asBytes() match {
        case Some(bytes) =>
          decode[Array[PutRequest]](bytes.utf8String) match {
            case Right(putRequests) =>
              logger.info(s"Attempting a bulkPut with ${putRequests.length} items")
              val resultFuture = kvStore.multiPut(putRequests)
              resultFuture.map { responses =>
                if (responses.contains(false)) {
                  logger.warn("Some write failures encountered")
                }
                Ok("Success")
              }
            case Left(error) => Future.successful(BadRequest(error.getMessage))
          }
        case None => Future.successful(BadRequest("Empty body"))
      }
    }
Improve error handling and response status codes in the bulkPut method
Currently, the bulkPut method returns an Ok("Success") response even if some writes fail during kvStore.multiPut(putRequests). This could lead to clients believing that all data was successfully written when it wasn't.
Consider modifying the response to accurately reflect the success or failure of the operation. Here's a suggested change:
resultFuture.map { responses =>
if (responses.forall(identity)) {
Ok("Success")
} else {
logger.warn("Some write failures encountered")
- Ok("Success")
+ InternalServerError("Some writes failed during bulkPut")
}
}
Alternatively, you could return detailed information about which writes failed to provide more transparency to the client.
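A small sketch of that alternative, keeping the same multiPut call but reporting the positions of failed writes (the response shape is illustrative):
resultFuture.map { responses =>
  // Pair each result with its position in the incoming PutRequest array.
  val failedIndices = responses.zipWithIndex.collect { case (false, idx) => idx }
  if (failedIndices.isEmpty) Ok("Success")
  else InternalServerError(s"Writes failed for request indices: ${failedIndices.mkString(", ")}")
}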
  private def convertTileDriftSeriesInfoToTimeSeries(tileDriftSeries: TileDriftSeries,
                                                     metric: Metric): FeatureTimeSeries = {
    val lhsList = if (metric == NullMetric) {
      tileDriftSeries.nullRatioChangePercentSeries.asScala
    } else {
      val featureTsJson = if (granularity == Raw) {
        val featureTs = ComparedFeatureTimeSeries(name,
          generateMockRawTimeSeriesPoints(startTs, 100),
          generateMockRawTimeSeriesPoints(startTs, 100))
        featureTs.asJson.noSpaces
      // check if we have a numeric / categorical feature. If the percentile drift series has non-null doubles
      // then we have a numeric feature at hand
      val isNumeric =
        tileDriftSeries.percentileDriftSeries.asScala != null && tileDriftSeries.percentileDriftSeries.asScala
          .exists(_ != null)
      if (isNumeric) tileDriftSeries.percentileDriftSeries.asScala
      else tileDriftSeries.histogramDriftSeries.asScala
    }
    val points = lhsList.zip(tileDriftSeries.timestamps.asScala).map {
      case (v, ts) => TimeSeriesPoint(v, ts)
    }

    FeatureTimeSeries(tileDriftSeries.getKey.getColumn, points)
  }
Add null checks to prevent NullPointerException in data conversion methods
In convertTileDriftSeriesInfoToTimeSeries, accessing properties like tileDriftSeries.percentileDriftSeries and tileDriftSeries.histogramDriftSeries without null checks may lead to NullPointerException. Ensure that null checks are in place or use Option types to safely handle potential null values.
Apply this diff to handle null values safely:
- val isNumeric =
- tileDriftSeries.percentileDriftSeries.asScala != null && tileDriftSeries.percentileDriftSeries.asScala
- .exists(_ != null)
+ val percentileSeries = Option(tileDriftSeries.percentileDriftSeries).map(_.asScala).getOrElse(Seq.empty)
+ val histogramSeries = Option(tileDriftSeries.histogramDriftSeries).map(_.asScala).getOrElse(Seq.empty)
+ val isNumeric = percentileSeries.nonEmpty && percentileSeries.exists(_ != null)
    if (metric == NullMetric) {
      summarySeries.nullCount.asScala.zip(summarySeries.timestamps.asScala).map {
        case (nullCount, ts) => TimeSeriesPoint(0, ts, nullValue = Some(nullCount.intValue()))
      }
    } else {
      // check if we have a numeric / categorical feature. If the percentile drift series has non-null doubles
      // then we have a numeric feature at hand
      val isNumeric = summarySeries.percentiles.asScala != null && summarySeries.percentiles.asScala.exists(_ != null)
      if (isNumeric) {
        summarySeries.percentiles.asScala.zip(summarySeries.timestamps.asScala).flatMap {
          case (percentiles, ts) =>
            DriftStore.percentileLabels.zip(percentiles.asScala).map {
              case (l, value) => TimeSeriesPoint(value, ts, Some(l))
            }
        }
      } else {
        val featuresTs = FeatureTimeSeries(name, generateMockTimeSeriesPercentilePoints(startTs, endTs))
        featuresTs.asJson.noSpaces
        summarySeries.timestamps.asScala.zipWithIndex.flatMap {
          case (ts, idx) =>
            summarySeries.histogram.asScala.map {
              case (label, values) =>
                TimeSeriesPoint(values.get(idx).toDouble, ts, Some(label))
            }
        }
Ensure safe handling of potential null values in convertTileSummarySeriesToTimeSeries
Similar to the previous method, convertTileSummarySeriesToTimeSeries may encounter null values when accessing summarySeries.percentiles and summarySeries.histogram. Add null checks or use Option types to prevent NullPointerException.
Apply this diff to handle null values safely:
- val isNumeric = summarySeries.percentiles.asScala != null && summarySeries.percentiles.asScala.exists(_ != null)
+ val percentiles = Option(summarySeries.percentiles).map(_.asScala).getOrElse(Seq.empty)
+ val histogram = Option(summarySeries.histogram).map(_.asScala).getOrElse(Map.empty)
+ val isNumeric = percentiles.nonEmpty && percentiles.exists(_ != null)
Committable suggestion skipped: line range outside the PR's diff.
val isNumeric = if (feature % 2 == 0) true else false
val percentileDrifts = if (isNumeric) List.fill(timestamps.size())(JDouble.valueOf(0.12)).asJava else List.fill[JDouble](timestamps.size())(null).asJava
val histogramDrifts = if (isNumeric) List.fill[JDouble](timestamps.size())(null).asJava else List.fill(timestamps.size())(JDouble.valueOf(0.23)).asJava
val nullRationChangePercents = List.fill(timestamps.size())(JDouble.valueOf(0.25)).asJava
Typo in variable name nullRationChangePercents
There's a typographical error in the variable name nullRationChangePercents at line 275. It should be nullRatioChangePercents.
Apply this diff to correct the typo:
- val nullRationChangePercents = List.fill(timestamps.size())(JDouble.valueOf(0.25)).asJava
+ val nullRatioChangePercents = List.fill(timestamps.size())(JDouble.valueOf(0.25)).asJava
Also, update its usage at line 278:
tileDriftSeries.setPercentileDriftSeries(percentileDrifts)
- tileDriftSeries.setNullRatioChangePercentSeries(nullRationChangePercents)
+ tileDriftSeries.setNullRatioChangePercentSeries(nullRatioChangePercents)
tileDriftSeries.setHistogramDriftSeries(histogramDrifts)
val tileKey = new TileSeriesKey()
tileKey.setNodeName(join)
tileKey.setGroupName(groupBy)
tileKey.setNodeName(join)
Duplicate setNodeName call on tileKey
In the generateSummarySeries method, tileKey.setNodeName(join) is called twice (lines 286 and 288). The second call at line 288 is redundant or may be a mistake where a different property was intended to be set.
Apply this diff to remove the redundant call:
val tileKey = new TileSeriesKey()
tileKey.setNodeName(join)
tileKey.setGroupName(groupBy)
- tileKey.setNodeName(join)
tileKey.setColumn(featureName)
If a different property was intended, please correct it accordingly.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (7)
hub/app/controllers/JoinController.scala (1)
11-17: Enhance class documentation
While the basic documentation is present, consider adding more details about:
- The purpose and functionality of the Paginate trait
- Expected behavior when store operations fail
- Response format examples
/**
 * Controller for the Zipline Join entities
+ *
+ * This controller provides REST endpoints for managing Zipline Join entities.
+ * It implements pagination through the Paginate trait and uses MonitoringModelStore
+ * for data persistence.
+ *
+ * Response Format:
+ * - List: { "offset": number, "results": Join[] }
+ * - Get: Join object
+ *
+ * @see MonitoringModelStore for storage implementation
+ * @see Paginate for pagination functionality
 */
hub/app/model/Model.scala (3)
5-10: Consider reducing data duplication and adding field validation
The Join case class now mirrors several fields from the Model class (online, production, team). While this provides flexibility, it might lead to data inconsistency.
Consider:
- Adding field validation for the team field to ensure standardization
- Evaluating if these fields could be consolidated into a shared trait/class to avoid duplication
- Adding documentation to clarify the relationship between Join and Model fields
+/** Represents configuration for feature joins with metadata about deployment status and ownership */
 case class Join(name: String, joinFeatures: Seq[String], groupBys: Seq[GroupBy], online: Boolean, production: Boolean, team: String)
13-19: Enhance documentation format and clarity
While the workflow documentation is helpful, it could be improved for better readability and documentation generation.
Consider reformatting as:
-// 1.) metadataUpload: join -> map<gb, list<cols>>
-// 2.) fetchJoinConf + listColumns: join => list<column>
-// 3.) (columns, start, end) -> list<tileKey>
-
-// 4.) 1:n/fetchTile: tileKey -> TileSummaries
-// 5.) 1:n:n/compareTiles: TileSummaries, TileSummaries -> TileDrift
-// 6.) Map[column, Seq[tileDrift]] -> TimeSeriesController
+/** Data Processing Workflow:
+ * 1. Metadata Upload: Processes join configuration into grouped column mappings
+ *    Input: join -> Output: map<groupBy, List<columns>>
+ *
+ * 2. Configuration Fetch: Retrieves and lists columns for a join
+ *    Input: join -> Output: List<column>
+ *
+ * 3. Tile Key Generation: Generates tile keys for given columns and time range
+ *    Input: (columns, startTime, endTime) -> Output: List<tileKey>
+ *
+ * 4. Tile Fetching (1:n): Retrieves summaries for each tile
+ *    Input: tileKey -> Output: TileSummaries
+ *
+ * 5. Tile Comparison (1:n:n): Computes drift between tile summaries
+ *    Input: (TileSummaries, TileSummaries) -> Output: TileDrift
+ *
+ * 6. Time Series Generation: Prepares data for TimeSeriesController
+ *    Input: Map[column, Seq[tileDrift]] -> Output: TimeSeriesController data
+ */
64-67: LGTM! Consider adding documentation for response classes
The response class structure is consistent and the naming better reflects the functionality.
Consider adding documentation to clarify the usage of each response type:
+/** Response for listing available models with pagination support */
 case class ListModelResponse(offset: Int, items: Seq[Model])
+/** Response for searching joins with pagination support */
 case class SearchJoinResponse(offset: Int, items: Seq[Join])
+/** Response for listing available joins with pagination support */
 case class ListJoinResponse(offset: Int, items: Seq[Join])
hub/conf/routes (1)
17-17: Consider refactoring the complex parameter list.
While the endpoint correctly incorporates the join parameter, the large number of query parameters suggests potential for improvement.
Consider these alternatives:
- Use a POST endpoint with a request body for complex queries
- Group related parameters into meaningful path segments
Example refactor:
-GET /api/v1/join/:join/feature/:name/timeseries
+POST /api/v1/join/:join/feature/:name/timeseries
With request body:
{
  "timeRange": { "start": "timestamp", "end": "timestamp" },
  "metrics": { "type": "drift", "names": ["metric1", "metric2"], "granularity": "hour" },
  "options": { "offset": "value", "algorithm": "name" }
}
hub/app/store/MonitoringModelStore.scala (2)
34-37: LGTM! Good architectural improvement.
The transition from DynamoDB-specific naming to a more generic store interface improves modularity and allows for different storage implementations (like the InMemoryKV mentioned in PR objectives).
105-105: LGTM! Consider adding performance monitoring.
The transition to using kvStore is good. Since this is a critical data access point, consider adding performance monitoring metrics to track latency and success rates.
- kvStore.list(listRequest).flatMap { response =>
+ val startTime = System.currentTimeMillis()
+ kvStore.list(listRequest).flatMap { response =>
+   val latency = System.currentTimeMillis() - startTime
+   Metrics.recordTimer("kvstore.list.latency", latency)
+   Metrics.incrementCounter("kvstore.list.success")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (8)
hub/app/controllers/InMemKVStoreController.scala
(1 hunks)hub/app/controllers/JoinController.scala
(1 hunks)hub/app/model/Model.scala
(2 hunks)hub/app/store/MonitoringModelStore.scala
(4 hunks)hub/conf/routes
(1 hunks)hub/test/controllers/JoinControllerSpec.scala
(1 hunks)hub/test/controllers/ModelControllerSpec.scala
(3 hunks)hub/test/controllers/SearchControllerSpec.scala
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- hub/app/controllers/InMemKVStoreController.scala
- hub/test/controllers/JoinControllerSpec.scala
- hub/test/controllers/ModelControllerSpec.scala
- hub/test/controllers/SearchControllerSpec.scala
🔇 Additional comments (5)
hub/app/controllers/JoinController.scala (2)
24-40: Add error handling and consider performance optimization
The previous review comment about error handling and response structure is still valid. Additionally:
- Consider adding result size limits to prevent memory issues with large datasets
- The getJoins call might be expensive - consider caching or streaming results
Consider these additional improvements:
def list(offset: Option[Int], limit: Option[Int]): Action[AnyContent] =
Action { implicit request: Request[AnyContent] =>
+ // Add upper bound for result set size
+ val maxResultSize = 10000
val offsetValue = offset.getOrElse(defaultOffset)
- val limitValue = limit.map(l => math.min(l, maxLimit)).getOrElse(defaultLimit)
+ val limitValue = limit
+ .map(l => math.min(l, maxLimit))
+ .getOrElse(defaultLimit)
+
+ if (offsetValue > maxResultSize) {
+ BadRequest(s"Offset exceeds maximum allowed value of $maxResultSize")
+ } else
45-53: Optimize join retrieval and add input validation
The current implementation has several areas for improvement:
- Loading all joins to find one is inefficient
- No validation or sanitization of the name parameter
- Missing error handling for store operations
Consider this implementation:
def get(name: String): Action[AnyContent] = {
Action { implicit request: Request[AnyContent] =>
+ if (name == null || name.trim.isEmpty) {
+ BadRequest("Join name cannot be empty")
+ } else {
+ Try {
+ // Assuming monitoringStore can be updated to support direct lookup
+ monitoringStore.getJoinByName(name.trim)
+ } match {
+ case Success(Some(join)) => Ok(join.asJson.noSpaces)
+ case Success(None) => NotFound(s"Join: $name wasn't found")
+ case Failure(ex) =>
+ Logger.error(s"Failed to fetch join: $name", ex)
+ InternalServerError(s"Failed to fetch join: ${ex.getMessage}")
+ }
+ }
}
}
Let's verify the potential impact of loading all joins:
hub/conf/routes (2)
4-5: LGTM! Verify error handling for join endpoints.
The new join endpoints follow RESTful conventions and properly support pagination. They align well with the PR objective of transitioning to configured joins.
Let's verify the error handling implementation:
✅ Verification successful
Error handling in join endpoints is properly implemented
The JoinController implements comprehensive error handling:
GET /api/v1/joins
:- Returns
BadRequest
for negative offset/limit values - Properly handles pagination with default values
- Returns
GET /api/v1/join/:name
:- Returns
NotFound
with descriptive message when join doesn't exist - Successfully returns join data for valid names
- Returns
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check error handling in JoinController
# Expect: Error handling for invalid join names and proper validation
ast-grep --pattern 'def get(name: String) = Action { implicit request =>
$$$
case None => NotFound($$$)
$$$
}'
Length of output: 118
Script:
#!/bin/bash
# Let's try a broader search for the JoinController implementation
rg -l "class JoinController"
# Then search for error handling patterns in Play Framework
rg "NotFound|BadRequest|InternalServerError" --type scala
Length of output: 3683
Script:
#!/bin/bash
# Let's examine the full implementation of JoinController to understand the complete error handling
rg -A 10 "class JoinController" hub/app/controllers/JoinController.scala
Length of output: 589
Script:
#!/bin/bash
# Let's get the complete file content to see all error handling patterns
cat hub/app/controllers/JoinController.scala
Length of output: 1861
8-9: Clarify the status and impact of the TBD model endpoint.
While commenting out this endpoint aligns with removing mock data endpoints, the TBD status needs clarification as it might block PR completion.
Let's check for any remaining dependencies on this endpoint:
✅ Verification successful
The commented model timeseries endpoint can be safely removed
The commented endpoint /api/v1/model/:id/timeseries
is not implemented in the codebase. The TimeSeriesController only has implementations for join and feature-level timeseries endpoints, and there are no test cases for model-level timeseries. The TBD comment can be safely removed along with the commented route as this functionality is not part of the current implementation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to the model timeseries endpoint
# Expect: No active code should be depending on this endpoint
# Search for any remaining references to fetchModel
rg "fetchModel|model/:id/timeseries" --type ts --type js --type scala
# Check for any TODO/FIXME comments related to this
rg "TODO.*model.*timeseries|FIXME.*model.*timeseries"
Length of output: 127
Script:
#!/bin/bash
# Let's broaden our search to find any related implementations or references
# Check for TimeSeriesController implementation
fd -e scala "TimeSeriesController"
# Search for any references to timeseries in routes file
cat hub/conf/routes
# Look for any model-related endpoints or implementations
rg "model.*endpoint|model.*api" --type scala
Length of output: 4254
Script:
#!/bin/bash
# Let's check the TimeSeriesController implementation to understand if the commented endpoint has any existing code
# Check TimeSeriesController implementation
cat hub/app/controllers/TimeSeriesController.scala
# Check if there are any references to model timeseries in tests
cat hub/test/controllers/TimeSeriesControllerSpec.scala
Length of output: 28238
hub/app/store/MonitoringModelStore.scala (1)
77-91: Add error handling and optimize performance.
-
The previous review comment about error handling is still valid.
-
Consider collecting the results in parallel for better performance when dealing with large datasets.
def getJoins: Seq[Join] = {
- configRegistryCache("default").joins.map { thriftJoin =>
+ configRegistryCache("default").joins.par.map { thriftJoin =>
val groupBys = thriftJoin.joinParts.asScala.map { part =>
GroupBy(part.groupBy.metaData.name, part.groupBy.valueColumns)
}
val outputColumns = thriftJoin.outputColumnsByGroup.getOrElse("derivations", Array.empty)
Join(thriftJoin.metaData.name,
outputColumns,
groupBys,
thriftJoin.metaData.online,
thriftJoin.metaData.production,
thriftJoin.metaData.team)
- }
+ }.seq
}
# Temporary in-memory kv store endpoint
POST /api/v1/dataset/data controllers.InMemKVStoreController.bulkPut()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Memory management and data persistence concerns identified
The implementation shows several potential issues:
- The in-memory store uses unbounded
ConcurrentHashMap
andBuffer
collections without size limits - No eviction policy is implemented for old data
- Play's memory buffer is capped at 512K per request, but there's no limit on total store size
- No persistence mechanism for data recovery
Key locations requiring attention:
spark/src/main/scala/ai/chronon/spark/utils/InMemoryKvStore.scala
: Implement size limits and eviction policyhub/app/module/InMemoryKVStoreModule.scala
: Add configuration for memory limitsapplication.conf
: Consider increasingmaxMemoryBuffer
if needed for bulk operations
Recommendations:
- Add configurable size limits for the ConcurrentHashMap
- Implement LRU or time-based eviction for the VersionedData buffers (see the sketch after this list)
- Add monitoring for memory usage
- Consider implementing persistence to disk for recovery
- Add metrics for store size and operation latencies
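To make the eviction recommendation concrete, here is an illustrative size-bounded LRU wrapper that could back a per-dataset map; it is not the current InMemoryKvStore implementation, and the capacity value is a placeholder to be tuned.
import java.util.{Collections, LinkedHashMap, Map => JMap}

object BoundedMaps {
  // Access-ordered LinkedHashMap evicts the least-recently-used entry once maxEntries is exceeded.
  def boundedLru[K, V](maxEntries: Int): JMap[K, V] =
    Collections.synchronizedMap(
      new LinkedHashMap[K, V](16, 0.75f, true) {
        override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean = size() > maxEntries
      }
    )
}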
🔗 Analysis chain
Verify memory management and data persistence strategy.
While this temporary solution addresses the DynamoDB ingestion issues, we should ensure proper memory management and data persistence guarantees.
Let's check the implementation details:
Consider:
- Implementing memory limits for the in-memory store
- Adding monitoring for memory usage
- Implementing data eviction policies
- Planning for persistence strategy if the temporary solution becomes long-term
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify memory management and bulk operation handling
# Expect: Memory limits and error handling for large requests
# Check for memory management code
ast-grep --pattern 'class InMemKVStoreController $$ {
$$$
def bulkPut() = Action(parse.raw) { implicit request =>
$$$
}
}'
# Look for configuration related to memory limits
rg "memory|heap|buffer" --type conf
Length of output: 201
Script:
#!/bin/bash
# Let's try different approaches to find the implementation and configuration
# Find the controller file
fd "InMemKVStoreController"
# Search for memory-related configurations in all files
rg -t scala -t conf -t properties -t yaml "memory|heap|buffer"
# Look for the controller implementation
ast-grep --pattern 'class InMemKVStoreController'
# Search for any data store or cache implementation related to this controller
rg -t scala "class.*Store|class.*Cache"
Length of output: 8655
Script:
#!/bin/bash
# Let's examine the InMemoryKVStore implementation and module configuration
# Check InMemoryKvStore implementation
cat spark/src/main/scala/ai/chronon/spark/utils/InMemoryKvStore.scala
# Check the module configuration
cat hub/app/module/InMemoryKVStoreModule.scala
# Look for any size/capacity configurations in application.conf
fd "application.conf" --type f --exec cat {}
Length of output: 8099
      val outputColumns = thriftJoin.outputColumnsByGroup.getOrElse("derivations", Array.empty)
      val join = Join(thriftJoin.metaData.name,
                      outputColumns,
                      groupBys,
                      thriftJoin.metaData.online,
                      thriftJoin.metaData.production,
                      thriftJoin.metaData.team)
Add null checks for metadata fields.
While the output column handling is improved with getOrElse, the metadata fields (online, production, team) should be validated to prevent potential NullPointerException.
- val join = Join(thriftJoin.metaData.name,
- outputColumns,
- groupBys,
- thriftJoin.metaData.online,
- thriftJoin.metaData.production,
- thriftJoin.metaData.team)
+ val metadata = thriftJoin.metaData
+ if (metadata == null) {
+ logger.error(s"Missing metadata for join in model ${m.metaData.name}")
+ throw new IllegalStateException("Missing join metadata")
+ }
+ val join = Join(
+ metadata.name,
+ outputColumns,
+ groupBys,
+ Option(metadata.online).getOrElse(false),
+ Option(metadata.production).getOrElse(false),
+ Option(metadata.team).getOrElse("")
+ )
docker-init/demo/load_summaries.sh
Outdated
  --driver-class-path "/opt/spark/jars/*:/app/cli/*" \
  --conf "spark.driver.host=localhost" \
  --conf "spark.driver.bindAddress=0.0.0.0" \
  --class ai.chronon.spark.scripts.ObservabilityDemo \
made some changes to this to make logs appear.
I have been seeing logs in my runs:
$ ./docker-init/demo/load_summaries.sh
...
24/11/26 23:05:23 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, localhost, 39793, None)
---- running query ----
CREATE DATABASE IF NOT EXISTS observability_demo
---- call path ----
TableUtils.sql(TableUtils.scala:368)
TableUtils.createDatabase(TableUtils.scala:143)
scripts.ObservabilityDemo$.main(ObservabilityDemo.scala:67)
scripts.ObservabilityDemo.main(ObservabilityDemo.scala)
---- end ----
generating anomalous data for fraud
Generated 0/700000 rows of data.
Generated 100000/700000 rows of data.
...
Finished computing range 2/2
input: observability_demo.risk_user_transactions_txn_join_summary (2023-01-31 -> 2023-02-27)
output: observability_demo.risk_user_transactions_txn_join_summary_packed (2023-01-31 -> 2023-02-27)
Done uploading summaries! 🥳
0da87a9
to
45532b7
Compare
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (19)
docker-init/demo/load_summaries.sh (1)
8-12: Consider parameterizing hardcoded paths
The script contains hardcoded paths which could make it less portable across different environments:
/opt/spark/jars/*
/app/cli/*
/app/cli/spark.jar
Consider making these configurable through environment variables:
+# Default paths (can be overridden through environment variables)
+SPARK_JARS_PATH=${SPARK_JARS_PATH:-"/opt/spark/jars/*"}
+APP_CLI_PATH=${APP_CLI_PATH:-"/app/cli"}
+SPARK_JAR_PATH=${SPARK_JAR_PATH:-"$APP_CLI_PATH/spark.jar"}
+
 docker-compose -f docker-init/compose.yaml exec app /opt/spark/bin/spark-submit \
   --master "local[*]" \
   --driver-memory 8g \
   --conf "spark.driver.maxResultSize=6g" \
   --conf "spark.driver.memory=8g" \
-  --driver-class-path "/opt/spark/jars/*:/app/cli/*" \
+  --driver-class-path "$SPARK_JARS_PATH:$APP_CLI_PATH/*" \
   --conf "spark.driver.host=localhost" \
   --conf "spark.driver.bindAddress=0.0.0.0" \
   --class ai.chronon.spark.scripts.ObservabilityDemo \
-  /app/cli/spark.jar
+  "$SPARK_JAR_PATH"
hub/conf/application.conf (1)
31-33: Consider data persistence and recovery strategy with in-memory storage
The shift from DynamoDB to in-memory storage resolves the ingestion performance issues but introduces new considerations:
- Data persistence across service restarts
- Recovery mechanism in case of crashes
- Memory management for growing datasets
Consider implementing:
- Periodic snapshots to persistent storage
- Memory monitoring and alerting
- Data retention/cleanup policies
hub/app/controllers/SearchController.scala (2)
20-24: Enhance API documentation with response format details.
While the documentation clearly explains the parameters, it would be helpful to include:
- Response format and example
- Any specific limitations on join name search
- Pagination response structure
Consider adding:
 /**
  * Powers the /api/v1/search endpoint. Returns a list of joins
  * @param term - Search term to search for (currently we only support searching join names)
  * @param offset - For pagination. We skip over offset entries before returning results
  * @param limit - Number of elements to return
+ * @return JSON response containing:
+ *   - offset: Current offset
+ *   - joins: Array of matching joins, each containing:
+ *     - name: Join name
+ *     - [other join properties...]
  */
43-46
: Consider improving search performance and accuracy. The current implementation has several potential limitations:
- Case-sensitive string matching might miss relevant results
- Full scan of all joins for each search request
- No caching strategy visible
Consider these improvements:
private def searchRegistry(term: String): Seq[Join] = { + // Convert to lowercase for case-insensitive search + val searchTerm = term.toLowerCase val joins = monitoringStore.getJoins - joins.filter(j => j.name.contains(term)) + joins.filter(j => j.name.toLowerCase.contains(searchTerm)) }Additionally, consider:
- Adding caching for frequently accessed joins
- Implementing a more sophisticated search algorithm for large datasets
- Adding fuzzy matching for better search results
For better scalability, consider:
- Implementing an index-based search solution
- Adding a caching layer with TTL
- Using a dedicated search service (e.g., Elasticsearch) for large datasets
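As a rough illustration of the in-process options above (case-insensitive matching plus a small TTL cache), using only the standard library; `Join` here is a stand-in for the hub's model class:

```scala
import scala.collection.concurrent.TrieMap

// Stand-in model; the real Join carries more fields.
case class Join(name: String)

class CachedJoinSearch(loadJoins: () => Seq[Join], ttlMillis: Long = 60000L) {
  private case class Entry(joins: Seq[Join], loadedAt: Long)
  private val cache = TrieMap.empty[String, Entry]

  // Case-insensitive substring match over a cached copy of the join registry.
  def search(term: String): Seq[Join] = {
    val lowered = term.toLowerCase
    allJoins().filter(_.name.toLowerCase.contains(lowered))
  }

  private def allJoins(): Seq[Join] = {
    val now = System.currentTimeMillis()
    cache.get("all") match {
      case Some(Entry(joins, loadedAt)) if now - loadedAt < ttlMillis => joins
      case _ =>
        val fresh = loadJoins()
        cache.put("all", Entry(fresh, now))
        fresh
    }
  }
}
```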
hub/app/controllers/JoinController.scala (2)
11-13
: Enhance controller documentation. Consider adding more detailed documentation including:
- Available endpoints and their paths
- Response formats
- Pagination details
- Error scenarios
/** * Controller for the Zipline Join entities + * + * Endpoints: + * - GET /api/v1/joins?offset={offset}&limit={limit} - List all joins with pagination + * - GET /api/v1/join/{name} - Get a specific join by name + * + * Responses: + * - 200 OK: Successfully retrieved joins + * - 400 Bad Request: Invalid pagination parameters + * - 404 Not Found: Join not found (get endpoint) + * + * Pagination: + * - Default offset: 0 + * - Default limit: defined by defaultLimit + * - Maximum limit: defined by maxLimit */
47-47
: Consider optimizing join retrieval. The current implementation fetches all joins from the store and then filters in memory. This could be inefficient with a large number of joins.
Consider pushing the filtering down to the store layer:
-val maybeJoin = monitoringStore.getJoins.find(j => j.name.equalsIgnoreCase(name)) +val maybeJoin = monitoringStore.getJoinByName(name) // New method to implement in storehub/app/model/Model.scala (2)
5-10
: Add documentation for new Join fields and consider status handling. While the new fields align well with the Model class structure, please consider:
- Adding ScalaDoc comments to document the purpose and valid values for each new field
- Consider using an enumeration or sealed trait for team instead of String to ensure valid values
- Consider if the boolean flags (online, production) might better be represented as a single status field for future extensibility
+/** Represents a join configuration with its features, groupings and operational status + * @param name Unique identifier for the join + * @param joinFeatures List of features used in the join + * @param groupBys Grouping configurations + * @param online Whether the join is available for online serving + * @param production Whether the join is used in production + * @param team Team responsible for maintaining this join + */ case class Join(name: String, joinFeatures: Seq[String], groupBys: Seq[GroupBy], online: Boolean, production: Boolean, team: String)
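If the two booleans were ever folded into a single status field, one possible shape is sketched below. This is purely illustrative and assumes the flags describe an ordered lifecycle rather than independent switches:

```scala
// Illustrative alternative to the online/production booleans; not the current hub model.
sealed trait JoinStatus
object JoinStatus {
  case object Offline    extends JoinStatus
  case object Online     extends JoinStatus
  case object Production extends JoinStatus
}

case class JoinInfo(name: String, joinFeatures: Seq[String], team: String, status: JoinStatus)
```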
57-60
: Standardize response class naming and add pagination constraints. There's an inconsistency in the naming convention between "Search" and "List" prefixes. Additionally, the pagination implementation needs constraints:
- Standardize on either "Search" or "List" prefix for consistency
- Consider adding a maximum page size to prevent performance issues
Consider implementing a base trait for pagination:
trait PaginatedResponse[T] { def offset: Int def items: Seq[T] def maxPageSize: Int = 100 // Default max page size require(items.size <= maxPageSize, s"Page size cannot exceed $maxPageSize items") } case class ListModelResponse(offset: Int, items: Seq[Model]) extends PaginatedResponse[Model] case class ListJoinResponse(offset: Int, items: Seq[Join]) extends PaginatedResponse[Join]online/src/main/scala/ai/chronon/online/HTTPKVStore.scala (2)
29-31
: Add success logging for monitoring. Consider adding debug/info level logging for successful operations to aid in monitoring and debugging.
response.code match { - case StatusCode.Ok => Seq(true) + case StatusCode.Ok => + logger.debug(s"Successfully put ${putRequests.length} items") + Seq(true)
11-11
: Add performance monitoring and documentation. Given that this implementation replaces DynamoDB to address performance issues:
- Add metrics collection for request latency and success rates
- Document the expected performance characteristics
- Add retry/backoff strategy documentation
Would you like assistance in implementing metrics collection using your metrics framework?
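As a starting point, a hedged sketch of in-process counters around the put path, using only the JDK rather than any particular metrics framework (all names here are illustrative):

```scala
import java.util.concurrent.atomic.{AtomicLong, LongAdder}

// Minimal counters; a real deployment would publish these through the project's metrics framework.
class KVStoreMetrics {
  val putSuccesses = new LongAdder
  val putFailures = new LongAdder
  val lastPutLatencyMs = new AtomicLong(0L)

  // Time a block and record success/failure; rethrows so callers still see the error.
  def timed[T](block: => T): T = {
    val start = System.currentTimeMillis()
    try { val result = block; putSuccesses.increment(); result }
    catch { case e: Throwable => putFailures.increment(); throw e }
    finally lastPutLatencyMs.set(System.currentTimeMillis() - start)
  }
}

// Usage sketch inside an HTTP-backed KV store wrapper:
// metrics.timed { backend.send(request) }
```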
hub/conf/routes (1)
17-17
: Consider optimizing the API design for feature timeseries. While the addition of the join parameter aligns with the PR objectives, the endpoint has numerous query parameters which could be better handled using a request body. This would:
- Improve readability and maintainability
- Make it easier to validate input
- Allow for future parameter additions without URL length concerns
Consider refactoring to use POST with a request body:
-GET /api/v1/join/:join/feature/:name/timeseries controllers.TimeSeriesController.fetchFeature(join: String, name: String, startTs: Long, endTs: Long, metricType: String, metrics: String, granularity: String, offset: Option[String], algorithm: Option[String]) +POST /api/v1/join/:join/feature/:name/timeseries controllers.TimeSeriesController.fetchFeature(join: String, name: String)With a request body schema:
{ "startTs": number, "endTs": number, "metricType": string, "metrics": string, "granularity": string, "offset": string?, "algorithm": string? }hub/app/store/MonitoringModelStore.scala (1)
77-91
: Consider refactoring to reduce code duplication. The join processing logic is duplicated between
getModels
andgetJoins
. Consider extracting the common logic into a shared method.+ private def processJoin(thriftJoin: api.Join): Join = { + val metadata = Option(thriftJoin.metaData).getOrElse( + throw new IllegalStateException("Missing join metadata") + ) + + val groupBys = thriftJoin.joinParts.asScala.map { part => + if (part.groupBy == null || part.groupBy.metaData == null) { + throw new IllegalStateException(s"Invalid join configuration: missing groupBy metadata in join ${metadata.name}") + } + GroupBy(part.groupBy.metaData.name, part.groupBy.valueColumns) + } + + val outputColumns = thriftJoin.outputColumnsByGroup.getOrElse("derivations", Array.empty) + Join( + metadata.name, + outputColumns, + groupBys, + Option(metadata.online).getOrElse(false), + Option(metadata.production).getOrElse(false), + Option(metadata.team).getOrElse("") + ) + }hub/app/controllers/TimeSeriesController.scala (4)
53-62
: Consider API versioning for breaking changesThe addition of the
join
parameter tofetchFeature
is a breaking change. Consider implementing API versioning (e.g., via URL path/v2/feature
or Accept header) to maintain backward compatibility.
90-91
: Extract error handling to a reusable functionThe error handling pattern for offset and algorithm validation is repeated across methods. Consider extracting it to a reusable function.
+ private def validateDriftParams(offset: Option[String], algorithm: Option[String]): Either[Result, (Duration, DriftMetric)] = { + (parseOffset(offset), parseAlgorithm(algorithm)) match { + case (None, _) => Left(BadRequest(s"Unable to parse offset - $offset")) + case (_, None) => Left(BadRequest("Invalid drift algorithm. Expect JSD, PSI or Hellinger")) + case (Some(o), Some(a)) => Right((o, a)) + } + }
96-112
: Consider using for-comprehension for cleaner error handlingThe nested match cases and Future transformations can be simplified using for-comprehension.
- maybeDriftSeries match { - case Failure(exception) => - Future.successful(InternalServerError(s"Error computing join drift - ${exception.getMessage}")) - case Success(driftSeriesFuture) => - driftSeriesFuture.map { driftSeries => + val result = for { + driftSeriesFuture <- Future.fromTry(maybeDriftSeries) + driftSeries <- driftSeriesFuture + } yield { val grpToDriftSeriesList = driftSeries.groupBy(_.key.groupName) // ... rest of the logic ... Ok(tsData.asJson.noSpaces) - } } + result.recover { + case e => InternalServerError(s"Error computing join drift - ${e.getMessage}") + }
26-28
: Consider splitting controller responsibilitiesThis controller handles multiple distinct responsibilities (join operations, feature operations, time series conversions). Consider splitting it into separate controllers (e.g.,
JoinTimeSeriesController
,FeatureTimeSeriesController
) to improve maintainability and follow the Single Responsibility Principle.hub/test/controllers/TimeSeriesControllerSpec.scala (3)
Line range hint
87-107
: Enhance test coverage with more specific assertions. While the test covers the basic success case, consider adding more specific assertions:
- Verify the exact number of items returned
- Assert the specific values in the response points
- Validate the structure of each group
response.items.foreach { grpByTs => grpByTs.items.isEmpty mustBe false grpByTs.items.foreach(featureTs => featureTs.points.length mustBe expectedLength) + // Assert specific values + grpByTs.items.foreach { featureTs => + featureTs.points.foreach { point => + point.percentileDrift must (be >= 0.0 and be <= 1.0) + point.nullRatioChange must (be >= 0.0 and be <= 1.0) + } + } }
217-246
: Add validation for mock categories in categorical feature tests. The test should validate that the mock categories used in the response match the expected categories.
response.current.zip(response.baseline).foreach { case (current, baseline) => (current.ts - baseline.ts) mustBe offset.toMillis current.label.isEmpty mustBe false baseline.label.isEmpty mustBe false + mockCategories must contain(current.label.get) + mockCategories must contain(baseline.label.get) }
271-275
: Extract magic numbers into named constants. The test data generation uses several magic numbers that would be clearer as named constants:
- 0.12 for percentile drifts
- 0.23 for histogram drifts
- 0.25 for null ratio changes
+ private val MockPercentileDrift = 0.12 + private val MockHistogramDrift = 0.23 + private val MockNullRatioChange = 0.25 + private val MockHistogramCount = 1L private def generateDriftSeries(...) = { ... - val percentileDrifts = if (isNumeric) List.fill(timestamps.size())(JDouble.valueOf(0.12)).asJava else ... - val histogramDrifts = if (isNumeric) List.fill[JDouble](timestamps.size())(null).asJava else List.fill(timestamps.size())(JDouble.valueOf(0.23)).asJava - val nullRationChangePercents = List.fill(timestamps.size())(JDouble.valueOf(0.25)).asJava + val percentileDrifts = if (isNumeric) List.fill(timestamps.size())(JDouble.valueOf(MockPercentileDrift)).asJava else ... + val histogramDrifts = if (isNumeric) List.fill[JDouble](timestamps.size())(null).asJava else List.fill(timestamps.size())(JDouble.valueOf(MockHistogramDrift)).asJava + val nullRationChangePercents = List.fill(timestamps.size())(JDouble.valueOf(MockNullRatioChange)).asJavaAlso applies to: 302-302, 308-308
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (35)
.github/workflows/test_scala_no_spark.yaml
(1 hunks)build.sbt
(3 hunks)docker-init/Dockerfile
(1 hunks)docker-init/demo/Dockerfile
(0 hunks)docker-init/demo/README.md
(1 hunks)docker-init/demo/build.sh
(0 hunks)docker-init/demo/load_summaries.sh
(1 hunks)docker-init/demo/run.sh
(0 hunks)docker-init/start.sh
(0 hunks)hub/app/controllers/InMemKVStoreController.scala
(1 hunks)hub/app/controllers/JoinController.scala
(1 hunks)hub/app/controllers/ModelController.scala
(1 hunks)hub/app/controllers/Paginate.scala
(1 hunks)hub/app/controllers/SearchController.scala
(2 hunks)hub/app/controllers/TimeSeriesController.scala
(3 hunks)hub/app/model/Model.scala
(2 hunks)hub/app/module/DriftStoreModule.scala
(1 hunks)hub/app/module/DynamoDBModule.scala
(0 hunks)hub/app/module/InMemoryKVStoreModule.scala
(1 hunks)hub/app/module/ModelStoreModule.scala
(1 hunks)hub/app/store/MonitoringModelStore.scala
(4 hunks)hub/conf/application.conf
(1 hunks)hub/conf/routes
(1 hunks)hub/test/controllers/JoinControllerSpec.scala
(1 hunks)hub/test/controllers/ModelControllerSpec.scala
(3 hunks)hub/test/controllers/SearchControllerSpec.scala
(4 hunks)hub/test/controllers/TimeSeriesControllerSpec.scala
(4 hunks)hub/test/store/MonitoringModelStoreTest.scala
(2 hunks)online/src/main/scala/ai/chronon/online/HTTPKVStore.scala
(1 hunks)online/src/main/scala/ai/chronon/online/MetadataEndPoint.scala
(1 hunks)online/src/main/scala/ai/chronon/online/stats/DriftStore.scala
(2 hunks)online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/Driver.scala
(2 hunks)spark/src/main/scala/ai/chronon/spark/TableUtils.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemo.scala
(3 hunks)
💤 Files with no reviewable changes (5)
- docker-init/demo/Dockerfile
- docker-init/demo/build.sh
- docker-init/demo/run.sh
- docker-init/start.sh
- hub/app/module/DynamoDBModule.scala
🚧 Files skipped from review as they are similar to previous changes (19)
- build.sbt
- docker-init/Dockerfile
- docker-init/demo/README.md
- hub/app/controllers/InMemKVStoreController.scala
- hub/app/controllers/ModelController.scala
- hub/app/controllers/Paginate.scala
- hub/app/module/DriftStoreModule.scala
- hub/app/module/InMemoryKVStoreModule.scala
- hub/app/module/ModelStoreModule.scala
- hub/test/controllers/JoinControllerSpec.scala
- hub/test/controllers/ModelControllerSpec.scala
- hub/test/controllers/SearchControllerSpec.scala
- hub/test/store/MonitoringModelStoreTest.scala
- online/src/main/scala/ai/chronon/online/MetadataEndPoint.scala
- online/src/main/scala/ai/chronon/online/stats/DriftStore.scala
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
- spark/src/main/scala/ai/chronon/spark/Driver.scala
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala
- spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemo.scala
🧰 Additional context used
📓 Learnings (1)
hub/test/controllers/TimeSeriesControllerSpec.scala (1)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#22
File: hub/test/controllers/TimeSeriesControllerSpec.scala:40-49
Timestamp: 2024-11-12T09:38:33.532Z
Learning: When testing with mocked data, it's acceptable to not add detailed assertions on the data contents.
🪛 Shellcheck (0.10.0)
docker-init/demo/load_summaries.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
🔇 Additional comments (24)
docker-init/demo/load_summaries.sh (2)
1-2
: Add shell script essentials
The script is missing crucial shell script components like shebang directive and documentation.
🧰 Tools
🪛 Shellcheck (0.10.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
3-12
: 🛠️ Refactor suggestion
Add error handling and environment validation
The script could benefit from some robustness improvements:
- Validate that required containers are running
- Check if required files exist
- Add error handling for the spark-submit command
Here's a suggested improvement:
+#!/bin/bash
+
+# Script: load_summaries.sh
+# Purpose: Initiates ObservabilityDemo Spark job in the app container
+
+set -e
+
+# Validate environment
+if ! docker-compose -f docker-init/compose.yaml ps | grep -q "app.*running"; then
+ echo "Error: App container is not running. Please start containers first."
+ exit 1
+fi
+
+if [ ! -f /app/cli/spark.jar ]; then
+ echo "Error: Spark JAR not found at /app/cli/spark.jar"
+ exit 1
+fi
+
+# Execute spark-submit
docker-compose -f docker-init/compose.yaml exec app /opt/spark/bin/spark-submit \
--master "local[*]" \
--driver-memory 8g \
--conf "spark.driver.maxResultSize=6g" \
--conf "spark.driver.memory=8g" \
--driver-class-path "/opt/spark/jars/*:/app/cli/*" \
--conf "spark.driver.host=localhost" \
--conf "spark.driver.bindAddress=0.0.0.0" \
--class ai.chronon.spark.scripts.ObservabilityDemo \
- /app/cli/spark.jar
+ /app/cli/spark.jar || {
+ echo "Error: Spark job failed"
+ exit 1
+}
Likely invalid or redundant comment.
hub/conf/application.conf (1)
31-33
: Verify memory configuration for in-memory storage
Since the application now uses in-memory storage, we should verify if the current memory configuration is sufficient.
✅ Verification successful
Let me gather more information about the in-memory storage implementation to better understand its requirements.
Based on the search results, I can now provide a final response regarding the memory configuration for in-memory storage:
Memory configuration appears sufficient with defined limits
The application has the following memory-related configurations in place:
- HTTP parser memory limits:
maxMemoryBuffer=512K
andmaxDiskBuffer=10M
- JVM memory settings in Docker:
-Xms1g -Xmx1g
- The InMemoryKVStore implementation is a simple in-memory store without explicit memory limits, but it's bounded by:
- The JVM heap limits above
- Natural limits from the data stored (only two datasets are created: MetadataDataset and TiledSummaryDataset)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check memory-related configurations in the codebase
# Search for memory configurations in application.conf and related files
echo "Checking for memory configurations..."
rg -i "memory|heap|ram" --type conf
# Look for JVM memory settings in deployment files
echo -e "\nChecking for JVM memory settings..."
fd -e yaml -e yml -e properties | xargs rg -i "xmx|xms|java.*opts"
# Check for any memory management related code
echo -e "\nChecking for memory management implementations..."
ast-grep --pattern 'class $_ implements MemoryManager { $$$ }'
Length of output: 623
Script:
#!/bin/bash
# Check the implementation of InMemoryKVStoreModule and related classes
echo "Checking InMemoryKVStore implementation..."
rg -A 10 "class InMemoryKVStore"
# Look for cache or storage size configurations
echo -e "\nChecking for cache/storage size configurations..."
rg -i "cache\.size|store\.size|capacity|buffer"
# Check if there are any existing memory-related tests
echo -e "\nChecking for memory-related tests..."
fd -e test -e spec | xargs rg -i "memory|cache|store" -A 5
Length of output: 111843
.github/workflows/test_scala_no_spark.yaml (2)
63-64
: LGTM!
The formatting change improves readability by adding a consistent newline between steps.
65-68
: 🛠️ Refactor suggestion
Remove redundant SBT_OPTS export.
The SBT_OPTS
environment variable is already set globally in the "Set SBT resources" step. The local export in this step is redundant and could lead to maintenance issues if memory settings need to be adjusted in the future.
Apply this diff to remove the redundant export:
- name: Run hub tests
run: |
- export SBT_OPTS="-Xmx8G -Xms2G"
sbt "++ 2.12.18 hub/test"
Additionally, let's verify the hub module's test dependencies are properly configured:
✅ Verification successful
Remove redundant SBT_OPTS export
The verification confirms that SBT_OPTS
is indeed set globally via GITHUB_ENV
in the "Set SBT resources" step with the exact same values (-Xmx8G -Xms2G
). The local export in the "Run hub tests" step is redundant and should be removed as suggested.
- Remove the
export SBT_OPTS="-Xmx8G -Xms2G"
line from the "Run hub tests" step - The hub module's dependencies look properly configured in build.sbt with necessary test dependencies and no circular dependencies detected
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check hub module's test dependencies in build.sbt
# Check if hub module exists and has test dependencies
rg -A 10 'lazy val hub' build.sbt
# Check for potential circular dependencies
rg 'dependsOn.*hub' build.sbt
Length of output: 512
Script:
#!/bin/bash
# Check for SBT_OPTS setting in the workflow file
rg -B 5 -A 5 'SBT_OPTS' .github/workflows/test_scala_no_spark.yaml
Length of output: 475
hub/app/controllers/SearchController.scala (1)
5-15
: LGTM! Verify MonitoringModelStore implementation.
The transition from DynamoDBMonitoringStore to MonitoringModelStore improves abstraction and aligns with addressing the DynamoDB performance issues mentioned in the PR objectives.
Let's verify the MonitoringModelStore implementation:
✅ Verification successful
MonitoringModelStore implementation verified and properly used
The transition from DynamoDBMonitoringStore to MonitoringModelStore is well-implemented:
- MonitoringModelStore is properly defined in
hub/app/store/MonitoringModelStore.scala
and initialized with API implementation - It's correctly bound in
ModelStoreModule
for dependency injection - Used consistently across multiple controllers (Search, Join, Model)
- Has comprehensive test coverage in
MonitoringModelStoreTest
- The implementation maintains the required functionality for both model and join operations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify MonitoringModelStore implementation and its usage
# Expected: Find the implementation and other potential usages
# Find MonitoringModelStore implementation
ast-grep --pattern 'class MonitoringModelStore {
$$$
}'
# Find other controllers using MonitoringModelStore
ast-grep --pattern 'class $_ @Inject() ($$$, monitoringStore: MonitoringModelStore) {
$$$
}'
Length of output: 156
Script:
#!/bin/bash
# Let's try a different approach to find MonitoringModelStore
# Find the file containing MonitoringModelStore definition
fd "MonitoringModelStore.scala"
# Search for MonitoringModelStore usage with ripgrep for better context
rg "MonitoringModelStore" -A 3
# Look for trait or abstract class definition as well
ast-grep --pattern 'trait MonitoringModelStore {
$$$
}'
# Search for any references to monitoring store in the codebase
rg "monitoring.*store" -i
Length of output: 9116
hub/app/controllers/JoinController.scala (1)
24-40
:
Add error handling and improve response structure
The current implementation needs improvement in error handling and response structure.
Additionally, consider adding request logging for better observability:
def list(offset: Option[Int], limit: Option[Int]): Action[AnyContent] =
Action { implicit request: Request[AnyContent] =>
+ Logger.info(s"Received request to list joins with offset=$offset, limit=$limit")
hub/app/model/Model.scala (1)
Line range hint 5-60
: Verify usage of renamed response types and new Join fields.
Let's ensure all usages of the renamed and modified classes have been updated accordingly.
✅ Verification successful
Let me generate additional verification scripts to check for any potential issues with the response types and pagination patterns.
Response types and Join fields are consistently used across the codebase
The verification shows that:
SearchModelResponse
has been completely removed with no lingering references- The new response types (
ListModelResponse
,SearchJoinResponse
,ListJoinResponse
) are properly used in their respective controllers Join
case class fields are consistently used in the codebase, with proper initialization inMonitoringModelStore
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any remaining references to SearchModelResponse
rg "SearchModelResponse"
# Check for consistent usage of new Join fields
ast-grep --pattern 'new Join($$$)'
# Check for potential pagination usage patterns
rg -A 2 "items: Seq\[(Model|Join)\]"
Length of output: 5930
Script:
#!/bin/bash
# Check for any references to SearchModelResponse in controllers and services
rg -t scala "SearchModelResponse" --glob "hub/app/**/*"
# Check for response type usage patterns in controllers
rg -t scala "(ListModelResponse|SearchJoinResponse|ListJoinResponse)" -A 3 --glob "hub/app/**/*"
# Look for Join constructor calls in hub module
rg -t scala "Join\(" -A 3 --glob "hub/app/**/*"
Length of output: 4251
online/src/main/scala/ai/chronon/online/HTTPKVStore.scala (4)
13-14
:
Backend initialization needs proper serialization handling
The backend initialization could cause serialization issues in Spark executors.
16-16
:
Implement or document unimplemented methods
The multiGet
and bulkPut
methods are marked with ???
.
Also applies to: 39-39
18-37
:
Improve error handling and response mapping in multiPut
Several issues in the multiPut implementation need to be addressed.
46-50
:
Fix potential NPE and improve JSON string construction
The current implementation has several issues that need to be addressed.
hub/conf/routes (3)
4-5
: LGTM! Well-designed RESTful endpoints for joins.
The new join endpoints follow REST conventions and maintain consistency with existing endpoints. The pagination parameters for the list endpoint align with the model listing endpoint.
19-20
:
Address memory management for bulk data ingestion.
The bulk ingestion endpoint using raw parser could lead to memory issues with large payloads.
Consider:
- Adding request size limits in
application.conf
- Implementing streaming for large payloads
- Adding rate limiting for the endpoint
Example configuration:
play.http.parser.maxMemoryBuffer=512K
play.http.parser.maxDiskBuffer=10MB
8-9
: Clarify the status and plan for model timeseries endpoint.
The comment indicates this endpoint is "TBD at the moment" but doesn't explain:
- Why it's temporarily disabled
- What's the plan for replacing/re-enabling it
- How this affects the frontend integration mentioned in PR objectives
Let's check for any related model timeseries implementations:
hub/app/store/MonitoringModelStore.scala (4)
34-37
: LGTM! Good abstraction of implementation details.
The renaming from DynamoDBMonitoringStore to MonitoringModelStore and dynamoDBKVStore to kvStore improves the abstraction by removing implementation details, making the code more maintainable and flexible for future storage backend changes.
105-105
: LGTM! Consistent with the store abstraction.
The change from dynamoDBKVStore.list
to kvStore.list
maintains consistency with the store abstraction changes.
62-68
:
Add null checks for metadata fields.
While the output column handling is improved with getOrElse
, the metadata fields still need null checks to prevent potential NullPointerException.
Previous review suggested adding null checks for metadata fields. This is still applicable:
- val join = Join(thriftJoin.metaData.name,
- outputColumns,
- groupBys,
- thriftJoin.metaData.online,
- thriftJoin.metaData.production,
- thriftJoin.metaData.team)
+ val metadata = thriftJoin.metaData
+ if (metadata == null) {
+ logger.error(s"Missing metadata for join in model ${m.metaData.name}")
+ throw new IllegalStateException("Missing join metadata")
+ }
+ val join = Join(
+ metadata.name,
+ outputColumns,
+ groupBys,
+ Option(metadata.online).getOrElse(false),
+ Option(metadata.production).getOrElse(false),
+ Option(metadata.team).getOrElse("")
+ )
77-91
:
Add error handling for join configurations.
The getJoins method should include error handling for invalid join configurations to prevent silent failures.
Previous review suggested adding error handling. This is still applicable:
def getJoins: Seq[Join] = {
configRegistryCache("default").joins.map { thriftJoin =>
+ try {
val groupBys = thriftJoin.joinParts.asScala.map { part =>
+ if (part.groupBy == null || part.groupBy.metaData == null) {
+ logger.error(s"Invalid join configuration: missing groupBy metadata in join ${thriftJoin.metaData.name}")
+ throw new IllegalStateException("Invalid join configuration: missing groupBy metadata")
+ }
GroupBy(part.groupBy.metaData.name, part.groupBy.valueColumns)
}
val outputColumns = thriftJoin.outputColumnsByGroup.getOrElse("derivations", Array.empty)
Join(thriftJoin.metaData.name, outputColumns, groupBys)
+ } catch {
+ case e: Exception =>
+ logger.error(s"Failed to process join configuration: ${thriftJoin.metaData.name}", e)
+ throw e
+ }
}
}
hub/app/controllers/TimeSeriesController.scala (2)
3-28
: LGTM: Good transition to async operations
The addition of DriftStore dependency and ExecutionContext follows Play Framework best practices for handling async operations. This will improve scalability and responsiveness.
209-214
: Previous null safety concerns still apply
The null-checking patterns in both conversion methods could lead to NullPointerException. Consider using Option types as suggested in previous reviews.
Also applies to: 231-238
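A small sketch of the Option-based pattern this refers to, written against a stand-in field rather than the real Thrift accessors:

```scala
import scala.jdk.CollectionConverters._

object NullSafeSeries {
  // Stand-in for a Thrift-generated field that may be null.
  case class RawSeries(percentiles: java.util.List[java.lang.Double])

  // Wrap possibly-null collections (and elements) in Option instead of checking for null inline.
  def safePoints(raw: RawSeries): Seq[Double] =
    Option(raw.percentiles)
      .map(_.asScala.toSeq)
      .getOrElse(Seq.empty)
      .flatMap(v => Option(v).map(_.doubleValue))
}
```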
hub/test/controllers/TimeSeriesControllerSpec.scala (3)
Line range hint 3-42
: LGTM! Well-structured test setup with proper mocking.
The test setup follows best practices with:
- Appropriate imports for the required functionality
- Proper mock initialization for DriftStore
- Clear test organization with PlaySpec
275-275
: Typo in variable name nullRationChangePercents
288-288
: Duplicate setNodeName
call on tileKey
--master "local[*]" \ | ||
--driver-memory 8g \ | ||
--conf "spark.driver.maxResultSize=6g" \ | ||
--conf "spark.driver.memory=8g" \ |
💡 Codebase verification
Memory configuration in demo environment needs adjustment
The verification reveals inconsistency in memory configurations across the codebase. While the demo environment uses fixed high values (8GB driver memory, 6GB max result size), other scripts in the test environment use a more flexible approach with environment variables defaulting to 1GB, which is more suitable for demo purposes.
Suggested changes:
- Consider adopting the same environment variable approach used in test scripts (
DRIVER_MEMORY:-1G
) - Reduce the default memory allocation to be more accommodating for demo environments
- Remove the duplicate memory configuration (
--driver-memory
andspark.driver.memory
)
🔗 Analysis chain
Review memory configuration for demo environment
The current configuration allocates 8GB of memory and 6GB for max result size, which might be excessive for a demo environment and could cause issues on machines with limited resources.
Let's check if these memory settings are documented or consistent across the codebase:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for similar Spark memory configurations
rg -A 2 "driver-memory|maxResultSize" --type yaml --type sh
Length of output: 935
def get(name: String): Action[AnyContent] = {
  Action { implicit request: Request[AnyContent] =>
    val maybeJoin = monitoringStore.getJoins.find(j => j.name.equalsIgnoreCase(name))
    maybeJoin match {
      case None       => NotFound(s"Join: $name wasn't found")
      case Some(join) => Ok(join.asJson.noSpaces)
    }
  }
}
Improve error handling and input validation in get endpoint
The endpoint needs several improvements:
- Add error handling for store operations
- Validate the name parameter
- Make response format consistent with list endpoint
Apply these improvements:
def get(name: String): Action[AnyContent] = {
Action { implicit request: Request[AnyContent] =>
+ Logger.info(s"Received request to get join with name=$name")
+
+ if (name.trim.isEmpty) {
+ BadRequest("Join name cannot be empty")
+ } else Try {
val maybeJoin = monitoringStore.getJoins.find(j => j.name.equalsIgnoreCase(name))
maybeJoin match {
case None => NotFound(s"Join: $name wasn't found")
- case Some(join) => Ok(join.asJson.noSpaces)
+ case Some(join) => Ok(Map("data" -> join).asJson.noSpaces)
}
+ } recover {
+ case ex: Exception =>
+ Logger.error(s"Failed to fetch join: $name", ex)
+ InternalServerError(s"Failed to fetch join: ${ex.getMessage}")
+ } get
}
}
// Hacky test kv store that we use to send objects to the in-memory KV store that lives in a different JVM (e.g spark -> hub)
class HTTPKVStore(host: String = "localhost", port: Int = 9000) extends KVStore with Serializable {
Remove or clarify the "hacky" designation
The comment indicates this is a "hacky test kv store", but based on the PR objectives, this appears to be used in a production context for drift metrics computation. Either:
- Remove the comment if this is intended for production use
- Document the temporary nature and create a tracking issue for proper implementation
override def create(dataset: String): Unit = {
  logger.warn(s"Skipping creation of $dataset in HTTP kv store implementation")
}
Improve create method implementation
The current implementation silently skips dataset creation which could lead to issues:
- No documentation explaining why creation is skipped
- No error propagation to caller
- Using warn level logging for expected behavior
Consider either:
- Implementing proper dataset creation
- Throwing UnsupportedOperationException with explanation
- Documenting why creation is unnecessary in this implementation
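For the UnsupportedOperationException option, a hedged sketch against a simplified stand-in trait (the real KVStore interface in `online` has more methods) could look like:

```scala
// Simplified stand-in for the real KVStore trait.
trait SimpleKVStore {
  def create(dataset: String): Unit
}

class HttpBackedKVStore extends SimpleKVStore {
  // Fail loudly so callers don't silently assume the dataset now exists.
  override def create(dataset: String): Unit =
    throw new UnsupportedOperationException(
      s"HTTP KV store does not manage datasets; '$dataset' must be created by the hub-side store")
}
```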
(maybeCurrentSummarySeries, maybeBaselineSummarySeries) match {
  case (Failure(exceptionA), Failure(exceptionB)) =>
    Future.successful(InternalServerError(
      s"Error computing feature percentiles for current + offset time window.\nCurrent window error: ${exceptionA.getMessage}\nOffset window error: ${exceptionB.getMessage}"))
  case (_, Failure(exception)) =>
    Future.successful(
      InternalServerError(
        s"Error computing feature percentiles for offset time window - ${exception.getMessage}"))
  case (Failure(exception), _) =>
    Future.successful(
      InternalServerError(
        s"Error computing feature percentiles for current time window - ${exception.getMessage}"))
  case (Success(currentSummarySeriesFuture), Success(baselineSummarySeriesFuture)) =>
    Future.sequence(Seq(currentSummarySeriesFuture, baselineSummarySeriesFuture)).map { merged =>
      val currentSummarySeries = merged.head
      val baselineSummarySeries = merged.last
      val currentFeatureTs = convertTileSummarySeriesToTimeSeries(currentSummarySeries.head, metric)
      val baselineFeatureTs = convertTileSummarySeriesToTimeSeries(baselineSummarySeries.head, metric)
      val comparedTsData = ComparedFeatureTimeSeries(name, baselineFeatureTs, currentFeatureTs)
      Ok(comparedTsData.asJson.noSpaces)
    }
}
🛠️ Refactor suggestion
Simplify complex Future composition with for-comprehension
The nested pattern matching for handling multiple Future results can be simplified using for-comprehension and Future.sequence.
- (maybeCurrentSummarySeries, maybeBaselineSummarySeries) match {
- case (Failure(exceptionA), Failure(exceptionB)) =>
- Future.successful(InternalServerError(...))
- case (_, Failure(exception)) =>
- Future.successful(InternalServerError(...))
- case (Failure(exception), _) =>
- Future.successful(InternalServerError(...))
- case (Success(currentSummarySeriesFuture), Success(baselineSummarySeriesFuture)) =>
+ val result = for {
+ current <- Future.fromTry(maybeCurrentSummarySeries).flatten
+ baseline <- Future.fromTry(maybeBaselineSummarySeries).flatten
+ } yield {
+ val currentFeatureTs = convertTileSummarySeriesToTimeSeries(current.head, metric)
+ val baselineFeatureTs = convertTileSummarySeriesToTimeSeries(baseline.head, metric)
+ val comparedTsData = ComparedFeatureTimeSeries(name, baselineFeatureTs, currentFeatureTs)
+ Ok(comparedTsData.asJson.noSpaces)
+ }
+ result.recover {
+ case e: Exception => InternalServerError(s"Error computing feature percentiles: ${e.getMessage}")
+ }
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemoDataLoader.scala (1)
23-29
: Consider enhancing the timing utility for robustness. While the implementation is functional, consider these improvements:
- Add exception handling to ensure timing completion is logged even if the block fails
- Use System.nanoTime() for more precise measurements
- Consider adding a fallback for environments where color codes aren't supported
Here's a suggested implementation:
- def time(message: String)(block: => Unit): Unit = { - logger.info(s"$message..".yellow) - val start = System.currentTimeMillis() - block - val end = System.currentTimeMillis() - logger.info(s"$message took ${end - start} ms".green) + def time(message: String)(block: => Unit): Unit = { + logger.info(s"$message..") + val start = System.nanoTime() + try { + block + } catch { + case e: Exception => + logger.error(s"$message failed: ${e.getMessage}") + throw e + } finally { + val end = System.nanoTime() + val duration = (end - start) / 1000000 // Convert to milliseconds + logger.info(s"$message took $duration ms") + } + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
build.sbt
(4 hunks)docker-init/demo/load_summaries.sh
(1 hunks)hub/app/controllers/InMemKVStoreController.scala
(1 hunks)spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemoDataLoader.scala
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- build.sbt
- hub/app/controllers/InMemKVStoreController.scala
🧰 Additional context used
🪛 Shellcheck (0.10.0)
docker-init/demo/load_summaries.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
🔇 Additional comments (5)
docker-init/demo/load_summaries.sh (4)
1-2
: Add shell script essentials
The script is missing crucial shell script components:
- Shebang directive
- Execution permissions
- Usage instructions and requirements
🧰 Tools
🪛 Shellcheck (0.10.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
5-7
: 🛠️ Refactor suggestion
Simplify and optimize memory configuration
Issues with current memory settings:
- Driver memory is configured twice (lines 5 and 7)
- Fixed high memory values (8GB, 6GB) might be excessive for demo environments
Consider this simplified configuration:
- --driver-memory 8g \
- --conf "spark.driver.maxResultSize=6g" \
- --conf "spark.driver.memory=8g" \
+ --driver-memory "${DRIVER_MEMORY:-2g}" \
+ --conf "spark.driver.maxResultSize=${MAX_RESULT_SIZE:-2g}" \
Likely invalid or redundant comment.
9-10
: Verify network configuration for Docker environment
The current setup uses localhost
as driver host while binding to all interfaces (0.0.0.0
). This might cause connectivity issues in certain Docker network configurations.
✅ Verification successful
Let me gather more context about the Docker environment configuration to make a more informed decision.
Let me check the Spark configuration files to see if there are any default network settings that might provide more context.
Network configuration is appropriate for Docker environment
The current setup using localhost
as driver host with 0.0.0.0
bind address is actually the correct configuration for a Spark application running in Docker. This combination allows the driver to be accessible from within the container (localhost
) while still accepting connections from other containers or hosts (0.0.0.0
). This is a common pattern in Docker deployments and is consistent across both demo scripts (load_summaries.sh
and run.sh
).
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if there are any similar Spark network configurations in other scripts
# that might provide insights into the correct setup
rg -A 2 "spark.driver.(host|bindAddress)" --type yaml --type sh
Length of output: 673
Script:
#!/bin/bash
# Check Docker-related configuration files for network setup
fd Dockerfile --type f --exec cat {}
Length of output: 8341
Script:
#!/bin/bash
# Check Spark configuration files for network-related settings
fd "spark-defaults.conf" --type f --exec cat {}
Length of output: 678
8-8
: Verify application configuration and dependencies
Please verify:
- The existence of required JARs in the specified paths
- The main class name and its dependencies
Also applies to: 11-12
✅ Verification successful
Configuration and dependencies are properly set up
The verification confirms:
- The main class
ObservabilityDemoDataLoader
exists with a propermain
method inspark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemoDataLoader.scala
- The JAR paths in the script align with the Docker setup:
- Spark JARs are correctly located at
/opt/spark/jars/*
- Additional classpath entries are properly configured in the Docker environment
- Spark JARs are correctly located at
- The application is built with Spark 3.5.x and Scala 2.12, matching the Docker image configuration
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if the main class exists in the codebase
ast-grep --pattern 'class ObservabilityDemoDataLoader'
# Check if there are any references to this loader in other files
rg "ObservabilityDemoDataLoader" --type scala --type java
Length of output: 231
Script:
#!/bin/bash
# Check the implementation of ObservabilityDemoDataLoader
rg -A 10 "object ObservabilityDemoDataLoader" --type scala
# Check for JAR files in the repository
fd ".jar$"
# Check if there are any build or dependency files that might indicate required JARs
fd "build.sbt|pom.xml|build.gradle"
Length of output: 1400
Script:
#!/bin/bash
# Check the content of build.sbt for dependencies
cat build.sbt
# Check if there are any Dockerfile or docker-compose files that might show the JAR paths
fd "Dockerfile|docker-compose.yml"
# Look for any main method in ObservabilityDemoDataLoader
ast-grep --pattern 'def main(args: Array[String])'
Length of output: 11330
Script:
#!/bin/bash
# Check the Dockerfile in docker-init/demo for JAR paths
cat docker-init/demo/Dockerfile
# Check for the main method specifically in ObservabilityDemoDataLoader
rg -A 5 "def main" spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemoDataLoader.scala
Length of output: 1473
spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemoDataLoader.scala (1)
20-22
: LGTM! Logger setup follows Spark best practices.
The logger implementation correctly uses @transient and lazy initialization to prevent serialization issues in Spark.
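For reference, the general shape of that pattern (a generic illustration, not the loader's actual code):

```scala
import org.slf4j.{Logger, LoggerFactory}

// Keep the logger out of serialized closure state: @transient skips it during serialization,
// and lazy re-creates it on each executor when first used.
class DemoStage extends Serializable {
  @transient private lazy val logger: Logger = LoggerFactory.getLogger(getClass)

  def process(rows: Seq[String]): Unit =
    logger.info(s"Processing ${rows.size} rows")
}
```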
val spark = SparkSessionBuilder.build(namespace, local = true)
implicit val tableUtils: TableUtils = TableUtils(spark)
tableUtils.createDatabase(namespace)
🛠️ Refactor suggestion
Avoid hardcoding Spark configuration.
The local=true
parameter should be configurable through command-line options to support different deployment environments.
Add a configuration option:
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
+ val localMode: ScallopOption[Boolean] = opt[Boolean](
+ name = "local-mode",
+ default = Some(false),
+ descr = "Run Spark in local mode"
+ )
}
def main(args: Array[String]): Unit = {
val config = new Conf(args)
- val spark = SparkSessionBuilder.build(namespace, local = true)
+ val spark = SparkSessionBuilder.build(namespace, local = config.localMode())
Committable suggestion skipped: line range outside the PR's diff.
time("Uploading summaries") { | ||
val uploader = new SummaryUploader(tableUtils.loadTable(packedTable), hubApi) | ||
uploader.run() | ||
} | ||
|
||
println("Done uploading summaries! \uD83E\uDD73".green) | ||
// clean up spark session and force jvm exit | ||
spark.stop() | ||
System.exit(0) | ||
} |
Improve cleanup and error handling.
The current implementation has several issues:
- Uses System.exit which bypasses proper cleanup
- Lacks try-with-resources pattern for Spark session
- Missing error handling for upload operation
Consider this improvement:
- time("Uploading summaries") {
- val uploader = new SummaryUploader(tableUtils.loadTable(packedTable), hubApi)
- uploader.run()
- }
-
- println("Done uploading summaries! \uD83E\uDD73".green)
- // clean up spark session and force jvm exit
- spark.stop()
- System.exit(0)
+ try {
+ time("Uploading summaries") {
+ val uploader = new SummaryUploader(tableUtils.loadTable(packedTable), hubApi)
+ uploader.run()
+ }
+ logger.info("Done uploading summaries! \uD83E\uDD73")
+ } catch {
+ case e: Exception =>
+ logger.error(s"Failed to upload summaries: ${e.getMessage}", e)
+ throw e
+ } finally {
+ if (spark != null) {
+ spark.stop()
+ }
+ }
// mock api impl for online fetching and uploading
val inMemKvStoreFunc: () => KVStore = () => {
  // cannot reuse the variable - or serialization error
  val result = InMemoryKvStore.build(namespace, () => null)
  result
}
val inMemoryApi = new MockApi(inMemKvStoreFunc, namespace)
Improve error handling for KVStore initialization.
The KVStore initialization passes a null function and lacks error handling.
Consider this improvement:
val inMemKvStoreFunc: () => KVStore = () => {
- // cannot reuse the variable - or serialization error
- val result = InMemoryKvStore.build(namespace, () => null)
- result
+ try {
+ InMemoryKvStore.build(namespace, () => throw new UnsupportedOperationException("Mock store doesn't support this operation"))
+ } catch {
+ case e: Exception =>
+ logger.error(s"Failed to initialize InMemoryKvStore: ${e.getMessage}")
+ throw e
+ }
}
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val startDs: ScallopOption[String] = opt[String](
    name = "start-ds",
    default = Some("2023-01-01"),
    descr = "Start date in YYYY-MM-DD format"
  )

  val endDs: ScallopOption[String] = opt[String](
    name = "end-ds",
    default = Some("2023-02-30"),
    descr = "End date in YYYY-MM-DD format"
  )

  val rowCount: ScallopOption[Int] = opt[Int](
    name = "row-count",
    default = Some(700000),
    descr = "Number of rows to generate"
  )

  val namespace: ScallopOption[String] = opt[String](
    name = "namespace",
    default = Some("observability_demo"),
    descr = "Namespace for the demo"
  )

  verify()
}
Fix configuration validation issues.
Several validation issues need to be addressed:
- The default end date "2023-02-30" is invalid as February never has 30 days
- Missing validation for date format
- Missing validation for minimum row count
Here's a suggested implementation:
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
+ // Date format validation
+ private val dateFormat = "yyyy-MM-dd"
+ private def validateDate(date: String): Boolean = {
+ try {
+ java.time.LocalDate.parse(date, java.time.format.DateTimeFormatter.ofPattern(dateFormat))
+ true
+ } catch {
+ case _: Exception => false
+ }
+ }
+
val startDs: ScallopOption[String] = opt[String](
name = "start-ds",
default = Some("2023-01-01"),
- descr = "Start date in YYYY-MM-DD format"
+ descr = s"Start date in $dateFormat format",
+ validate = validateDate
)
val endDs: ScallopOption[String] = opt[String](
name = "end-ds",
- default = Some("2023-02-30"),
+ default = Some("2023-02-28"),
- descr = "End date in YYYY-MM-DD format"
+ descr = s"End date in $dateFormat format",
+ validate = validateDate
)
val rowCount: ScallopOption[Int] = opt[Int](
name = "row-count",
default = Some(700000),
- descr = "Number of rows to generate"
+ descr = "Number of rows to generate (minimum: 1)",
+ validate = _ > 0
)
// Add cross-validation for dates
+ validate(startDs, endDs) { (start, end) =>
+ val startDate = java.time.LocalDate.parse(start)
+ val endDate = java.time.LocalDate.parse(end)
+ if (startDate.isAfter(endDate)) {
+ Left("Start date must be before or equal to end date")
+ } else {
+ Right(())
+ }
+ }
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (6)
hub/app/controllers/TimeSeriesController.scala (6)
3-8
: Consider organizing imports for better readability
The imports could be organized into logical groups:
- Play Framework imports
- Chronon API imports
- Scala standard library imports
- Project-specific imports
```diff
-import ai.chronon.api.DriftMetric
-import ai.chronon.api.Extensions.WindowOps
-import ai.chronon.api.TileDriftSeries
-import ai.chronon.api.TileSummarySeries
-import ai.chronon.api.TimeUnit
-import ai.chronon.api.Window
+// Chronon API imports
+import ai.chronon.api.{DriftMetric, TileDriftSeries, TileSummarySeries, TimeUnit, Window}
+import ai.chronon.api.Extensions.WindowOps

-import scala.concurrent.ExecutionContext
-import scala.concurrent.Future
-import scala.concurrent.duration._
-import scala.jdk.CollectionConverters._
-import scala.util.Failure
-import scala.util.Success
+// Scala imports
+import scala.concurrent.{ExecutionContext, Future}
+import scala.concurrent.duration._
+import scala.jdk.CollectionConverters._
+import scala.util.{Failure, Success}
```
Also applies to: 15-20
53-62
: Update method documentation to reflect new parameters
The documentation for `fetchFeature` should be updated to include information about the new `join` parameter and its purpose.
```diff
 /**
   * Helps retrieve a time series (drift or skew) for a given feature. Time series is
   * retrieved between the start and end ts. Choice of granularity (raw, aggregate, percentiles) along with the
   * metric type (drift / skew) dictates the shape of the returned time series.
+  * @param join The join identifier used to locate the feature's data
   */
```
90-91
: Enhance error messages with format examples
The error messages could be more helpful by including the expected format.
```diff
- case (None, _) => Future.successful(BadRequest(s"Unable to parse offset - $offset"))
- case (_, None) => Future.successful(BadRequest("Invalid drift algorithm. Expect JSD, PSI or Hellinger"))
+ case (None, _) => Future.successful(BadRequest(s"Unable to parse offset - $offset. Expected format: <number>h or <number>d (e.g., 24h, 7d)"))
+ case (_, None) => Future.successful(BadRequest("Invalid drift algorithm. Expected one of: JSD, PSI, Hellinger"))
```
94-94
: Extract path transformation logic to a utility method
The path transformation logic appears in multiple places and should be extracted to a utility method.
```diff
+ private def transformJoinPath(path: String): String = path.replaceFirst("\\.", "/")
+
  private def doFetchJoinDrift(...) = {
-   val joinPath = name.replaceFirst("\\.", "/")
+   val joinPath = transformJoinPath(name)
```
221-223
: Add validation for timestamps and values in point conversion
The point conversion should validate that timestamps are not negative and values are within reasonable bounds.
```diff
  val points = lhsList.zip(tileDriftSeries.timestamps.asScala).map {
-   case (v, ts) => TimeSeriesPoint(v, ts)
+   case (v, ts) if ts >= 0 =>
+     require(v.isFinite, s"Invalid value: $v")
+     TimeSeriesPoint(v, ts)
+   case (_, ts) =>
+     throw new IllegalArgumentException(s"Invalid timestamp: $ts")
  }
```
283-284
: Remove commented out code
The commented out cases for "skew" and "ooc" should be removed if they are no longer needed. If they are planned for future implementation, consider adding a TODO comment with a ticket reference.
```diff
-// case Some("skew") => Some(Skew)
-// case Some("ooc") => Some(Skew)
+// TODO(TICKET-123): Add support for skew and out-of-control metrics
```
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (3)
77-78
: Consider adding error handling for serialization operations.
While the ThreadLocal usage is correct, the serialization operations could benefit from explicit error handling to prevent potential runtime exceptions.
Consider wrapping the serialization operations in Try:
```diff
- val serializer: TSerializer = binarySerializer.get()
- val tileKeyMap = tileKeysForJoin(joinConf, None, columnPrefix)
- val requestContextMap: Map[GetRequest, SummaryRequestContext] = tileKeyMap.flatMap {
+ Try {
+   val serializer: TSerializer = binarySerializer.get()
+   val tileKeyMap = tileKeysForJoin(joinConf, None, columnPrefix)
+   val requestContextMap: Map[GetRequest, SummaryRequestContext] = tileKeyMap.flatMap {
```
203-211
: Consider handling potential initialization failures in ThreadLocal.
While the ThreadLocal implementation is correct, the initialValue method could potentially throw exceptions during initialization.
Consider adding error handling in initialValue:
```diff
 @transient lazy val binarySerializer: ThreadLocal[TSerializer] = new ThreadLocal[TSerializer] {
-  override def initialValue(): TSerializer = new TSerializer(new TBinaryProtocol.Factory())
+  override def initialValue(): TSerializer = Try(new TSerializer(new TBinaryProtocol.Factory())).getOrElse {
+    // Log the error
+    throw new RuntimeException("Failed to initialize binary serializer")
+  }
 }
```
203-211
: Consider abstracting serialization strategy into KVStore interface.
The serialization strategy (binary vs compact) could be encapsulated within the KVStore interface rather than being managed by the DriftStore (a rough sketch follows the list below). This would:
- Make the serialization strategy configurable per KVStore implementation
- Reduce coupling between DriftStore and serialization details
- Simplify testing by allowing mock serializers
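A rough shape of that abstraction, as a hedged sketch only — the `TileSerde` trait and `serde` member are hypothetical names, not part of the existing KVStore API:
```scala
import org.apache.thrift.TBase
import org.apache.thrift.TSerializer
import org.apache.thrift.protocol.TBinaryProtocol

// Hypothetical serde abstraction: the store decides how tiles are encoded,
// and DriftStore would call kvStore.serde.serialize(key) instead of owning a serializer.
trait TileSerde {
  def serialize(value: TBase[_, _]): Array[Byte]
}

object BinaryTileSerde extends TileSerde {
  // one serializer per thread, mirroring the ThreadLocal pattern used in DriftStore
  private val serializer = new ThreadLocal[TSerializer] {
    override def initialValue(): TSerializer = new TSerializer(new TBinaryProtocol.Factory())
  }
  override def serialize(value: TBase[_, _]): Array[Byte] = serializer.get().serialize(value)
}

// A KVStore implementation (or a test mixin) could then swap the strategy freely,
// e.g. a mock serde that records calls instead of producing real Thrift bytes.
trait SerdeAwareKVStore {
  def serde: TileSerde = BinaryTileSerde
}
```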
spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (1)
Line range hint
1-479
: Consider documenting the new serialization strategy
The transition from mock data to generated data, coupled with the new binary serialization approach, represents a significant architectural change. Consider adding documentation that explains:
- The rationale behind choosing binary serialization
- Any performance implications
- Data format compatibility considerations
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- docker-init/demo/log4j2.properties (0 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (4 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (2 hunks)
💤 Files with no reviewable changes (1)
- docker-init/demo/log4j2.properties
🔇 Additional comments (3)
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (2)
15-16
: LGTM: Import changes align with serializer implementation.
The imports correctly reference the new binary serializer implementations.
213-234
: Consider making percentile labels more configurable and type-safe.
This comment reiterates a previous review suggestion about making the percentile labels more maintainable and flexible.
spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (1)
9-9
: Verify backward compatibility of the new serializer
The switch from `compactSerializer` to `binarySerializer` might affect the ability to read previously serialized data.
✅ Verification successful
Switch to binary serializer is intentional and safe
The change from `compactSerializer` to `binarySerializer` is actually a necessary fix. The DriftStore implementation explicitly mentions "crazy bug in compact protocol - do not change to compact" in the comments. The `binarySerializer` is the only serializer defined in DriftStore and is used consistently across the codebase for both serialization and deserialization of drift metrics data; a minimal sketch of that pattern follows the key findings below.
Key findings:
- The binary protocol is explicitly chosen to avoid bugs in the compact protocol
- Both serialization and deserialization use the same TBinaryProtocol consistently
- No compatibility issues as there are no remaining uses of compact protocol for drift metrics
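To make the protocol pairing concrete, here is a small sketch of a matched write/read pair. The `ai.chronon.api.TileSeriesKey` import path is an assumption; any Thrift-generated struct behaves the same way:
```scala
import ai.chronon.api.TileSeriesKey // assumed location of the generated tile key struct
import org.apache.thrift.{TDeserializer, TSerializer}
import org.apache.thrift.protocol.TBinaryProtocol

// Writer and reader must be built from the same protocol factory; pairing a
// TBinaryProtocol writer with a TCompactProtocol reader (or vice versa) will not decode.
val writeBinary = new TSerializer(new TBinaryProtocol.Factory())
val readBinary  = new TDeserializer(new TBinaryProtocol.Factory())

val key = new TileSeriesKey() // fields left unset; an empty struct round-trips the same way
val bytes = writeBinary.serialize(key)

val decoded = new TileSeriesKey()
readBinary.deserialize(decoded, bytes) // succeeds because both sides agree on TBinaryProtocol
```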
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for existing uses of compactSerializer and potential compatibility issues
# Search for any remaining uses of compactSerializer
rg "compactSerializer" --type scala
# Search for any serialization compatibility tests
rg -l "SerializationTest|SerializerTest" --type scala
Length of output: 350
Script:
#!/bin/bash
# Let's check for the usage and implementation of binarySerializer
rg "binarySerializer" --type scala -A 3
# Also check for any migration or compatibility related code
rg "Migration|Compatibility" --type scala
# Check the DriftStore implementation
fd "DriftStore.scala" --type f --exec cat {}
Length of output: 12153
val serializer = binarySerializer.get()
🛠️ Refactor suggestion
Consider moving serializer initialization outside the map operation
The serializer is currently initialized for each row in the RDD, which could impact performance. Consider moving the initialization outside the map operation and broadcasting it to all workers.
def packSummaryDf(df: DataFrame): (DataFrame, Unit) = {
+ val serializer = binarySerializer.get()
val packedRdd: RDD[sql.Row] = df.rdd.flatMap(func).map { tileRow =>
- val serializer = binarySerializer.get()
val partition = tileRow.partition
val timestamp = tileRow.tileTs
Committable suggestion skipped: line range outside the PR's diff.
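For reference, the usual way to get this effect in Spark without shipping a non-serializable serializer from the driver is to initialize it once per partition on the executors. A minimal, self-contained illustration of that pattern (Base64 stands in for the Thrift serializer; this is not the Summarizer code itself):
```scala
import org.apache.spark.sql.SparkSession

object PerPartitionInitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("per-partition-init").getOrCreate()
    val numbers = spark.sparkContext.parallelize(1 to 1000, numSlices = 4)

    val encoded = numbers.mapPartitions { nums =>
      // created once per partition on the executor, analogous to binarySerializer.get()
      val encoder = java.util.Base64.getEncoder
      nums.map(n => encoder.encodeToString(BigInt(n).toByteArray))
    }

    println(encoded.take(3).mkString(", "))
    spark.stop()
  }
}
```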
```diff
 trait Paginate {
   val defaultOffset = 0
   val defaultLimit = 10
   val maxLimit = 100

-  def paginateResults(results: Seq[Model], offset: Int, limit: Int): Seq[Model] = {
+  def paginateResults[T](results: Seq[T], offset: Int, limit: Int): Seq[T] = {
```
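With the type parameter in place, the same helper can page any result type; a hypothetical call site:
```scala
// Hypothetical usage: allJoins / allModels are whatever sequences the controllers hold.
val joinPage: Seq[Join]   = paginateResults(allJoins, offset = 0, limit = defaultLimit)
val modelPage: Seq[Model] = paginateResults(allModels, offset = 20, limit = 10)
```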
nice!!
9688376 to 4965c8a (Compare)
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (16)
docker-init/demo/README.md (1)
1-25
: Add architectural overview and data flow description
The README should explain the new architecture connecting Spark drift metrics computation with Hub.
Add a new section at the beginning:
+# Architecture Overview
+
+This demo showcases the integration between:
+1. Spark-based drift metrics computation
+2. Hub's InMemoryKV store for data persistence
+3. Frontend visualization
+
+## Data Flow
+```mermaid
+graph LR
+    A[Spark Job] -->|Compute Drift Metrics| B[Hub]
+    B -->|Serve via API| C[Frontend]
+    B -->|Serve via API| D[Streamlit]
+```
+
 # Populate Observability Demo Data
🪛 LanguageTool
[grammar] ~20-~20: Replace ‘exec’ with an appropriate verb.
Context: ...n build.sh once, and you can repeatedly exec to quickly visualize In first termina...(MD_NN)
[uncategorized] ~22-~22: You might be missing the article “the” here.
Context: ...peatedly exec to quickly visualize In first terminal:sbt spark/assembly
In seco...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~23-~23: You might be missing the article “the” here.
Context: ...irst terminal:sbt spark/assembly
In second terminal:./run.sh
to load the built ...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~24-~24: You might be missing the article “the” here.
Context: ...and serve the data on localhost:8181 In third terminal:streamlit run viz.py
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
hub/app/controllers/SearchController.scala (2)
20-24
: Document API response type change in changelog
The endpoint now returns joins instead of models, which is a breaking change for API consumers. While the documentation is updated correctly, we should ensure this change is properly communicated.
Consider:
- Adding API version headers for backward compatibility
- Documenting the response schema change in the API changelog
- Updating any API documentation (e.g., Swagger/OpenAPI specs)
38-39
: Consider caching search results
Given that this endpoint serves the frontend and might be called frequently, consider implementing caching for the search results to improve performance (a sketch follows the list below).
Consider:
- Adding response caching with appropriate cache invalidation
- Using Play's built-in caching mechanisms
- Adding cache headers in the response
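One way this could look with Play's built-in cache — a hedged sketch where the controller name, key scheme, TTL, and registry lookup are placeholders rather than the actual SearchController wiring:
```scala
import javax.inject.Inject
import play.api.cache.AsyncCacheApi
import play.api.mvc.{AbstractController, ControllerComponents}
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._

class CachedSearchController @Inject() (cache: AsyncCacheApi, cc: ControllerComponents)(
    implicit ec: ExecutionContext)
    extends AbstractController(cc) {

  // Placeholder for the real registry lookup (e.g. filtering monitoringStore joins by term).
  private def searchJoins(term: String): Future[Seq[String]] = Future.successful(Seq.empty)

  def search(term: String, offset: Int, limit: Int) = Action.async {
    // Re-run the search at most once per minute per term; invalidate when configs are re-uploaded.
    cache.getOrElseUpdate(s"search:$term", 1.minute)(searchJoins(term)).map { results =>
      Ok(results.slice(offset, offset + limit).mkString(","))
    }
  }
}
```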
hub/app/model/Model.scala (1)
5-10
: Fix parameter indentation in Join case class
While the new fields (online, production, team) are logically sound additions, the parameter indentation is inconsistent.
Consider applying this formatting:
```diff
-case class Join(name: String,
-                joinFeatures: Seq[String],
-                groupBys: Seq[GroupBy],
-                online: Boolean,
-                production: Boolean,
-                team: String)
+case class Join(
+    name: String,
+    joinFeatures: Seq[String],
+    groupBys: Seq[GroupBy],
+    online: Boolean,
+    production: Boolean,
+    team: String
+)
```
Line range hint
39-46
: Add error handling for serialization operationsThe serialization logic could fail due to various reasons (invalid data, memory issues). Consider wrapping the operations in a try-catch block to handle potential failures gracefully.
Here's a suggested improvement:
private def convertToBytesMap[T <: TBase[_, _]: Manifest: ClassTag]( series: T, keyF: T => TileSeriesKey): Map[String, String] = { - val serializerInstance = DriftStore.binarySerializer.get() - val encoder = Base64.getEncoder - val keyBytes = serializerInstance.serialize(keyF(series)) - val valueBytes = serializerInstance.serialize(series) - Map( - "keyBytes" -> encoder.encodeToString(keyBytes), - "valueBytes" -> encoder.encodeToString(valueBytes) - ) + try { + val serializerInstance = DriftStore.binarySerializer.get() + val encoder = Base64.getEncoder + val keyBytes = serializerInstance.serialize(keyF(series)) + val valueBytes = serializerInstance.serialize(series) + Map( + "keyBytes" -> encoder.encodeToString(keyBytes), + "valueBytes" -> encoder.encodeToString(valueBytes) + ) + } catch { + case e: Exception => + logger.error(s"Failed to serialize series: ${e.getMessage}", e) + throw new RuntimeException("Failed to serialize data", e) + }
Line range hint
39-46
: Consider performance optimization for high-throughput scenariosThe current implementation creates a new serializer instance and performs Base64 encoding for every request. For high-throughput scenarios, consider:
- Caching the encoder instance
- Reusing serializer instances where possible
- Evaluating if Base64 encoding is necessary (could potentially use binary responses)
Line range hint
58-59
: Inconsistent serialization approach between endpointsThe drift-series endpoint appears to bypass the byte serialization (commented code), while summary-series uses it. This inconsistency could lead to:
- Different data formats being served
- Potential integration issues with clients
- Maintenance challenges
Consider applying a consistent serialization approach across all endpoints.
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1)
Line range hint
91-117
: Enhance error handling for deserialization failuresThe current error handling simply prints stack traces and drops failed entries. Consider:
- Adding proper error logging
- Implementing retry logic for transient failures
- Adding metrics for monitoring deserialization failures
- case Failure(exception) => exception.printStackTrace(); None + case Failure(exception) => + logger.error(s"Failed to deserialize summary: ${exception.getMessage}", exception) + metrics.incrementCounter("drift.deserialization.failures") + Nonebuild.sbt (2)
83-88
: Enhance the comment to specify which case classes use Circe
The comment provides context but could be more specific about which case classes or payloads are being serialized/deserialized.
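For context, the pattern these libraries enable is generic derivation for the hub's response payloads, roughly as below; the case classes here are hypothetical stand-ins for illustration, not the actual hub models:
```scala
import io.circe.generic.auto._
import io.circe.parser.decode
import io.circe.syntax._

// Hypothetical payload shape, standing in for the hub response case classes
// that are serialized with .asJson.noSpaces in the controllers.
case class TimeSeriesPointExample(value: Double, ts: Long)
case class FeatureSeriesExample(feature: String, points: Seq[TimeSeriesPointExample])

val payload = FeatureSeriesExample("txn_amount", Seq(TimeSeriesPointExample(0.12, 1700000000000L)))
val json: String = payload.asJson.noSpaces            // encoder derived at compile time
val roundTripped = decode[FeatureSeriesExample](json) // decoder derived the same way
```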
139-140
: Create a tracking issue for HTTPKvStore removalThe comment indicates this dependency is temporary. Consider creating a tracking issue to ensure its removal doesn't get forgotten.
Would you like me to create a GitHub issue to track the removal of HTTPKvStore and this dependency?
hub/test/controllers/TimeSeriesControllerSpec.scala (3)
76-84
: Enhance error handling test coverageConsider verifying the error message in the response body to ensure proper error propagation.
val result = controller.fetchJoin("my_join", startTs, endTs, "drift", "null", Some("10h"), Some("psi")).apply(FakeRequest()) status(result) mustBe INTERNAL_SERVER_ERROR +val bodyText = contentAsString(result) +bodyText must include("Some internal error")
Line range hint
87-109
: Strengthen assertions in success case testThe test could be more specific about the expected values in the response. Consider adding assertions for:
- The actual values in the response items
- The structure of the grouped items
- The timestamp values
254-282
: Enhance mock data generationThe current implementation uses fixed values for drift metrics (0.12, 0.23, 0.25). Consider:
- Using a random number generator with fixed seed for reproducibility
- Adding more variation in the mock data
- Adding documentation about the mock data pattern
Example improvement:
+ private val random = new Random(42) // Fixed seed for reproducibility - val percentileDrifts = if (isNumeric) List.fill(timestamps.size())(JDouble.valueOf(0.12)).asJava else List.fill[JDouble](timestamps.size())(null).asJava + val percentileDrifts = if (isNumeric) + timestamps.map(_ => JDouble.valueOf(0.1 + random.nextDouble() * 0.1)).asJava + else + List.fill[JDouble](timestamps.size())(null).asJavahub/app/controllers/TimeSeriesController.scala (3)
175-202
: Simplify complex pattern matching with for-comprehensionThe nested pattern matching for handling multiple Future results can be simplified.
- (maybeCurrentSummarySeries, maybeBaselineSummarySeries) match { - case (Failure(exceptionA), Failure(exceptionB)) => - Future.successful(InternalServerError( - s"Error computing feature percentiles for current + offset time window.\nCurrent window error: ${exceptionA.getMessage}\nOffset window error: ${exceptionB.getMessage}")) - case (_, Failure(exception)) => - Future.successful( - InternalServerError( - s"Error computing feature percentiles for offset time window - ${exception.getMessage}")) - case (Failure(exception), _) => - Future.successful( - InternalServerError( - s"Error computing feature percentiles for current time window - ${exception.getMessage}")) - case (Success(currentSummarySeriesFuture), Success(baselineSummarySeriesFuture)) => + val result = for { + current <- Future.fromTry(maybeCurrentSummarySeries).flatten + baseline <- Future.fromTry(maybeBaselineSummarySeries).flatten + } yield { + val currentFeatureTs = if (current.isEmpty) Seq.empty + else convertTileSummarySeriesToTimeSeries(current.head, metric) + val baselineFeatureTs = if (baseline.isEmpty) Seq.empty + else convertTileSummarySeriesToTimeSeries(baseline.head, metric) + val comparedTsData = ComparedFeatureTimeSeries(name, baselineFeatureTs, currentFeatureTs) + Ok(comparedTsData.asJson.noSpaces) + } + result.recover { + case e: Exception => + InternalServerError(s"Error computing feature percentiles: ${e.getMessage}") + }
283-285
: Remove commented out codeThe commented cases for "skew" and "ooc" should be removed if they're not being used. If they're planned for future implementation, consider tracking this in a GitHub issue instead.
-// case Some("skew") => Some(Skew) -// case Some("ooc") => Some(Skew)
26-28
: Consider adding performance monitoring
With the transition to bulk HTTP calls and InMemoryKV store, consider adding metrics to monitor (a sketch follows the list below):
- Response times for bulk operations
- Memory usage patterns
- Cache hit/miss rates
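A minimal shape for that instrumentation, illustrative only — a production setup would report to a metrics backend rather than printing:
```scala
import scala.concurrent.{ExecutionContext, Future}

// Illustrative timing wrapper for bulk KV calls; swap the println for StatsD/Prometheus.
def timedFuture[T](metric: String)(block: => Future[T])(implicit ec: ExecutionContext): Future[T] = {
  val start = System.nanoTime()
  val fut = block
  fut.onComplete { _ =>
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"$metric took ${elapsedMs} ms")
  }
  fut
}

// e.g. timedFuture("kvstore.bulkPut")(kvStore.multiPut(requests))
```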
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (34)
- .github/workflows/test_scala_no_spark.yaml (1 hunks)
- build.sbt (4 hunks)
- docker-init/Dockerfile (1 hunks)
- docker-init/demo/README.md (1 hunks)
- docker-init/demo/load_summaries.sh (1 hunks)
- docker-init/demo/log4j2.properties (0 hunks)
- docker-init/start.sh (0 hunks)
- hub/app/controllers/InMemKVStoreController.scala (1 hunks)
- hub/app/controllers/JoinController.scala (1 hunks)
- hub/app/controllers/ModelController.scala (1 hunks)
- hub/app/controllers/Paginate.scala (1 hunks)
- hub/app/controllers/SearchController.scala (2 hunks)
- hub/app/controllers/TimeSeriesController.scala (3 hunks)
- hub/app/model/Model.scala (2 hunks)
- hub/app/module/DriftStoreModule.scala (1 hunks)
- hub/app/module/DynamoDBModule.scala (0 hunks)
- hub/app/module/InMemoryKVStoreModule.scala (1 hunks)
- hub/app/module/ModelStoreModule.scala (1 hunks)
- hub/app/store/MonitoringModelStore.scala (4 hunks)
- hub/conf/application.conf (1 hunks)
- hub/conf/routes (1 hunks)
- hub/test/controllers/JoinControllerSpec.scala (1 hunks)
- hub/test/controllers/ModelControllerSpec.scala (3 hunks)
- hub/test/controllers/SearchControllerSpec.scala (4 hunks)
- hub/test/controllers/TimeSeriesControllerSpec.scala (4 hunks)
- hub/test/store/MonitoringModelStoreTest.scala (2 hunks)
- online/src/main/scala/ai/chronon/online/HTTPKVStore.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/MetadataEndPoint.scala (1 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (4 hunks)
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/Driver.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/scripts/DataServer.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemoDataLoader.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala (2 hunks)
💤 Files with no reviewable changes (3)
- docker-init/demo/log4j2.properties
- docker-init/start.sh
- hub/app/module/DynamoDBModule.scala
🚧 Files skipped from review as they are similar to previous changes (19)
- .github/workflows/test_scala_no_spark.yaml
- docker-init/Dockerfile
- hub/app/controllers/InMemKVStoreController.scala
- hub/app/controllers/JoinController.scala
- hub/app/controllers/ModelController.scala
- hub/app/controllers/Paginate.scala
- hub/app/module/DriftStoreModule.scala
- hub/app/module/InMemoryKVStoreModule.scala
- hub/app/module/ModelStoreModule.scala
- hub/app/store/MonitoringModelStore.scala
- hub/conf/application.conf
- hub/test/controllers/JoinControllerSpec.scala
- hub/test/controllers/ModelControllerSpec.scala
- hub/test/controllers/SearchControllerSpec.scala
- hub/test/store/MonitoringModelStoreTest.scala
- online/src/main/scala/ai/chronon/online/MetadataEndPoint.scala
- online/src/main/scala/ai/chronon/online/stats/TileDriftCalculator.scala
- spark/src/main/scala/ai/chronon/spark/Driver.scala
- spark/src/main/scala/ai/chronon/spark/stats/drift/Summarizer.scala
🧰 Additional context used
📓 Learnings (1)
hub/test/controllers/TimeSeriesControllerSpec.scala (1)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#22
File: hub/test/controllers/TimeSeriesControllerSpec.scala:40-49
Timestamp: 2024-11-12T09:38:33.532Z
Learning: When testing with mocked data, it's acceptable to not add detailed assertions on the data contents.
🪛 LanguageTool
docker-init/demo/README.md
[grammar] ~20-~20: Replace ‘exec’ with an appropriate verb.
Context: ...n build.sh once, and you can repeatedly exec to quickly visualize In first termina...
(MD_NN)
[uncategorized] ~22-~22: You might be missing the article “the” here.
Context: ...peatedly exec to quickly visualize In first terminal: sbt spark/assembly
In seco...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~23-~23: You might be missing the article “the” here.
Context: ...irst terminal: sbt spark/assembly
In second terminal: ./run.sh
to load the built ...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~24-~24: You might be missing the article “the” here.
Context: ...and serve the data on localhost:8181 In third terminal: streamlit run viz.py
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
🪛 Shellcheck (0.10.0)
docker-init/demo/load_summaries.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
🔇 Additional comments (33)
docker-init/demo/load_summaries.sh (2)
1-2
: Add shell script essentials
The script is missing crucial shell script components that were previously flagged.
Apply this diff to improve the script header:
+#!/bin/bash
+
+# Script: load_summaries.sh
+# Purpose: Loads summary data for ObservabilityDemo using Spark
+#
+# Requirements:
+# - Docker and docker-compose must be running
+# - Containers must be up (see docker-init/demo/README.md)
+#
+# Usage: ./load_summaries.sh
+
# Kick off the ObsDemo spark job in the app container
🧰 Tools
🪛 Shellcheck (0.10.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
3-12
: 🛠️ Refactor suggestion
Optimize Spark configuration for demo environment
There are several issues with the current Spark configuration:
- Memory settings are duplicated (
--driver-memory
andspark.driver.memory
) - The allocated memory (8GB) might be excessive for a demo environment
- The class path and JAR locations should be verified
Suggested improvements:
docker-compose -f docker-init/compose.yaml exec app /opt/spark/bin/spark-submit \
--master "local[*]" \
- --driver-memory 8g \
+ --driver-memory "${DRIVER_MEMORY:-1g}" \
--conf "spark.driver.maxResultSize=6g" \
- --conf "spark.driver.memory=8g" \
--driver-class-path "/opt/spark/jars/*:/app/cli/*" \
--conf "spark.driver.host=localhost" \
--conf "spark.driver.bindAddress=0.0.0.0" \
--class ai.chronon.spark.scripts.ObservabilityDemoDataLoader \
/app/cli/spark.jar
Consider:
- Using environment variables for memory configuration
- Removing the duplicate memory setting
- Adding error handling for missing JARs or class
docker-init/demo/README.md (1)
12-17
: Documentation needs more details about the data loading process
The data loading section requires more detailed information about the process, expected duration, verification steps, and troubleshooting guidance.
hub/app/controllers/SearchController.scala (1)
5-8
: Verify MonitoringModelStore binding in dependency injection module
The transition from DynamoDBMonitoringStore to MonitoringModelStore looks good and aligns with the PR objectives to address performance issues. However, we should verify the DI setup.
Also applies to: 15-15
✅ Verification successful
Based on the analysis of the code, I can now generate the final response:
DI setup for MonitoringModelStore is correctly configured
The dependency injection setup is properly implemented:
ModelStoreModule
correctly bindsMonitoringModelStore
using the new abstraction- The transition from DynamoDB is handled through the
AwsApiImpl
which provides the underlying KVStore implementation - The store abstraction allows for different KVStore implementations (MongoKvStore, HTTPKVStore, etc.) making it flexible for testing and future changes
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the MonitoringModelStore binding in the DI module
# Expected: Find the module configuration that binds MonitoringModelStore
# Search for the DI module configuration
rg -t scala "bind.*MonitoringModelStore"
# Verify no remaining references to DynamoDBMonitoringStore
rg -t scala "DynamoDBMonitoringStore"
Length of output: 194
Script:
#!/bin/bash
# Let's verify the complete DI setup and implementation
# 1. Check the ModelStoreModule implementation
cat hub/app/module/ModelStoreModule.scala
# 2. Check if there are any remaining DynamoDB related imports or references
rg -t scala "DynamoDB|dynamodb" -A 2 -B 2
# 3. Look for the MonitoringModelStore trait/class definition
ast-grep --pattern 'trait MonitoringModelStore {
$$$
}'
ast-grep --pattern 'class MonitoringModelStore {
$$$
}'
Length of output: 20441
Script:
#!/bin/bash
# Let's check the MonitoringModelStore implementation and its relationship with DynamoDB
cat hub/app/store/MonitoringModelStore.scala
# Also check if there are any other store implementations that might be relevant
fd -t f ".*Store.scala" -x cat {}
Length of output: 44933
hub/app/model/Model.scala (1)
57-60
: Clarify the distinction between List and Search responses
A few concerns about the response classes:
- The comment "Currently search only covers joins" seems to conflict with having
ListModelResponse
SearchJoinResponse
andListJoinResponse
have identical structures. Consider:- Merging them if they serve the same purpose
- Adding documentation to clarify their distinct use cases if they differ
Let's check how these response types are used:
online/src/main/scala/ai/chronon/online/HTTPKVStore.scala (6)
10-11
: Remove or clarify the "hacky" designation
Based on the PR objectives, this code is used in production for drift metrics computation. The "hacky test" designation is inappropriate for production code.
13-14
: Consider making backend lazy or transient to avoid serialization issues
Given this code's integration with Spark (as per PR objectives), the current backend initialization could cause serialization issues in Spark executors.
16-16
: Implement or document unimplemented methods
The multiGet
and bulkPut
methods are marked with ???
. Given this is for production drift metrics computation, these methods need proper implementation or documentation.
Also applies to: 39-39
41-43
: Improve create method implementation
The current implementation silently skips dataset creation which could lead to issues in production.
18-37
:
Add retry mechanism to prevent data loss
Given the PR objectives mention previous issues with DynamoDB ingestion and data loss, this implementation should include a retry mechanism with exponential backoff.
+ import scala.concurrent.duration._
+ import scala.annotation.tailrec
+
override def multiPut(putRequests: collection.Seq[KVStore.PutRequest]): Future[collection.Seq[Boolean]] = {
if (putRequests.isEmpty) {
Future.successful(Seq.empty)
} else {
+ @tailrec
+ def retryWithBackoff(attempts: Int, backoff: Duration): (StatusCode, String) = {
+ if (attempts <= 0) {
+ throw new RuntimeException("Max retry attempts exceeded")
+ }
+
val response = basicRequest
.post(uri"$baseUrl/data")
.header("Content-Type", "application/json")
.body(jsonList(putRequests))
+ .timeout(30.seconds)
.send(backend)
+
+ response.code match {
+ case StatusCode.Ok => (response.code, response.body)
+ case StatusCode.TooManyRequests | StatusCode.ServiceUnavailable =>
+ Thread.sleep(backoff.toMillis)
+ retryWithBackoff(attempts - 1, backoff * 2)
+ case _ => (response.code, response.body)
+ }
+ }
+
+ Future {
+ retryWithBackoff(3, 1.second)
}.map { response =>
- response.code match {
+ response._1 match {
case StatusCode.Ok => Seq(true)
case _ =>
- logger.error(s"HTTP multiPut failed with status ${response.code}: ${response.body}")
+ logger.error(s"HTTP multiPut failed with status ${response._1}: ${response._2}")
Seq(false)
}
}
}
}
Likely invalid or redundant comment.
46-56
:
Add input validation for dataset field
The dataset field is directly interpolated into the JSON string without validation, which could lead to JSON injection vulnerabilities.
def jsonString(request: PutRequest): String = {
+ require(request.dataset.matches("[a-zA-Z0-9_-]+"), "Dataset name contains invalid characters")
+
val keyBase64 = Base64.getEncoder.encodeToString(request.keyBytes)
val valueBase64 = Base64.getEncoder.encodeToString(request.valueBytes)
- s"""{ "keyBytes": "${keyBase64}", "valueBytes": "${valueBase64}", "dataset": "${request.dataset}", "tsMillis": ${request.tsMillis.orNull}}""".stripMargin
+ s"""{ "keyBytes": "${keyBase64}", "valueBytes": "${valueBase64}", "dataset": "${request.dataset.replaceAll("\"", "\\\"")}", "tsMillis": ${request.tsMillis.map(_.toString).getOrElse("null")}}""".stripMargin
}
Likely invalid or redundant comment.
hub/conf/routes (3)
4-5
: LGTM! Well-structured RESTful endpoints for join operations.
The new join-related endpoints follow RESTful conventions and properly handle pagination.
19-20
:
Address memory management concerns and clarify temporary status.
While this endpoint addresses the DynamoDB ingestion issues mentioned in the PR objectives, there are important considerations:
-
Memory Management (previously raised):
- No size limits on the in-memory store
- Missing eviction policy
- No persistence mechanism
- Potential OOM risks with bulk operations
-
Temporary Solution Clarity:
- Define criteria for when this temporary solution will be replaced
- Document data lifecycle and cleanup strategy
- Add monitoring for memory usage
Consider adding configuration parameters to application.conf
:
+ # In-memory store configuration
+ inmemory.store {
+ # Maximum size in MB
+ max-size = 512
+ # Eviction policy: LRU, FIFO, etc.
+ eviction-policy = "LRU"
+ # Cleanup interval in minutes
+ cleanup-interval = 60
+ }
8-9
: Please clarify the plan for the TBD model timeseries endpoint.
While commenting out the endpoint aligns with the transition from a model-based approach, leaving it as "TBD" without a clear plan could create technical debt. Consider:
- Adding a timeline for resolving the TBD status
- Documenting whether this endpoint will be replaced or removed
- Creating a tracking issue for this work
Let's check if this endpoint is still referenced elsewhere:
spark/src/main/scala/ai/chronon/spark/scripts/ObservabilityDemoDataLoader.scala (5)
20-22
: LGTM! Well-structured logger initialization.
The logger is properly initialized using SLF4J with appropriate transient and lazy modifiers to handle serialization correctly.
31-57
: Fix configuration validation issues.
The configuration class has several validation issues that need to be addressed, including invalid default end date and missing validations.
This issue was previously identified in a past review comment. The same concerns about date validation and minimum row count validation remain valid.
67-70
: Avoid hardcoding Spark configuration.
This issue was previously identified in a past review comment. The same concern about hardcoded local mode remains valid.
80-87
: Improve error handling for KVStore initialization.
This issue was previously identified in a past review comment. The same concerns about null function and error handling remain valid.
110-119
: Improve cleanup and error handling.
This issue was previously identified in a past review comment. The same concerns about System.exit and proper cleanup remain valid.
spark/src/main/scala/ai/chronon/spark/scripts/DataServer.scala (1)
39-39
: Verify thread safety of DriftStore.binarySerializer
The code obtains a new serializer instance for each request. Let's verify that the serializer implementation is thread-safe.
✅ Verification successful
Based on the search results, I found the implementation of DriftStore.binarySerializer
in online/src/main/scala/ai/chronon/online/stats/DriftStore.scala
:
lazy val binarySerializer: ThreadLocal[TSerializer] = new ThreadLocal[TSerializer] {
override def initialValue(): TSerializer = new TSerializer()
}
Thread safety verified for DriftStore.binarySerializer
The implementation uses ThreadLocal
which provides thread-safe access by maintaining a separate TSerializer
instance for each thread. This is a correct and thread-safe approach for managing serializer instances that are not thread-safe themselves.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for DriftStore implementation to verify thread safety
ast-grep --pattern 'class DriftStore {
$$$
val binarySerializer = $$$
$$$
}'
# Look for any synchronization or thread-safety related comments/annotations
rg -i "thread|synchronized|concurrent" --type scala
Length of output: 18709
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (4)
15-16
: LGTM: Import changes align with serializer implementation
The imports correctly reference the new binary serializer implementations.
203-211
: LGTM: Efficient thread-safe serializer implementation
Good improvements:
- Thread-local implementation ensures thread safety
- @transient annotation prevents serialization issues
- Binary protocol is more efficient than compact protocol
- Lazy initialization prevents unnecessary object creation
213-234
: Consider making percentile labels more configurable and type-safe
The hard-coded sequence of percentile labels should be made more maintainable and flexible, as suggested in the previous review.
77-78
: Verify thread-safety of serializer usage in multi-get operations
While the thread-local implementation is correct, ensure that the multiGet
operation in kvStore
properly handles concurrent requests, as this is in a performance-critical path.
✅ Verification successful
Based on the analysis of the shell script outputs, I can now generate the final response:
Thread-safety of serializer usage is properly implemented
The serializer usage in DriftStore.scala
is thread-safe because:
- The
binarySerializer
is implemented as aThreadLocal
variable, ensuring each thread gets its own instance - The
get()
method is used correctly to access the thread-local value - The serializer is used in a read-only manner for key serialization within the scope of the method
- The implementation follows the same pattern used in other parts of the codebase for thread-safe serializer handling (e.g.,
ThriftJsonCodec
,AvroCodec
)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for multiGet implementations and concurrent access patterns
ast-grep --pattern 'def multiGet($$$) {
$$$
}'
Length of output: 50
Script:
#!/bin/bash
# Search for multiGet and related concurrent patterns with ripgrep for broader coverage
rg -U "multiGet|ConcurrentMap|synchronized|ThreadLocal" -A 5
Length of output: 20270
Script:
#!/bin/bash
# Look for KVStore implementations and thread safety patterns
rg "trait\s+KVStore|class\s+.*KVStore" -A 10
Length of output: 25348
Script:
#!/bin/bash
# Search for serializer usage patterns
rg "serializer" -A 3
Length of output: 27919
build.sbt (2)
268-271
: LGTM! Comprehensive logging exclusions
The additional log4j exclusions are well-structured and ensure consistent logging behavior across the application.
248-248
: Verify the impact of spark dependency on hub project
Adding spark as a dependency to hub is a significant change that could affect:
- Build times
- Artifact size
- Potential circular dependencies
Let's verify the dependency graph:
✅ Verification successful
Based on the output, I can now generate the final response:
No circular dependencies detected in spark and hub integration
The dependency analysis shows:
- The spark project depends on aggregator and online projects, not on hub
- The hub project depends on cloud_aws and spark, forming a linear dependency chain
- The spark project has proper test and compile scope configurations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for circular dependencies and analyze dependency tree
# Note: This will help identify any potential issues with the new dependency structure
# Check if there are any circular dependencies
sbt "hub/dependencyTree"
# Check the size impact
sbt "hub/assembly"
ls -lh hub/target/scala-*/hub-assembly-*.jar
Length of output: 365
Script:
#!/bin/bash
# Let's analyze the dependency structure using available tools
# Check if spark is referenced in other projects' dependencies
rg -l "dependsOn.*spark.*)"
# Look for any potential circular references where spark depends on hub
rg -l "dependsOn.*hub.*)"
# Check build.sbt for spark project definition and its dependencies
fd "build.sbt" --exec grep -A 10 "lazy val spark" {}
# Look for any other potential dependency relationships
fd "build.sbt" --exec grep -A 5 "dependsOn" {}
Length of output: 3218
hub/test/controllers/TimeSeriesControllerSpec.scala (3)
275-275
: Typo in variable name nullRationChangePercents
The variable name has a typo: "Ration" should be "Ratio".
288-288
: Remove duplicate setNodeName
call
The setNodeName(join)
is called twice on the tileKey
object.
211-214
: 🛠️ Refactor suggestion
Consider edge cases in duration calculation
The expected length calculation assumes that the time difference will always result in whole hours. This might fail for timestamps that don't align perfectly with hour boundaries.
Consider using ceiling division to handle partial hours:
- val expectedLength = DriftStore.percentileLabels.length * expectedHours(startTs, endTs)
+ val expectedLength = DriftStore.percentileLabels.length * ((endTs - startTs + 3600000 - 1) / 3600000)
hub/app/controllers/TimeSeriesController.scala (4)
3-8
: LGTM! Constructor changes align with async processing requirements.
The addition of DriftStore
dependency and ExecutionContext
follows best practices for dependency injection and async processing in Play Framework.
Also applies to: 15-20, 26-28
90-99
: LGTM! Comprehensive error handling with specific messages.
The error handling provides clear, actionable messages for different failure scenarios.
215-220
:
Add null checks to prevent NullPointerException
in data conversion methods
In convertTileDriftSeriesInfoToTimeSeries
, accessing properties without null checks may lead to NullPointerException
.
- val isNumeric =
- tileDriftSeries.percentileDriftSeries.asScala != null && tileDriftSeries.percentileDriftSeries.asScala
- .exists(_ != null)
+ val percentileSeries = Option(tileDriftSeries.percentileDriftSeries).map(_.asScala).getOrElse(Seq.empty)
+ val histogramSeries = Option(tileDriftSeries.histogramDriftSeries).map(_.asScala).getOrElse(Seq.empty)
+ val isNumeric = percentileSeries.nonEmpty && percentileSeries.exists(_ != null)
46-46
: Verify backward compatibility of API changes
The removal of the slice parameter from fetchJoin
and addition of the join parameter to fetchFeature
might impact existing clients.
Also applies to: 53-62
✅ Verification successful
Let me check the actual usage context of these endpoints to better understand the impact.
API changes are safe and backward compatible
The verification shows that:
- The
slice
parameter was never part of thefetchJoin
endpoint signature in the routes or tests - The
join
parameter infetchFeature
is part of a new endpoint path/api/v1/join/:join/feature/:name/timeseries
while the old endpoints are commented out but preserved - All tests have been updated to use the new signatures consistently
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for existing usages of these endpoints
echo "Searching for fetchJoin and fetchFeature endpoint usages..."
rg -l 'fetchJoin.*slice' || rg -l 'fetchFeature[^j]'
Length of output: 297
Script:
#!/bin/bash
# Check the actual endpoint definitions and usages
echo "Checking routes file..."
cat hub/conf/routes | grep -E "fetchJoin|fetchFeature"
echo -e "\nChecking controller implementation..."
rg "fetchFeature" hub/app/controllers/TimeSeriesController.scala -A 5
echo -e "\nChecking test implementations..."
rg "fetchFeature|fetchJoin" hub/test/controllers/TimeSeriesControllerSpec.scala -A 5
Length of output: 8173
# Populate Observability Demo Data
To populate the observability demo data:
* Launch the set of docker containers:
```bash
~/workspace/chronon $ docker-compose -f docker-init/compose.yaml up --build
...
app-1  | [info] 2024-11-26 05:10:45,758 [main] INFO  play.api.Play - Application started (Prod) (no global state)
app-1  | [info] 2024-11-26 05:10:45,958 [main] INFO  play.core.server.AkkaHttpServer - Listening for HTTP on /[0:0:0:0:0:0:0:0]:9000
```
(you can skip the --build if you don't wish to rebuild your code)
🛠️ Refactor suggestion
Enhance Docker setup instructions
The setup instructions should include prerequisites and verification steps.
Add the following sections:
# Populate Observability Demo Data
+
+## Prerequisites
+- Docker v20.10 or higher
+- Docker Compose v2.0 or higher
+- Minimum 4GB RAM available for containers
+
+## Setup
To populate the observability demo data:
* Launch the set of docker containers:
```bash
~/workspace/chronon $ docker-compose -f docker-init/compose.yaml up --build
...
app-1 | [info] 2024-11-26 05:10:45,758 [main] INFO play.api.Play - Application started (Prod) (no global state)
app-1 | [info] 2024-11-26 05:10:45,958 [main] INFO play.core.server.AkkaHttpServer - Listening for HTTP on /[0:0:0:0:0:0:0:0]:9000
(you can skip the --build if you don't wish to rebuild your code)
+
+Expected startup time: ~2-3 minutes
+
+### Verify Setup
+Ensure all containers are running:
+bash +~/workspace/chronon $ docker-compose -f docker-init/compose.yaml ps +
+
+All services should show status as "running".
<!-- This is an auto-generated comment by CodeRabbit -->
# Streamlit local experimentation
run build.sh once, and you can repeatedly exec to quickly visualize

In first terminal: `sbt spark/assembly`
In second terminal: `./run.sh` to load the built jar and serve the data on localhost:8181
In third terminal: `streamlit run viz.py`
🛠️ Refactor suggestion
Restructure and clarify Streamlit instructions
The Streamlit section needs better formatting, clearer instructions, and proper grammar.
Replace the section with:
-# Streamlit local experimentation
-run build.sh once, and you can repeatedly exec to quickly visualize
-
-In first terminal: `sbt spark/assembly`
-In second terminal: `./run.sh` to load the built jar and serve the data on localhost:8181
-In third terminal: `streamlit run viz.py`
+# Local Development with Streamlit
+
+For local experimentation with the Streamlit visualization:
+
+1. Build the Spark assembly (required once):
+ ```bash
+ ~/workspace/chronon $ sbt spark/assembly
+ ```
+
+2. Start the data server:
+ ```bash
+ ~/workspace/chronon $ ./run.sh
+ ```
+ This loads the built JAR and serves data at http://localhost:8181
+
+3. Launch the Streamlit application:
+ ```bash
+ ~/workspace/chronon $ streamlit run viz.py
+ ```
+ Access the visualization at http://localhost:8501
🧰 Tools
🪛 LanguageTool
[grammar] ~20-~20: Replace ‘exec’ with an appropriate verb.
Context: ...n build.sh once, and you can repeatedly exec to quickly visualize In first termina...
(MD_NN)
[uncategorized] ~22-~22: You might be missing the article “the” here.
Context: ...peatedly exec to quickly visualize In first terminal: sbt spark/assembly
In seco...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~23-~23: You might be missing the article “the” here.
Context: ...irst terminal: sbt spark/assembly
In second terminal: ./run.sh
to load the built ...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~24-~24: You might be missing the article “the” here.
Context: ...and serve the data on localhost:8181 In third terminal: streamlit run viz.py
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
// a trivial search where we check the join name for similarity with the search term
private def searchRegistry(term: String): Seq[Join] = {
  val joins = monitoringStore.getJoins
  joins.filter(j => j.name.contains(term))
🛠️ Refactor suggestion
Consider enhancing search functionality and error handling
The current implementation has several areas for improvement:
- Case-sensitive search might not be user-friendly
- No error handling for monitoringStore.getJoins
- Potential performance impact of filtering all joins in memory
Consider this improvement:
private def searchRegistry(term: String): Seq[Join] = {
- val joins = monitoringStore.getJoins
- joins.filter(j => j.name.contains(term))
+ try {
+ val normalizedTerm = term.toLowerCase
+ monitoringStore.getJoins.filter(j =>
+ j.name.toLowerCase.contains(normalizedTerm)
+ )
+ } catch {
+ case e: Exception =>
+ logger.error(s"Failed to search joins: ${e.getMessage}", e)
+ Seq.empty
+ }
}
Committable suggestion skipped: line range outside the PR's diff.
def time(message: String)(block: => Unit): Unit = {
  logger.info(s"$message..".yellow)
  val start = System.currentTimeMillis()
  block
  val end = System.currentTimeMillis()
  logger.info(s"$message took ${end - start} ms".green)
}
🛠️ Refactor suggestion
Add error handling to timing utility.
The timing utility should handle exceptions while preserving timing information.
Consider this improvement:
- def time(message: String)(block: => Unit): Unit = {
+ def time[T](message: String)(block: => T): T = {
logger.info(s"$message..".yellow)
val start = System.currentTimeMillis()
- block
- val end = System.currentTimeMillis()
- logger.info(s"$message took ${end - start} ms".green)
+ try {
+ val result = block
+ val end = System.currentTimeMillis()
+ logger.info(s"$message took ${end - start} ms".green)
+ result
+ } catch {
+ case e: Exception =>
+ val end = System.currentTimeMillis()
+ logger.error(s"$message failed after ${end - start} ms: ${e.getMessage}".red)
+ throw e
+ }
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
def time[T](message: String)(block: => T): T = {
  logger.info(s"$message..".yellow)
  val start = System.currentTimeMillis()
  try {
    val result = block
    val end = System.currentTimeMillis()
    logger.info(s"$message took ${end - start} ms".green)
    result
  } catch {
    case e: Exception =>
      val end = System.currentTimeMillis()
      logger.error(s"$message failed after ${end - start} ms: ${e.getMessage}".red)
      throw e
  }
}
```
val httpKvStoreFunc: () => KVStore = () => {
  // cannot reuse the variable - or serialization error
  val result = new HTTPKVStore()
  result
}
val hubApi = new MockApi(httpKvStoreFunc, namespace)
Ensure thread-safe access to shared KV store.
The KV store initialization and usage pattern could lead to thread safety issues in a concurrent environment.
Consider these improvements:
- Make the KV store thread-safe or ensure synchronized access
- Use connection pooling for HTTP KV store
- Add proper resource cleanup
val httpKvStoreFunc: () => KVStore = () => {
- // cannot reuse the variable - or serialization error
- val result = new HTTPKVStore()
- result
+ synchronized {
+ val store = new HTTPKVStore()
+ Runtime.getRuntime().addShutdownHook(new Thread(() => {
+ store.close()
+ }))
+ store
+ }
}
Committable suggestion skipped: line range outside the PR's diff.
@@ -565,7 +566,7 @@ object Driver {
 lazy val api: Api = impl(serializableProps)

 def metaDataStore =
   new MetadataStore(impl(serializableProps).genKvStore, "ZIPLINE_METADATA", timeoutMillis = 10000)
good catch!
looks great! minor comments
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (11)
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (3)
Line range hint
77-91
: Consider moving serializer.get() inside the Future for better thread safety
While the deserializer is correctly obtained inside the Future's map operation, the serializer is obtained outside. This could lead to thread safety issues if the Future executes in a different thread pool.
Consider this improvement:
- val serializer: TSerializer = binarySerializer.get() val tileKeyMap = tileKeysForJoin(joinConf, None, columnPrefix) val requestContextMap: Map[GetRequest, SummaryRequestContext] = tileKeyMap.flatMap { case (group, keys) => keys.map { key => + val serializer = binarySerializer.get() val keyBytes = serializer.serialize(key) val get = GetRequest(keyBytes, summaryDataset, startTsMillis = startMs, endTsMillis = endMs) get -> SummaryRequestContext(get, key, group) } }
203-211
: LGTM: Thread-safe serialization implementation
The ThreadLocal implementation with @transient lazy val is correct and addresses thread safety concerns. The BinaryProtocol choice is well-justified.
Consider improving the warning comment to be more specific about the compact protocol bug:
- // crazy bug in compact protocol - do not change to compact + // WARNING: TCompactProtocol has known issues with concurrent access and data corruption. + // Use TBinaryProtocol for thread-safe serialization.
213-214
: Consider adding validation and edge case handling
Good implementation that addresses the previous review feedback about making percentile labels programmatic. However, consider these improvements:
```diff
- def breaks(count: Int): Seq[String] = (0 to count).map(_ * (100 / count)).map("p" + _.toString)
+ def breaks(count: Int): Seq[String] = {
+   require(count > 0, "Count must be positive")
+   require(count <= 100, "Count cannot exceed 100")
+
+   if (count == 1) Seq("p0", "p100")
+   else (0 to count).map(i => "p" + (i * (100 / count)).toString)
+ }
```
- Input validation
- Special handling for count=1
- More readable formatting
hub/test/controllers/TimeSeriesControllerSpec.scala (4)
76-84
: Consider adding more specific error assertionsWhile the error handling test is good, consider asserting the error message or response body to ensure the correct error is being propagated.
- status(result) mustBe INTERNAL_SERVER_ERROR + val response = contentAsString(result) + status(result) mustBe INTERNAL_SERVER_ERROR + response must include("Some internal error")
Line range hint
87-110
: Add more specific assertions for drift metricsThe successful case test could benefit from more specific assertions on the drift metrics values to ensure the data transformation is correct.
Consider adding assertions for:
- Specific drift metric values
- Time series continuity
- Data point validity
189-194
: Extract magic numbers into named constantsThe offset of 7 days and other numeric values should be extracted into named constants for better maintainability and clarity.
+ private val DefaultOffsetDays = 7 + private val DefaultPercentileDrift = 0.12 + private val DefaultHistogramDrift = 0.23 + private val DefaultNullRatioChange = 0.25 - val offset = Duration.apply(7, TimeUnit.DAYS) + val offset = Duration.apply(DefaultOffsetDays, TimeUnit.DAYS)
254-282
: Consider parameterizing mock data generationThe mock data generation could be more flexible and maintainable by accepting parameters for the drift values instead of using hard-coded values.
Consider refactoring to:
private def generateDriftSeries( startTs: Long, endTs: Long, join: String, numGroups: Int, numFeaturesPerGroup: Int, + percentileDrift: Double = 0.12, + histogramDrift: Double = 0.23, + nullRatioChange: Double = 0.25 ): Try[Future[Seq[TileDriftSeries]]] = {hub/app/controllers/TimeSeriesController.scala (4)
53-62
: Document the new 'join' parameter in the method's scaladocThe scaladoc comment should be updated to explain the purpose and format of the new 'join' parameter.
90-112
: Simplify error handling with for-comprehensionThe nested pattern matching can be simplified using for-comprehension for better readability.
- (parseOffset(offset), parseAlgorithm(algorithm)) match { - case (None, _) => Future.successful(BadRequest(s"Unable to parse offset - $offset")) - case (_, None) => Future.successful(BadRequest("Invalid drift algorithm. Expect JSD, PSI or Hellinger")) - case (Some(o), Some(driftMetric)) => + for { + o <- Future.fromOption(parseOffset(offset))(BadRequest(s"Unable to parse offset - $offset")) + driftMetric <- Future.fromOption(parseAlgorithm(algorithm))(BadRequest("Invalid drift algorithm. Expect JSD, PSI or Hellinger")) + result <- { val window = new Window(o.toMinutes.toInt, TimeUnit.MINUTES) val joinPath = name.replaceFirst("\\.", "/") driftStore.getDriftSeries(joinPath, driftMetric, window, startTs, endTs) .fold( ex => Future.successful(InternalServerError(s"Error computing join drift - ${ex.getMessage}")), _.map { driftSeries => // ... rest of the success handling ... } ) + } + } yield result
171-202
: Consider extracting the percentile computation logic

The percentile computation logic is complex and would benefit from being extracted into a separate method for better maintainability.

```diff
+ private def computePercentiles(
+     currentSeries: Seq[TileSummarySeries],
+     baselineSeries: Seq[TileSummarySeries],
+     metric: Metric,
+     name: String
+ ): ComparedFeatureTimeSeries = {
+   val currentFeatureTs = {
+     if (currentSeries.isEmpty) Seq.empty
+     else convertTileSummarySeriesToTimeSeries(currentSeries.head, metric)
+   }
+   val baselineFeatureTs = {
+     if (baselineSeries.isEmpty) Seq.empty
+     else convertTileSummarySeriesToTimeSeries(baselineSeries.head, metric)
+   }
+   ComparedFeatureTimeSeries(name, baselineFeatureTs, currentFeatureTs)
+ }
```
283-285
: Remove commented out code

The commented-out cases for "skew" and "ooc" should be removed if they are no longer needed. If they are planned for future implementation, consider adding a TODO comment with a ticket reference.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- hub/app/controllers/TimeSeriesController.scala (3 hunks)
- hub/test/controllers/TimeSeriesControllerSpec.scala (4 hunks)
- online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (4 hunks)
🧰 Additional context used
📓 Learnings (1)
hub/test/controllers/TimeSeriesControllerSpec.scala (1)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#22
File: hub/test/controllers/TimeSeriesControllerSpec.scala:40-49
Timestamp: 2024-11-12T09:38:33.532Z
Learning: When testing with mocked data, it's acceptable to not add detailed assertions on the data contents.
🔇 Additional comments (7)
online/src/main/scala/ai/chronon/online/stats/DriftStore.scala (1)
15-16
: LGTM: Import changes align with new serialization implementation
The new imports correctly reference the ThreadLocal binary serializers that replace the previous compact serializers.
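For context, the per-thread serializer pattern referred to here usually looks like the sketch below; this is illustrative and not the DriftStore code itself, and the object/val names are assumptions:

```scala
import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.{TDeserializer, TSerializer}

object BinarySerdes {
  // Thrift (de)serializers are not thread-safe, so each thread gets its own instance.
  val serializer: ThreadLocal[TSerializer] =
    ThreadLocal.withInitial(() => new TSerializer(new TBinaryProtocol.Factory()))

  val deserializer: ThreadLocal[TDeserializer] =
    ThreadLocal.withInitial(() => new TDeserializer(new TBinaryProtocol.Factory()))
}
```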
hub/test/controllers/TimeSeriesControllerSpec.scala (3)
Line range hint 3-42
: LGTM: Well-structured test setup with proper mocking
The test setup follows best practices with clear organization of imports and proper initialization of mocked components.
275-275
: Fix typo in variable name
The variable name has a typo: nullRationChangePercents should be nullRatioChangePercents.
288-288
: Remove duplicate setNodeName call
The setNodeName(join) is called twice on the tileKey object.
hub/app/controllers/TimeSeriesController.scala (3)
26-28
: LGTM: Clean dependency injection setup
The controller's constructor properly uses dependency injection for DriftStore and follows Play Framework best practices with implicit ExecutionContext.
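As a reference point, the injected-constructor shape being praised here typically looks like the following in Play; the parameter list is a sketch under assumed names, not the controller's exact signature:

```scala
import javax.inject.{Inject, Singleton}

import scala.concurrent.ExecutionContext

import ai.chronon.online.stats.DriftStore
import play.api.mvc.{AbstractController, ControllerComponents}

// Illustrative shape only; async actions map over Futures using the injected ExecutionContext.
@Singleton
class TimeSeriesControllerSketch @Inject() (cc: ControllerComponents, driftStore: DriftStore)(implicit
    ec: ExecutionContext
) extends AbstractController(cc)
```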
215-220
: Add null checks to prevent NullPointerException

This is a duplicate of a previous review comment. The null handling issues for percentileDriftSeries and histogramDriftSeries still need to be addressed.
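A minimal sketch of the kind of guard meant here, assuming the two series values can arrive as null from the upstream structs (the helper name and usage are illustrative, not the controller's actual code):

```scala
// Hypothetical helper: treat a null sequence as empty before mapping over it,
// so a missing percentile or histogram series cannot throw a NullPointerException.
def orEmpty[A](xs: Seq[A]): Seq[A] = Option(xs).getOrElse(Seq.empty)

// usage sketch: orEmpty(percentileDriftSeries).map(toPoint)
//               orEmpty(histogramDriftSeries).map(toPoint)
```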
71-202
: Verify error handling paths for async operations
Let's verify that all error paths are properly handled and tested.
✅ Verification successful
Based on the results, I can now generate the final response since we have comprehensive information about the error handling patterns in the codebase.
Error handling implementation is comprehensive and well-structured
The error handling for async operations is properly implemented with:
- Consistent validation of input parameters (metric choice, rollup, granularity) with clear error messages
- Proper handling of computation failures in drift calculations
- Detailed error messages for both single and compound failure scenarios
- Appropriate HTTP status codes (400 for client errors, 500 for server errors)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Search for test coverage of error scenarios
rg -A 5 "it should.*error" hub/test/controllers/TimeSeriesControllerSpec.scala

# Search for error handling patterns
ast-grep --pattern 'Future.successful(BadRequest($_))'
ast-grep --pattern 'Future.successful(InternalServerError($_))'
```
Length of output: 3361
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
hub/app/model/Model.scala (1)
57-60
: Add documentation for pagination parameters

The response classes have a consistent structure with offset-based pagination. Consider adding documentation to clarify:
- The expected offset values
- Any limits on the number of items returned
- Pagination behavior when offset exceeds available items
Example documentation:
```diff
+ /** Response for listing models with offset-based pagination
+   * @param offset Zero-based offset for pagination
+   * @param items Sequence of models, limited to maximum of N items
+   */
  case class ListModelResponse(offset: Int, items: Seq[Model])
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- hub/app/model/Model.scala (2 hunks)
- hub/app/store/MonitoringModelStore.scala (4 hunks)
- hub/test/controllers/ModelControllerSpec.scala (3 hunks)
- hub/test/controllers/SearchControllerSpec.scala (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- hub/app/store/MonitoringModelStore.scala
- hub/test/controllers/ModelControllerSpec.scala
- hub/test/controllers/SearchControllerSpec.scala
🔇 Additional comments (1)
hub/app/model/Model.scala (1)
5-10
: LGTM! Verify the optional team field requirement.
The enhanced Join case class with online, production, and team fields creates better consistency with the Model case class structure. However, it's worth noting that team is optional (Option[String]) in Join but required (String) in Model.
Let's verify if this difference in team field requirement is intentional:
✅ Verification successful
Let me analyze the search results to verify the team field requirement in Join case class usage:
- Looking at test cases and actual usage:
  - In hub/test/controllers/SearchControllerSpec.scala and ModelControllerSpec.scala, the team field is consistently used with Some("my_team")
  - In hub/app/store/MonitoringModelStore.scala, the team field is populated from thriftJoin.metaData.team and wrapped in Option()
- Looking at Python API usage:
  - Most Join instantiations don't explicitly set the team field
  - When set, it's optional via the team_override parameter (e.g., in api/py/test/test_join.py)
- Analyzing the pattern:
  - The team field being optional (Option[String]) in Join is consistent with its usage across the codebase
  - It allows flexibility where team information may or may not be available
  - The difference from Model (which has required String) appears intentional for different use cases
The optional team field requirement is correctly implemented.
The Join case class's optional team field aligns with its usage patterns across the codebase, where team information is sometimes provided (wrapped in Some) and sometimes omitted. This design choice provides flexibility while maintaining compatibility with the thrift metadata structure.
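To make the contrast concrete, the two shapes under discussion are roughly as follows; the fields other than online, production, and team are assumptions for illustration only:

```scala
// Sketch of the case classes discussed above (field lists are illustrative, not Model.scala verbatim).
case class Join(name: String, joinFeatures: Seq[String], groupBys: Seq[String],
                online: Boolean, production: Boolean, team: Option[String])

case class Model(name: String, join: Join, online: Boolean, production: Boolean, team: String)
```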
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Check for team field usage across the codebase
# Look for patterns where Join or Model objects are created to understand the team field usage
ast-grep --pattern 'new Join($$$)'
ast-grep --pattern 'Join($$$)'
```
Length of output: 45910
Summary

Builds on a couple of the summary computation PRs and data generation to wire things up so that Hub can serve them.

- Yanked out the mock-data-based endpoints (model perf / drift, join & feature skew) - decided it would be confusing to have a mix of mock and generated data, so we just serve the generated data.
- Dropped a few of the scripts introduced in #87. We bring up our containers the usual way, and we have a script `load_summaries.sh` that we can trigger that leverages the existing app container to load data.
- DDB ingestion was taking too long and we were dropping a lot of data due to rejected execution exceptions. To unblock for now, we've gone with an approach of making a bulk put HTTP call from the ObservabilityDemo app -> Hub, with Hub utilizing an InMemoryKV store to persist and serve up features (a rough sketch of that idea follows below).
- Added an endpoint to serve the joins that are configured, as we've switched from the model-based world.

There's still an issue to resolve around fetching individual feature series data. Once I resolve that, we can switch this PR out of wip mode.
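The sketch below illustrates the bulk-put-into-an-in-memory-map idea; it is not the actual Hub `HTTPKVStore` or in-memory store code, and the request shape and names are assumptions:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical request shape: the demo app POSTs a batch of these to Hub in one call.
case class PutRequest(dataset: String, key: Array[Byte], value: Array[Byte], tsMillis: Long)

// Minimal in-memory store: one concurrent map per dataset, bulk writes applied in a single pass.
class InMemoryKvSketch {
  private val store = TrieMap.empty[String, TrieMap[Seq[Byte], (Array[Byte], Long)]]

  def bulkPut(requests: Seq[PutRequest]): Int = {
    requests.foreach { r =>
      val table = store.getOrElseUpdate(r.dataset, TrieMap.empty)
      table.put(r.key.toSeq, (r.value, r.tsMillis))
    }
    requests.size
  }

  def get(dataset: String, key: Array[Byte]): Option[Array[Byte]] =
    store.get(dataset).flatMap(_.get(key.toSeq)).map(_._1)
}
```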
To test / run:

Start up our docker containers:

```
$ docker-compose -f docker-init/compose.yaml up --build
...
```

In a different term, load data:

```
$ ./docker-init/demo/load_summaries.sh
Done uploading summaries! 🥳
```

You can now curl join & feature time series data.

Join drift (null ratios)

```
curl -X GET 'http://localhost:9000/api/v1/join/risk.user_transactions.txn_join/timeseries?startTs=1673308800000&endTs=1674172800000&metricType=drift&metrics=null&offset=10h&algorithm=psi'
```

Join drift (value drift)

```
curl -X GET 'http://localhost:9000/api/v1/join/risk.user_transactions.txn_join/timeseries?startTs=1673308800000&endTs=1674172800000&metricType=drift&metrics=value&offset=10h&algorithm=psi'
```

Feature drift:

```
curl -X GET 'http://localhost:9000/api/v1/join/risk.user_transactions.txn_join/feature/dim_user_account_type/timeseries?startTs=1673308800000&endTs=1674172800000&metricType=drift&metrics=value&offset=1D&algorithm=psi&granularity=aggregates'
```

Feature summaries:

```
curl -X GET 'http://localhost:9000/api/v1/join/risk.user_transactions.txn_join/feature/dim_user_account_type/timeseries?startTs=1673308800000&endTs=1674172800000&metricType=drift&metrics=value&offset=1D&algorithm=psi&granularity=percentile'
```

Join metadata

```
curl -X GET 'http://localhost:9000/api/v1/joins'
curl -X GET 'http://localhost:9000/api/v1/join/risk.user_transactions.txn_join'
```
Checklist

- [x] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update
Summary by CodeRabbit

Release Notes

New Features

- Introduced a new `JoinController` for managing joins with pagination support.
- Added functionality for an in-memory key-value store with bulk data upload capabilities.
- Implemented observability demo data loading within a Spark application.
- Added a new `HTTPKVStore` class for remote key-value store interactions over HTTP.

Improvements

- Enhanced the `ModelController` and `SearchController` to align with the new join data structure.
- Updated the `TimeSeriesController` to support asynchronous operations and improved error handling.
- Refined dependency management in the build configuration for better clarity and maintainability.
- Updated API routes to include new endpoints for listing and retrieving joins.
- Updated configuration to replace the `DynamoDBModule` with `ModelStoreModule`, adding `InMemoryKVStoreModule` and `DriftStoreModule`.

Documentation

- Revised README instructions for Docker container setup and demo data loading.
- Updated API routes documentation to reflect new endpoints for joins and in-memory data operations.

Bug Fixes

- Resolved issues related to error handling in various controllers and improved logging for better traceability.