# Drop Spark BigTable version to unlock DataProc submission #167
**Walkthrough** — This pull request enhances the project's Docker and build configurations by updating dependencies and JAR file management. The changes modify the Dockerfile to download custom JAR files for BigTable and Log4j, update the Bigtable dependency in build.sbt, and configure load_data.sh to pass the custom JARs to spark-submit.
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Dockerfile
    participant build.sbt
    participant load_data.sh
    Dockerfile->>+Dockerfile: Create /opt/custom-jars
    Dockerfile->>+Dockerfile: Download custom JARs
    build.sbt->>+build.sbt: Update Bigtable dependency
    load_data.sh->>+load_data.sh: Configure Spark submit with custom JARs
```
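The load_data.sh step in the flow above can be sketched as a small helper that gathers every JAR in the custom directory into the comma-separated list that spark-submit's `--jars` flag expects. This is an illustrative sketch only: `build_jar_list` and the `CUSTOM_JARS_DIR` variable are hypothetical names, not code from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: collect the Docker-provided JARs for spark-submit.
# CUSTOM_JARS_DIR defaults to the path the Dockerfile creates in this PR.
CUSTOM_JARS_DIR="${CUSTOM_JARS_DIR:-/opt/custom-jars}"

build_jar_list() {
  # Expand every *.jar in the directory and join the paths with commas,
  # the separator spark-submit's --jars flag expects.
  local jars=("$CUSTOM_JARS_DIR"/*.jar)
  (IFS=,; printf '%s\n' "${jars[*]}")
}
```

A caller would then pass the result along, e.g. `spark-submit --jars "$(build_jar_list)" ...`, so only jobs launched through this script pick up the extra JARs.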
Actionable comments posted: 1
🧹 Nitpick comments (1)
quickstart/cloud_gcp/scripts/load_data.sh (1)
**36-38: Add error checks for required JARs**

Check that the required JARs exist before calling spark-submit:

```diff
+# Check for required JARs
+for jar in "spark-bigtable_2.12-0.2.1.jar" "log4j-slf4j-impl-2.20.0.jar"; do
+  if [ ! -f "/opt/custom-jars/$jar" ]; then
+    echo "Error: Required JAR $jar not found in /opt/custom-jars/" >&2
+    exit 1
+  fi
+done
+
 if ! spark-submit --driver-class-path "$CLASSPATH:/opt/custom-jars/*" \
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (3)
- Dockerfile (1 hunks)
- build.sbt (0 hunks)
- quickstart/cloud_gcp/scripts/load_data.sh (1 hunks)
💤 Files with no reviewable changes (1)
- build.sbt
🔇 Additional comments (1)
Dockerfile (1)
**60-65: Verify that these specific versions resolve the opencensus conflict**

Let's verify that these versions resolve the NoSuchMethodError.
```dockerfile
# Add some additional custom jars for other connectors like BigTable etc
RUN mkdir -p /opt/custom-jars && \
    curl -L "https://repo1.maven.org/maven2/com/google/cloud/spark/bigtable/spark-bigtable_2.12/0.2.1/spark-bigtable_2.12-0.2.1.jar" \
    -o /opt/custom-jars/spark-bigtable_2.12-0.2.1.jar && \
    curl -L "https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-slf4j-impl/2.20.0/log4j-slf4j-impl-2.20.0.jar" \
    -o /opt/custom-jars/log4j-slf4j-impl-2.20.0.jar
```
🛠️ Refactor suggestion
Add checksum verification for downloaded JARs
Verify JAR integrity to prevent supply chain attacks.
```diff
 RUN mkdir -p /opt/custom-jars && \
+    # spark-bigtable
     curl -L "https://repo1.maven.org/maven2/com/google/cloud/spark/bigtable/spark-bigtable_2.12/0.2.1/spark-bigtable_2.12-0.2.1.jar" \
-    -o /opt/custom-jars/spark-bigtable_2.12-0.2.1.jar && \
+    -o /opt/custom-jars/spark-bigtable_2.12-0.2.1.jar && \
+    echo "expected-sha256-here  /opt/custom-jars/spark-bigtable_2.12-0.2.1.jar" | sha256sum -c && \
+    # log4j-slf4j-impl
     curl -L "https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-slf4j-impl/2.20.0/log4j-slf4j-impl-2.20.0.jar" \
-    -o /opt/custom-jars/log4j-slf4j-impl-2.20.0.jar
+    -o /opt/custom-jars/log4j-slf4j-impl-2.20.0.jar && \
+    echo "expected-sha256-here  /opt/custom-jars/log4j-slf4j-impl-2.20.0.jar" | sha256sum -c
```
Committable suggestion skipped: line range outside the PR's diff.
thank you sir!!
## Summary

While testing the Flink side of things, I noticed that our DataProc submission was broken due to some opencensus version mismatches:

```
[info] com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: 'io.opencensus.tags.TagContext io.opencensus.tags.unsafe.ContextUtils.getValue(io.grpc.Context)'
[info]   at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1387)
[info]   at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1380)
[info]   at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:53)
[info]   at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
[info]   at com.google.cloud.dataproc.v1.JobControllerClient.submitJob(JobControllerClient.java:435)
[info]   at com.google.cloud.dataproc.v1.JobControllerClient.submitJob(JobControllerClient.java:404)
[info]   at ai.chronon.integrations.cloud_gcp.DataprocSubmitter.submit(DataprocSubmitter.scala:70)
[info]   at ai.chronon.integrations.cloud_gcp.test.DataprocSubmitterTest.$anonfun$new$4(DataprocSubmitterTest.scala:77)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] ...
[info] Cause: java.lang.NoSuchMethodError: 'io.opencensus.tags.TagContext io.opencensus.tags.unsafe.ContextUtils.getValue(io.grpc.Context)'
[info]   at io.opencensus.implcore.tags.CurrentTagMapUtils.getCurrentTagMap(CurrentTagMapUtils.java:37)
[info]   at io.opencensus.implcore.tags.TaggerImpl.getCurrentTagContext(TaggerImpl.java:51)
[info]   at io.opencensus.implcore.tags.TaggerImpl.getCurrentTagContext(TaggerImpl.java:31)
[info]   at io.grpc.census.CensusStatsModule$StatsClientInterceptor.interceptCall(CensusStatsModule.java:801)
[info]   at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info]   at com.google.api.gax.grpc.GrpcChannelUUIDInterceptor.interceptCall(GrpcChannelUUIDInterceptor.java:52)
[info]   at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info]   at com.google.api.gax.grpc.GrpcHeaderInterceptor.interceptCall(GrpcHeaderInterceptor.java:80)
[info]   at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info]   at com.google.api.gax.grpc.GrpcMetadataHandlerInterceptor.interceptCall(GrpcMetadataHandlerInterceptor.java:54)
[info] ...
```

I tried different opencensus versions and library-management tweaks, and @tchow-zlai pointed out that the conflict came from the spark-bigtable dependency. We don't need that dependency on the main path, since we use the BQ export data query to load data into BigTable; the Spark-BigTable connector is needed primarily for Docker quickstart testing. So in this PR I've yanked the dependency to re-enable DataProc submission, and reworked the load-data script and Dockerfile to download the spark-bigtable and slf4j JARs and pass them in only for the Spark2bigTableLoader app in the Docker setup.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **New Features**
  - Updated Docker image to include custom JAR files for BigTable and logging
  - Modified Spark configuration to use updated Bigtable dependencies
  - Enhanced data loading script with improved JAR file handling
- **Dependency Updates**
  - Replaced Spark-specific BigTable connector with general Bigtable HBase library
  - Updated Google Cloud library dependencies to latest versions
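As a rough sketch of the wiring the summary describes — custom JARs attached only when launching the BigTable loader in the Docker quickstart — the submission might look like the following. The JAR paths mirror the Dockerfile in this PR, but `submit_bt_loader` is a hypothetical helper and the exact flags are assumptions, not the actual load-data script.

```shell
#!/usr/bin/env bash
# Illustrative only: assemble the spark-submit invocation for the BigTable
# loader, attaching the JARs baked into the Docker image at /opt/custom-jars.
submit_bt_loader() {
  local jars="/opt/custom-jars/spark-bigtable_2.12-0.2.1.jar"
  jars+=",/opt/custom-jars/log4j-slf4j-impl-2.20.0.jar"
  # Echo the command rather than exec it, so the wiring is visible
  # without a Spark installation; remaining args (class, app jar) pass through.
  echo spark-submit \
    --jars "$jars" \
    --driver-class-path "/opt/custom-jars/*" \
    "$@"
}
```

Regular DataProc submissions would omit these `--jars`, which is what keeps the opencensus conflict out of the main path.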