
Commit 0aa2ec4

Drop Spark BigTable version to unlock DataProc submission (#167)
## Summary

While testing the Flink side of things, I noticed that our DataProc submission was broken due to an opencensus version mismatch:

```
[info] com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: 'io.opencensus.tags.TagContext io.opencensus.tags.unsafe.ContextUtils.getValue(io.grpc.Context)'
[info]   at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1387)
[info]   at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1380)
[info]   at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:53)
[info]   at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
[info]   at com.google.cloud.dataproc.v1.JobControllerClient.submitJob(JobControllerClient.java:435)
[info]   at com.google.cloud.dataproc.v1.JobControllerClient.submitJob(JobControllerClient.java:404)
[info]   at ai.chronon.integrations.cloud_gcp.DataprocSubmitter.submit(DataprocSubmitter.scala:70)
[info]   at ai.chronon.integrations.cloud_gcp.test.DataprocSubmitterTest.$anonfun$new$4(DataprocSubmitterTest.scala:77)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   ...
[info] Cause: java.lang.NoSuchMethodError: 'io.opencensus.tags.TagContext io.opencensus.tags.unsafe.ContextUtils.getValue(io.grpc.Context)'
[info]   at io.opencensus.implcore.tags.CurrentTagMapUtils.getCurrentTagMap(CurrentTagMapUtils.java:37)
[info]   at io.opencensus.implcore.tags.TaggerImpl.getCurrentTagContext(TaggerImpl.java:51)
[info]   at io.opencensus.implcore.tags.TaggerImpl.getCurrentTagContext(TaggerImpl.java:31)
[info]   at io.grpc.census.CensusStatsModule$StatsClientInterceptor.interceptCall(CensusStatsModule.java:801)
[info]   at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info]   at com.google.api.gax.grpc.GrpcChannelUUIDInterceptor.interceptCall(GrpcChannelUUIDInterceptor.java:52)
[info]   at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info]   at com.google.api.gax.grpc.GrpcHeaderInterceptor.interceptCall(GrpcHeaderInterceptor.java:80)
[info]   at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info]   at com.google.api.gax.grpc.GrpcMetadataHandlerInterceptor.interceptCall(GrpcMetadataHandlerInterceptor.java:54)
[info]   ...
```

I tried playing with different opencensus versions and library-management tweaks, and @tchow-zlai pointed out that the conflict comes from the spark-bigtable dependency. We don't need that dependency on the main path, since we use BigQuery's export data query to load data into BigTable; the Spark-BigTable connector is required primarily for Docker quickstart testing. So in this PR I've yanked the dependency to re-enable DataProc submission, and reworked the load-data script and Dockerfile to download the spark-bigtable and slf4j jars and pass them in only for the Spark2BigTableLoader app in the Docker setup.
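The Dockerfile changes themselves are not part of the two-file view below. For context, here is a minimal sketch of the download step described above, assuming the jars are fetched from the standard Maven Central layout into the `/opt/custom-jars` directory that the script references:

```bash
# Sketch only: fetch the connector and logging jars the quickstart needs.
# The URLs assume standard Maven Central paths for these coordinates.
mkdir -p /opt/custom-jars
curl -fSL -o /opt/custom-jars/spark-bigtable_2.12-0.2.1.jar \
  "https://repo1.maven.org/maven2/com/google/cloud/spark/bigtable/spark-bigtable_2.12/0.2.1/spark-bigtable_2.12-0.2.1.jar"
curl -fSL -o /opt/custom-jars/log4j-slf4j-impl-2.20.0.jar \
  "https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-slf4j-impl/2.20.0/log4j-slf4j-impl-2.20.0.jar"
```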
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [x] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **New Features**
  - Updated the Docker image to include custom JAR files for BigTable and logging
  - Modified the Spark configuration to use the updated Bigtable dependencies
  - Enhanced the data-loading script with improved JAR file handling
- **Dependency Updates**
  - Replaced the Spark-specific BigTable connector with the general Bigtable HBase library
  - Updated Google Cloud library dependencies to the latest versions
1 parent 9faefb1 commit 0aa2ec4

File tree

2 files changed (+3 −2 lines)


build.sbt

Lines changed: 0 additions & 1 deletion

```diff
@@ -217,7 +217,6 @@ lazy val cloud_gcp = project
     libraryDependencies += "com.google.cloud.bigdataoss" % "gcsio" % "3.0.3", // need it for https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageFileSystem.java
     libraryDependencies += "io.circe" %% "circe-yaml" % "1.15.0",
     libraryDependencies += "com.google.cloud.spark" %% s"spark-bigquery-with-dependencies" % "0.41.0",
-    libraryDependencies += "com.google.cloud.spark.bigtable" %% "spark-bigtable" % "0.2.1",
     libraryDependencies += "com.google.cloud.bigtable" % "bigtable-hbase-2.x" % "2.14.2",
     libraryDependencies ++= circe,
     libraryDependencies ++= avro,
```
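For anyone chasing a similar `NoSuchMethodError`, sbt's built-in dependency reports are one way to find which dependency drags in the conflicting opencensus artifacts. A sketch, assuming the `cloud_gcp` module name from the excerpt above and sbt 1.4+:

```bash
# List version conflicts that sbt resolved by evicting older artifacts.
sbt "cloud_gcp/evicted"

# Walk the full dependency graph and look for opencensus entries.
sbt "cloud_gcp/dependencyTree" | grep -i opencensus
```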

quickstart/cloud_gcp/scripts/load_data.sh

Lines changed: 3 additions & 1 deletion

```diff
@@ -33,7 +33,9 @@ echo "GroupBy upload batch jobs completed successfully!"
 
 echo "Uploading tables to KV Store"
 for dataset in purchases returns; do
-  if ! spark-submit --driver-class-path "$CLASSPATH" --class ai.chronon.integrations.cloud_gcp.Spark2BigTableLoader \
+  if ! spark-submit --driver-class-path "$CLASSPATH:/opt/custom-jars/*" \
+    --jars "/opt/custom-jars/spark-bigtable_2.12-0.2.1.jar,/opt/custom-jars/log4j-slf4j-impl-2.20.0.jar" \
+    --class ai.chronon.integrations.cloud_gcp.Spark2BigTableLoader \
     --master local[*] $CLOUD_GCP_JAR --table-name default.quickstart_${dataset}_v1_upload --dataset quickstart.${dataset}.v1 \
     --end-ds 2023-11-30 --project-id $GCP_PROJECT_ID --instance-id $GCP_INSTANCE_ID; then
     echo "Error: Failed to upload table to KV Store" >&2
```
