
Commit 634ccef

Drop Spark BigTable version to unlock DataProc submission (#167)
## Summary

While testing the Flink side of things, I noticed that our Dataproc submission was broken due to opencensus version mismatches:

```
[info] com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: 'io.opencensus.tags.TagContext io.opencensus.tags.unsafe.ContextUtils.getValue(io.grpc.Context)'
[info] at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1387)
[info] at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1380)
[info] at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:53)
[info] at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
[info] at com.google.cloud.dataproc.v1.JobControllerClient.submitJob(JobControllerClient.java:435)
[info] at com.google.cloud.dataproc.v1.JobControllerClient.submitJob(JobControllerClient.java:404)
[info] at ai.chronon.integrations.cloud_gcp.DataprocSubmitter.submit(DataprocSubmitter.scala:70)
[info] at ai.chronon.integrations.cloud_gcp.test.DataprocSubmitterTest.$anonfun$new$4(DataprocSubmitterTest.scala:77)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] ...
[info] Cause: java.lang.NoSuchMethodError: 'io.opencensus.tags.TagContext io.opencensus.tags.unsafe.ContextUtils.getValue(io.grpc.Context)'
[info] at io.opencensus.implcore.tags.CurrentTagMapUtils.getCurrentTagMap(CurrentTagMapUtils.java:37)
[info] at io.opencensus.implcore.tags.TaggerImpl.getCurrentTagContext(TaggerImpl.java:51)
[info] at io.opencensus.implcore.tags.TaggerImpl.getCurrentTagContext(TaggerImpl.java:31)
[info] at io.grpc.census.CensusStatsModule$StatsClientInterceptor.interceptCall(CensusStatsModule.java:801)
[info] at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info] at com.google.api.gax.grpc.GrpcChannelUUIDInterceptor.interceptCall(GrpcChannelUUIDInterceptor.java:52)
[info] at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info] at com.google.api.gax.grpc.GrpcHeaderInterceptor.interceptCall(GrpcHeaderInterceptor.java:80)
[info] at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info] at com.google.api.gax.grpc.GrpcMetadataHandlerInterceptor.interceptCall(GrpcMetadataHandlerInterceptor.java:54)
[info] ...
```

I tried different opencensus versions and library-management overrides, and @tchow-zlai pointed out that the mismatch comes from the spark-bigtable dependency. We don't need this dependency on the main path, since we use the BigQuery `EXPORT DATA` query to load data into BigTable; the Spark-to-BigTable connector is required primarily for Docker quickstart testing. So in this PR, I've removed the dependency to re-enable Dataproc submission, and reworked the load-data script and Dockerfile to download the spark-bigtable and slf4j jars and pass them in only for the `Spark2BigTableLoader` app in the Docker setup.
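For anyone chasing a similar `NoSuchMethodError`, a quick way to find which dependency drags in the conflicting opencensus artifacts is sbt's built-in `dependencyTree` task. A minimal sketch, assuming sbt 1.4+ (where the task ships with the bundled `MiniDependencyTreePlugin`) and the `cloud_gcp` module name from `build.sbt` below:

```bash
# Dump the resolved dependency tree for the cloud_gcp module, then look
# for competing io.opencensus versions pulled in transitively.
sbt "cloud_gcp / dependencyTree" > dep-tree.txt
grep -n "io.opencensus" dep-tree.txt
```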
## Checklist

- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **New Features**
  - Updated Docker image to include custom JAR files for BigTable and logging
  - Modified Spark configuration to use updated Bigtable dependencies
  - Enhanced data loading script with improved JAR file handling
- **Dependency Updates**
  - Replaced Spark-specific BigTable connector with general Bigtable HBase library
  - Updated Google Cloud library dependencies to latest versions
1 parent a4f162c commit 634ccef

File tree

- Dockerfile
- build.sbt
- quickstart/cloud_gcp/scripts/load_data.sh

3 files changed: 10 additions, 2 deletions

Dockerfile

Lines changed: 7 additions & 0 deletions
```diff
@@ -57,6 +57,13 @@ RUN curl https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SP
     && tar xvzf spark.tgz --directory /opt/spark --strip-components 1 \
     && rm -rf spark.tgz
 
+# Add some additional custom jars for other connectors like BigTable etc
+RUN mkdir -p /opt/custom-jars && \
+    curl -L "https://repo1.maven.org/maven2/com/google/cloud/spark/bigtable/spark-bigtable_2.12/0.2.1/spark-bigtable_2.12-0.2.1.jar" \
+    -o /opt/custom-jars/spark-bigtable_2.12-0.2.1.jar && \
+    curl -L "https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-slf4j-impl/2.20.0/log4j-slf4j-impl-2.20.0.jar" \
+    -o /opt/custom-jars/log4j-slf4j-impl-2.20.0.jar
+
 # Install python deps
 COPY quickstart/requirements.txt .
 RUN pip3 install -r requirements.txt
```
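After building the image, it's easy to confirm the jars landed where `load_data.sh` expects them. A small sketch, with `chronon-quickstart` as a placeholder tag rather than the project's actual one:

```bash
# Build the quickstart image, then list the downloaded connector jars;
# the "chronon-quickstart" tag is a placeholder for illustration only.
docker build -t chronon-quickstart .
docker run --rm --entrypoint ls chronon-quickstart -l /opt/custom-jars
```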

build.sbt

Lines changed: 0 additions & 1 deletion
```diff
@@ -217,7 +217,6 @@ lazy val cloud_gcp = project
   libraryDependencies += "com.google.cloud.bigdataoss" % "gcsio" % "3.0.3", // need it for https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageFileSystem.java
   libraryDependencies += "io.circe" %% "circe-yaml" % "1.15.0",
   libraryDependencies += "com.google.cloud.spark" %% s"spark-bigquery-with-dependencies" % "0.41.0",
-  libraryDependencies += "com.google.cloud.spark.bigtable" %% "spark-bigtable" % "0.2.1",
   libraryDependencies += "com.google.cloud.bigtable" % "bigtable-hbase-2.x" % "2.14.2",
   libraryDependencies ++= circe,
   libraryDependencies ++= avro,
```
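If the connector ever needs to come back, an alternative to dropping it outright would be excluding its conflicting transitive artifacts; either way, sbt's built-in `evicted` task is a cheap way to check that the module resolves cleanly after a change like this. A sketch, again assuming the `cloud_gcp` module name:

```bash
# List dependencies that were evicted while resolving cloud_gcp; after
# dropping spark-bigtable there should be no competing io.opencensus pulls.
sbt "cloud_gcp / evicted"
```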

quickstart/cloud_gcp/scripts/load_data.sh

Lines changed: 3 additions & 1 deletion
```diff
@@ -33,7 +33,9 @@ echo "GroupBy upload batch jobs completed successfully!"
 
 echo "Uploading tables to KV Store"
 for dataset in purchases returns; do
-  if ! spark-submit --driver-class-path "$CLASSPATH" --class ai.chronon.integrations.cloud_gcp.Spark2BigTableLoader \
+  if ! spark-submit --driver-class-path "$CLASSPATH:/opt/custom-jars/*" \
+    --jars "/opt/custom-jars/spark-bigtable_2.12-0.2.1.jar,/opt/custom-jars/log4j-slf4j-impl-2.20.0.jar" \
+    --class ai.chronon.integrations.cloud_gcp.Spark2BigTableLoader \
     --master local[*] $CLOUD_GCP_JAR --table-name default.quickstart_${dataset}_v1_upload --dataset quickstart.${dataset}.v1 \
     --end-ds 2023-11-30 --project-id $GCP_PROJECT_ID --instance-id $GCP_INSTANCE_ID; then
     echo "Error: Failed to upload table to KV Store" >&2
```
