
Commit 634ccef

Drop Spark BigTable version to unlock DataProc submission (#167)
## Summary

While testing the Flink side of things, I noticed that our Dataproc submission was broken due to opencensus version mismatches:

```
[info] com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: 'io.opencensus.tags.TagContext io.opencensus.tags.unsafe.ContextUtils.getValue(io.grpc.Context)'
[info] at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1387)
[info] at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1380)
[info] at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:53)
[info] at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
[info] at com.google.cloud.dataproc.v1.JobControllerClient.submitJob(JobControllerClient.java:435)
[info] at com.google.cloud.dataproc.v1.JobControllerClient.submitJob(JobControllerClient.java:404)
[info] at ai.chronon.integrations.cloud_gcp.DataprocSubmitter.submit(DataprocSubmitter.scala:70)
[info] at ai.chronon.integrations.cloud_gcp.test.DataprocSubmitterTest.$anonfun$new$4(DataprocSubmitterTest.scala:77)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] ...
[info] Cause: java.lang.NoSuchMethodError: 'io.opencensus.tags.TagContext io.opencensus.tags.unsafe.ContextUtils.getValue(io.grpc.Context)'
[info] at io.opencensus.implcore.tags.CurrentTagMapUtils.getCurrentTagMap(CurrentTagMapUtils.java:37)
[info] at io.opencensus.implcore.tags.TaggerImpl.getCurrentTagContext(TaggerImpl.java:51)
[info] at io.opencensus.implcore.tags.TaggerImpl.getCurrentTagContext(TaggerImpl.java:31)
[info] at io.grpc.census.CensusStatsModule$StatsClientInterceptor.interceptCall(CensusStatsModule.java:801)
[info] at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info] at com.google.api.gax.grpc.GrpcChannelUUIDInterceptor.interceptCall(GrpcChannelUUIDInterceptor.java:52)
[info] at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info] at com.google.api.gax.grpc.GrpcHeaderInterceptor.interceptCall(GrpcHeaderInterceptor.java:80)
[info] at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:156)
[info] at com.google.api.gax.grpc.GrpcMetadataHandlerInterceptor.interceptCall(GrpcMetadataHandlerInterceptor.java:54)
[info] ...
```

I tried different opencensus versions and library-management overrides, and @tchow-zlai pointed out that the mismatch comes from the spark-bigtable dependency. We don't need this dependency on the main path, since we use the BigQuery `EXPORT DATA` query to load data into BigTable; the Spark-to-BigTable connector is required primarily for Docker quickstart testing. So in this PR, I've removed the dependency to re-enable Dataproc submission, and reworked the load-data script and Dockerfile to download the spark-bigtable and slf4j jars and pass them in only for the `Spark2BigTableLoader` app in the Docker setup.
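For anyone chasing a similar `NoSuchMethodError`, a quick way to find which dependency drags in the conflicting opencensus artifacts is sbt's built-in `dependencyTree` task. A minimal sketch, assuming sbt 1.4+ (where the task ships with the bundled `MiniDependencyTreePlugin`) and the `cloud_gcp` module name from `build.sbt` below:

```bash
# Dump the resolved dependency tree for the cloud_gcp module, then look
# for competing io.opencensus versions pulled in transitively.
sbt "cloud_gcp / dependencyTree" > dep-tree.txt
grep -n "io.opencensus" dep-tree.txt
```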
## Checklist

- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **New Features**
  - Updated Docker image to include custom JAR files for BigTable and logging
  - Modified Spark configuration to use updated Bigtable dependencies
  - Enhanced data loading script with improved JAR file handling
- **Dependency Updates**
  - Replaced Spark-specific BigTable connector with general Bigtable HBase library
  - Updated Google Cloud library dependencies to latest versions
1 parent a4f162c commit 634ccef

File tree

- Dockerfile
- build.sbt
- quickstart/cloud_gcp/scripts/load_data.sh

3 files changed: 10 additions, 2 deletions

Dockerfile

Lines changed: 7 additions & 0 deletions
```diff
@@ -57,6 +57,13 @@ RUN curl https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SP
     && tar xvzf spark.tgz --directory /opt/spark --strip-components 1 \
     && rm -rf spark.tgz
 
+# Add some additional custom jars for other connectors like BigTable etc
+RUN mkdir -p /opt/custom-jars && \
+    curl -L "https://repo1.maven.org/maven2/com/google/cloud/spark/bigtable/spark-bigtable_2.12/0.2.1/spark-bigtable_2.12-0.2.1.jar" \
+    -o /opt/custom-jars/spark-bigtable_2.12-0.2.1.jar && \
+    curl -L "https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-slf4j-impl/2.20.0/log4j-slf4j-impl-2.20.0.jar" \
+    -o /opt/custom-jars/log4j-slf4j-impl-2.20.0.jar
+
 # Install python deps
 COPY quickstart/requirements.txt .
 RUN pip3 install -r requirements.txt
```
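After building the image, it's easy to confirm the jars landed where `load_data.sh` expects them. A small sketch, with `chronon-quickstart` as a placeholder tag rather than the project's actual one:

```bash
# Build the quickstart image, then list the downloaded connector jars;
# the "chronon-quickstart" tag is a placeholder for illustration only.
docker build -t chronon-quickstart .
docker run --rm --entrypoint ls chronon-quickstart -l /opt/custom-jars
```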

build.sbt

Lines changed: 0 additions & 1 deletion
```diff
@@ -217,7 +217,6 @@ lazy val cloud_gcp = project
   libraryDependencies += "com.google.cloud.bigdataoss" % "gcsio" % "3.0.3", // need it for https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageFileSystem.java
   libraryDependencies += "io.circe" %% "circe-yaml" % "1.15.0",
   libraryDependencies += "com.google.cloud.spark" %% s"spark-bigquery-with-dependencies" % "0.41.0",
-  libraryDependencies += "com.google.cloud.spark.bigtable" %% "spark-bigtable" % "0.2.1",
   libraryDependencies += "com.google.cloud.bigtable" % "bigtable-hbase-2.x" % "2.14.2",
   libraryDependencies ++= circe,
   libraryDependencies ++= avro,
```
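If the connector ever needs to come back, an alternative to dropping it outright would be excluding its conflicting transitive artifacts; either way, sbt's built-in `evicted` task is a cheap way to check that the module resolves cleanly after a change like this. A sketch, again assuming the `cloud_gcp` module name:

```bash
# List dependencies that were evicted while resolving cloud_gcp; after
# dropping spark-bigtable there should be no competing io.opencensus pulls.
sbt "cloud_gcp / evicted"
```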

quickstart/cloud_gcp/scripts/load_data.sh

Lines changed: 3 additions & 1 deletion
```diff
@@ -33,7 +33,9 @@ echo "GroupBy upload batch jobs completed successfully!"
 
 echo "Uploading tables to KV Store"
 for dataset in purchases returns; do
-  if ! spark-submit --driver-class-path "$CLASSPATH" --class ai.chronon.integrations.cloud_gcp.Spark2BigTableLoader \
+  if ! spark-submit --driver-class-path "$CLASSPATH:/opt/custom-jars/*" \
+    --jars "/opt/custom-jars/spark-bigtable_2.12-0.2.1.jar,/opt/custom-jars/log4j-slf4j-impl-2.20.0.jar" \
+    --class ai.chronon.integrations.cloud_gcp.Spark2BigTableLoader \
     --master local[*] $CLOUD_GCP_JAR --table-name default.quickstart_${dataset}_v1_upload --dataset quickstart.${dataset}.v1 \
     --end-ds 2023-11-30 --project-id $GCP_PROJECT_ID --instance-id $GCP_INSTANCE_ID; then
     echo "Error: Failed to upload table to KV Store" >&2
```
