Commit eab58bd
Temporarily set materializationProject and materializationDataset to get bq connector to create temp tables. (#222)
## Summary

https://app.asana.com/0/1208949807589885/1209143482009694

Debugging this during the early Etsy integration:

```
Caused by: java.lang.IllegalArgumentException: Provided dataset is null or empty
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.TableId.<init>(TableId.java:73)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.TableId.of(TableId.java:82)
	at com.google.cloud.bigquery.connector.common.BigQueryClient.createTempTableId(BigQueryClient.java:263)
	at com.google.cloud.bigquery.connector.common.BigQueryClient.createTempTable(BigQueryClient.java:229)
	at com.google.cloud.bigquery.connector.common.BigQueryClient.createTempTableAfterCheckingSchema(BigQueryClient.java:253)
	at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.java:142)
	... 66 more
```

I see that this error occurs at these [lines in the open-source connector code](https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/bigquery-connector-common/src/main/java/com/google/cloud/bigquery/connector/common/BigQueryClient.java#L258-L260):

```java
public TableId createTempTableId(TableId destinationTableId) {
  String tempProject = materializationProject.orElseGet(destinationTableId::getProject);
  String tempDataset = materializationDataset.orElseGet(destinationTableId::getDataset);
  String tableName = destinationTableId.getTable() + System.nanoTime();
  TableId tempTableId =
      tempProject == null
          ? TableId.of(tempDataset, tableName)
          : TableId.of(tempProject, tempDataset, tableName);
  return tempTableId;
}
```

My hunch is that we're hitting this error because `destinationTableId::getDataset` returns nothing, and `materializationDataset` is a connector property we never set, since it's really meant for views (I think). Just a theory.
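To illustrate the theory, here's a minimal sketch of the fallback chain in `createTempTableId` using simplified stand-ins (these are hypothetical helpers, not the actual `BigQueryClient`/`TableId` classes): when `materializationDataset` is unset *and* the destination `TableId` carries no dataset, the resolved temp dataset is null, which would trip the `Preconditions.checkArgument` seen in the stack trace.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class TempTableIdSketch {
    // Simplified stand-in for the connector's dataset fallback:
    // use materializationDataset if set, else fall back to the
    // destination table's dataset (which can be null for us).
    static String resolveTempDataset(Optional<String> materializationDataset,
                                     Supplier<String> destinationDataset) {
        return materializationDataset.orElseGet(destinationDataset);
    }

    public static void main(String[] args) {
        // materializationDataset unset AND destination has no dataset:
        // resolves to null -> "Provided dataset is null or empty"
        System.out.println(resolveTempDataset(Optional.empty(), () -> null));

        // Setting materializationDataset short-circuits the fallback.
        System.out.println(resolveTempDataset(Optional.of("data"), () -> null));
    }
}
```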
To get past this weird error, I know we can actually set the connector properties `materializationProject` and `materializationDataset` (see the [connector properties docs](https://github.com/GoogleCloudDataproc/spark-bigquery-connector?tab=readme-ov-file#properties)), so I did so in additional-confs-dev.yaml and reran the GroupBy backfill job with it ([Dataproc job](https://console.cloud.google.com/dataproc/jobs/b7ebcff7-007f-43e4-9979-211803e9c700/configuration?region=us-central1&inv=1&invt=AbmqZQ&project=canary-443022)):

```
Writing to BigQuery. options: Map(project -> canary-443022, writeMethod -> indirect, spark.sql.sources.partitionOverwriteMode -> DYNAMIC, partitionField -> ds, materializationDataset -> data, dataset -> data, materializationProject -> canary-443022, temporaryGcsBucket -> zl-warehouse)
```

With this in place, I'm able to consistently get past the "Provided dataset is null or empty" error and "massage" the connector into just creating the temp table.

## Checklist

- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

- **Configuration Updates**
  - Added new Google Cloud Storage configuration properties for Chronon integration
  - Specified output dataset and project details for data materialization

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
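The resulting option map passed to the connector can be sketched as follows. This is a hypothetical builder mirroring the options in the log line above, not the actual Chronon `GcpFormatProvider` code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConnectorOptionsSketch {
    // Hypothetical helper reproducing the write options logged by the
    // backfill job; values here come from the dev run in the summary.
    static Map<String, String> bigQueryWriteOptions(String project, String dataset,
                                                    String tempBucket) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("writeMethod", "indirect");
        opts.put("temporaryGcsBucket", tempBucket);
        // The workaround: point the connector at an explicit project/dataset
        // for its temp-table materialization instead of letting it fall back
        // to the (empty) destination TableId.
        opts.put("materializationProject", project);
        opts.put("materializationDataset", dataset);
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(bigQueryWriteOptions("canary-443022", "data", "zl-warehouse"));
    }
}
```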
1 parent 67b2748 commit eab58bd

2 files changed: +6 −2 lines changed
additional-confs-dev.yaml

Lines changed: 3 additions & 1 deletion

```diff
@@ -1,3 +1,5 @@
 spark.chronon.table.format_provider.class: "ai.chronon.integrations.cloud_gcp.GcpFormatProvider"
 spark.chronon.partition.format: "yyyy-MM-dd"
-spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"
+spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"
+spark.chronon.table.gcs.connector_output_dataset: "data"
+spark.chronon.table.gcs.connector_output_project: "canary-443022"
```

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpFormatProvider.scala

Lines changed: 3 additions & 1 deletion

```diff
@@ -43,7 +43,9 @@ case class GcpFormatProvider(sparkSession: SparkSession) extends FormatProvider
     val sparkOptions: Map[String, String] = Map(
       // todo(tchow): No longer needed after https://github.com/GoogleCloudDataproc/spark-bigquery-connector/pull/1320
       "temporaryGcsBucket" -> sparkSession.conf.get("spark.chronon.table.gcs.temporary_gcs_bucket"),
-      "writeMethod" -> "indirect"
+      "writeMethod" -> "indirect",
+      "materializationProject" -> sparkSession.conf.get("spark.chronon.table.gcs.connector_output_project"),
+      "materializationDataset" -> sparkSession.conf.get("spark.chronon.table.gcs.connector_output_dataset")
     ) ++ partitionColumnOption

     BigQueryFormat(tableId.getProject, sparkOptions)
```
