Commit eab58bd
Temporarily set materializationProject and materializationDataset to get bq connector to create temp tables. (#222)
## Summary

https://app.asana.com/0/1208949807589885/1209143482009694

Debugging this during the early Etsy integration:

```
Caused by: java.lang.IllegalArgumentException: Provided dataset is null or empty
	at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.TableId.<init>(TableId.java:73)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.TableId.of(TableId.java:82)
	at com.google.cloud.bigquery.connector.common.BigQueryClient.createTempTableId(BigQueryClient.java:263)
	at com.google.cloud.bigquery.connector.common.BigQueryClient.createTempTable(BigQueryClient.java:229)
	at com.google.cloud.bigquery.connector.common.BigQueryClient.createTempTableAfterCheckingSchema(BigQueryClient.java:253)
	at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.java:142)
	... 66 more
```

I see that this error occurs at these [lines in the open-source connector code](https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/bigquery-connector-common/src/main/java/com/google/cloud/bigquery/connector/common/BigQueryClient.java#L258-L260):

```java
public TableId createTempTableId(TableId destinationTableId) {
  String tempProject = materializationProject.orElseGet(destinationTableId::getProject);
  String tempDataset = materializationDataset.orElseGet(destinationTableId::getDataset);
  String tableName = destinationTableId.getTable() + System.nanoTime();
  TableId tempTableId =
      tempProject == null
          ? TableId.of(tempDataset, tableName)
          : TableId.of(tempProject, tempDataset, tableName);
  return tempTableId;
}
```

My hunch is that we're hitting this error because `destinationTableId::getDataset` returns nothing, and `materializationDataset` is a connector property we never set, since it's really meant for views (I think). Just a theory.
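To illustrate the theory, here's a minimal sketch of the fallback chain in `createTempTableId` using simplified stand-ins (these are hypothetical helpers, not the actual `BigQueryClient`/`TableId` classes): when `materializationDataset` is unset *and* the destination `TableId` carries no dataset, the resolved temp dataset is null, which would trip the `Preconditions.checkArgument` seen in the stack trace.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class TempTableIdSketch {
    // Simplified stand-in for the connector's dataset fallback:
    // use materializationDataset if set, else fall back to the
    // destination table's dataset (which can be null for us).
    static String resolveTempDataset(Optional<String> materializationDataset,
                                     Supplier<String> destinationDataset) {
        return materializationDataset.orElseGet(destinationDataset);
    }

    public static void main(String[] args) {
        // materializationDataset unset AND destination has no dataset:
        // resolves to null -> "Provided dataset is null or empty"
        System.out.println(resolveTempDataset(Optional.empty(), () -> null));

        // Setting materializationDataset short-circuits the fallback.
        System.out.println(resolveTempDataset(Optional.of("data"), () -> null));
    }
}
```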
To get past this weird error, I know we can actually set the connector properties `materializationProject` and `materializationDataset` (see the [connector properties docs](https://github.com/GoogleCloudDataproc/spark-bigquery-connector?tab=readme-ov-file#properties)), so I did so in additional-confs-dev.yaml and reran the GroupBy backfill job with it ([Dataproc job](https://console.cloud.google.com/dataproc/jobs/b7ebcff7-007f-43e4-9979-211803e9c700/configuration?region=us-central1&inv=1&invt=AbmqZQ&project=canary-443022)):

```
Writing to BigQuery. options: Map(project -> canary-443022, writeMethod -> indirect, spark.sql.sources.partitionOverwriteMode -> DYNAMIC, partitionField -> ds, materializationDataset -> data, dataset -> data, materializationProject -> canary-443022, temporaryGcsBucket -> zl-warehouse)
```

With this in place, I'm able to consistently get past the "Provided dataset is null or empty" error and "massage" the connector into just creating the temp table.

## Checklist

- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

- **Configuration Updates**
  - Added new Google Cloud Storage configuration properties for Chronon integration
  - Specified output dataset and project details for data materialization

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
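The resulting option map passed to the connector can be sketched as follows. This is a hypothetical builder mirroring the options in the log line above, not the actual Chronon `GcpFormatProvider` code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConnectorOptionsSketch {
    // Hypothetical helper reproducing the write options logged by the
    // backfill job; values here come from the dev run in the summary.
    static Map<String, String> bigQueryWriteOptions(String project, String dataset,
                                                    String tempBucket) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("writeMethod", "indirect");
        opts.put("temporaryGcsBucket", tempBucket);
        // The workaround: point the connector at an explicit project/dataset
        // for its temp-table materialization instead of letting it fall back
        // to the (empty) destination TableId.
        opts.put("materializationProject", project);
        opts.put("materializationDataset", dataset);
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(bigQueryWriteOptions("canary-443022", "data", "zl-warehouse"));
    }
}
```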
1 parent 67b2748 commit eab58bd

2 files changed: +6 −2 lines changed
additional-confs-dev.yaml

Lines changed: 3 additions & 1 deletion

```diff
@@ -1,3 +1,5 @@
 spark.chronon.table.format_provider.class: "ai.chronon.integrations.cloud_gcp.GcpFormatProvider"
 spark.chronon.partition.format: "yyyy-MM-dd"
-spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"
+spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"
+spark.chronon.table.gcs.connector_output_dataset: "data"
+spark.chronon.table.gcs.connector_output_project: "canary-443022"
```

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpFormatProvider.scala

Lines changed: 3 additions & 1 deletion

```diff
@@ -43,7 +43,9 @@ case class GcpFormatProvider(sparkSession: SparkSession) extends FormatProvider
     val sparkOptions: Map[String, String] = Map(
       // todo(tchow): No longer needed after https://github.com/GoogleCloudDataproc/spark-bigquery-connector/pull/1320
       "temporaryGcsBucket" -> sparkSession.conf.get("spark.chronon.table.gcs.temporary_gcs_bucket"),
-      "writeMethod" -> "indirect"
+      "writeMethod" -> "indirect",
+      "materializationProject" -> sparkSession.conf.get("spark.chronon.table.gcs.connector_output_project"),
+      "materializationDataset" -> sparkSession.conf.get("spark.chronon.table.gcs.connector_output_dataset")
     ) ++ partitionColumnOption

     BigQueryFormat(tableId.getProject, sparkOptions)
```
