Temporarily set materializationProject and materializationDataset to get bq connector to create temp tables. #222
Conversation
Walkthrough

The pull request introduces configuration enhancements for Google Cloud Storage (GCS) integration in the Chronon project. Two new Spark configuration properties have been added to specify the output dataset and project for GCS connector operations. The changes are minimal but provide more granular control over GCS-related data materialization settings.
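For context, a minimal sketch of how these confs might be supplied when the Spark session is built. The conf keys are the ones this PR introduces and reads in GcpFormatProvider; the app name and the project/dataset/bucket values are placeholders, and in the PR itself the values are set through the additional-confs YAML files rather than in code:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: the PR sets these keys in additional-confs yaml files;
// setting them on the session builder (or via spark-submit --conf) is an
// equivalent way to make them visible to sparkSession.conf.get.
val spark = SparkSession
  .builder()
  .appName("chronon-groupby-backfill") // placeholder app name
  .config("spark.chronon.table.gcs.connector_output_project", "my-gcp-project") // placeholder
  .config("spark.chronon.table.gcs.connector_output_dataset", "my_dataset")     // placeholder
  .config("spark.chronon.table.gcs.temporary_gcs_bucket", "my-temp-bucket")     // placeholder
  .getOrCreate()
```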
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpFormatProvider.scala

```diff
@@ -43,7 +43,9 @@ case class GcpFormatProvider(sparkSession: SparkSession) extends FormatProvider
     val sparkOptions: Map[String, String] = Map(
       // todo(tchow): No longer needed after https://github.com/GoogleCloudDataproc/spark-bigquery-connector/pull/1320
       "temporaryGcsBucket" -> sparkSession.conf.get("spark.chronon.table.gcs.temporary_gcs_bucket"),
-      "writeMethod" -> "indirect"
+      "writeMethod" -> "indirect",
+      "materializationProject" -> sparkSession.conf.get("spark.chronon.table.gcs.connector_output_project"),
```
thank you sir!
Actionable comments posted: 2
🧹 Nitpick comments (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpFormatProvider.scala (1)
46-48: Mark code as temporary with TODO comment.

Add a TODO comment indicating this is a temporary fix pending proper connector update.

```diff
 val sparkOptions: Map[String, String] = Map(
   "temporaryGcsBucket" -> sparkSession.conf.get("spark.chronon.table.gcs.temporary_gcs_bucket"),
   "writeMethod" -> "indirect",
+  // TODO: Temporary fix until BigQuery connector is updated to handle temp tables properly
   "materializationProject" -> sparkSession.conf.get("spark.chronon.table.gcs.connector_output_project"),
   "materializationDataset" -> sparkSession.conf.get("spark.chronon.table.gcs.connector_output_dataset")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (2)
- cloud_gcp/src/main/resources/additional-confs.yaml (1 hunks)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpFormatProvider.scala (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpFormatProvider.scala (1)
46-48: Add config validation and error handling.

Missing configurations could cause runtime failures. Add checks for required configs.
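For illustration, one hedged sketch of such a check inside GcpFormatProvider. The `requiredConf` helper and its error message are not part of the PR, and `sparkSession` is assumed to be in scope, as it is in the provider:

```scala
// Sketch only, not part of the PR: resolve each required conf up front and
// fail with a descriptive message instead of a later connector-side error.
def requiredConf(key: String): String =
  sparkSession.conf
    .getOption(key)
    .getOrElse(throw new IllegalArgumentException(s"Missing required Spark conf: $key"))

val sparkOptions: Map[String, String] = Map(
  "temporaryGcsBucket" -> requiredConf("spark.chronon.table.gcs.temporary_gcs_bucket"),
  "writeMethod" -> "indirect",
  "materializationProject" -> requiredConf("spark.chronon.table.gcs.connector_output_project"),
  "materializationDataset" -> requiredConf("spark.chronon.table.gcs.connector_output_dataset")
)
```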
spark.chronon.table.gcs.connector_output_project: "canary-443022"
Use environment variables for sensitive values.
Hard-coded project ID poses security risk.
```diff
-spark.chronon.table.gcs.connector_output_dataset: "data"
-spark.chronon.table.gcs.connector_output_project: "canary-443022"
+spark.chronon.table.gcs.connector_output_dataset: ${CHRONON_BQ_DATASET}
+spark.chronon.table.gcs.connector_output_project: ${CHRONON_BQ_PROJECT}
```
relax homie
@david-zlai Understood! I'll back off on the security suggestion. Let me know if you need any other assistance with the PR.
✏️ Learnings added
Learnt from: david-zlai
PR: zipline-ai/chronon#222
File: cloud_gcp/src/main/resources/additional-confs.yaml:4-5
Timestamp: 2025-01-15T21:00:45.243Z
Learning: Security suggestions about using environment variables for project IDs and datasets in additional-confs.yaml were deemed non-critical by the team, as these values may be temporarily hardcoded for development purposes.
spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"
🛠️ Refactor suggestion
Move development-specific bucket to dev config.
Bucket `zl-warehouse` appears development-specific.

```diff
-spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"
```

Move to `additional-confs-dev.yaml`.
Committable suggestion skipped: line range outside the PR's diff.
nah
@david-zlai Understood, will keep the bucket configuration as is.
✏️ Learnings added
Learnt from: david-zlai
PR: zipline-ai/chronon#222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.327Z
Learning: The GCS bucket configuration `spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"` should remain in the main `additional-confs.yaml` file, not in dev-specific configs.
## Summary

https://app.asana.com/0/1208949807589885/1209143482009694

Debugging this during the early etsy integration:

```
Caused by: java.lang.IllegalArgumentException: Provided dataset is null or empty
    at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.TableId.<init>(TableId.java:73)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.TableId.of(TableId.java:82)
    at com.google.cloud.bigquery.connector.common.BigQueryClient.createTempTableId(BigQueryClient.java:263)
    at com.google.cloud.bigquery.connector.common.BigQueryClient.createTempTable(BigQueryClient.java:229)
    at com.google.cloud.bigquery.connector.common.BigQueryClient.createTempTableAfterCheckingSchema(BigQueryClient.java:253)
    at com.google.cloud.spark.bigquery.write.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.java:142)
    ... 66 more
```

I see that this error occurs at these [lines in the open source connector code](https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/bigquery-connector-common/src/main/java/com/google/cloud/bigquery/connector/common/BigQueryClient.java#L258-L260):

```java
public TableId createTempTableId(TableId destinationTableId) {
  String tempProject = materializationProject.orElseGet(destinationTableId::getProject);
  String tempDataset = materializationDataset.orElseGet(destinationTableId::getDataset);
  String tableName = destinationTableId.getTable() + System.nanoTime();
  TableId tempTableId =
      tempProject == null
          ? TableId.of(tempDataset, tableName)
          : TableId.of(tempProject, tempDataset, tableName);
  return tempTableId;
}
```

My hunch is that we hit this error because `destinationTableId::getDataset` returns nothing, and `materializationDataset` is a connector property we never set since it's really meant for views (I think). Just a theory.

To get past this error, we can set the connector properties `materializationProject` and `materializationDataset`: https://github.com/GoogleCloudDataproc/spark-bigquery-connector?tab=readme-ov-file#properties

So I did that in `additional-confs-dev.yaml` and reran the groupby backfill job with it ([dataproc job](https://console.cloud.google.com/dataproc/jobs/b7ebcff7-007f-43e4-9979-211803e9c700/configuration?region=us-central1&inv=1&invt=AbmqZQ&project=canary-443022)):

```
Writing to BigQuery. options: Map(project -> canary-443022, writeMethod -> indirect, spark.sql.sources.partitionOverwriteMode -> DYNAMIC, partitionField -> ds, materializationDataset -> data, dataset -> data, materializationProject -> canary-443022, temporaryGcsBucket -> zl-warehouse)
```

With this, I'm now able to consistently get past the "Provided dataset is null or empty" error and "massage" the connector into creating the temp table.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **Configuration Updates**
  - Added new Google Cloud Storage configuration properties for Chronon integration
  - Specified output dataset and project details for data materialization
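For reference, the connector options in the Dataproc log above correspond to a plain spark-bigquery-connector DataFrame write. This is a minimal sketch, not how Chronon invokes it (Chronon assembles the option map in GcpFormatProvider), and the DataFrame plus output table name are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: a standalone indirect write using the same options the backfill job
// logged above. `df` and the output table name are placeholders.
def writeToBigQuery(df: DataFrame): Unit =
  df.write
    .format("bigquery")
    .option("writeMethod", "indirect")
    .option("temporaryGcsBucket", "zl-warehouse")
    .option("materializationProject", "canary-443022")
    .option("materializationDataset", "data")
    .option("partitionField", "ds")
    .mode(SaveMode.Overwrite)
    .save("canary-443022.data.placeholder_output_table")
```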