feat: TableUtils to be compatible with DataPointer (part 1) (#158)
## Summary
- Fixing some earlier bugs: pattern matching on Java classes does not behave
the same as matching on Scala case classes (Java classes have no extractors,
so they must be matched by type rather than destructured).
- Adding a few more params to the GCS format: the source URI and the
`format` string (essentially the file format).
- Deleting some INFORMATION_SCHEMA queries from the GCS format; they are no
longer needed since we now use the BigQuery client.
- Adding code to handle Spark `InternalRow`s. We use a low-level
implementation to reach the `InMemoryFileIndex`, which contains the file
partitions. It yields `InternalRow`s, so we need to translate them to
external `Row`s, which requires the correct deserialization based on the
column types.
- Adding a couple of tests to `BigQueryCatalogTest`.
- Adding a `name` field to `Format`.
- Beginning to migrate some `TableUtils` methods to delegate to `DataPointer`.
- https://app.asana.com/0/1208949807589885/1208960391734329/f
- https://app.asana.com/0/1208949807589885/1208960391734331/f
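The `InternalRow` translation described above can be sketched roughly as
follows. This is a minimal illustration only, assuming a Spark 3.x version
where `RowEncoder.apply` is still available (3.0–3.4); the helper name is
hypothetical and not taken from this PR:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

object InternalRowSketch {
  // Hypothetical helper: InMemoryFileIndex partitions surface Catalyst
  // InternalRows; a deserializer built from the schema converts them into
  // external Rows with the correct per-column types.
  def toRows(schema: StructType, internal: Iterator[InternalRow]): Iterator[Row] = {
    val deserializer = RowEncoder(schema).resolveAndBind().createDeserializer()
    internal.map(deserializer)
  }
}
```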
## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Enhanced BigQuery and GCS format handling with improved table name
resolution and data source support.
- Updated Spark table utilities with more robust data loading and
management capabilities.
- Introduced new methods for resolving table names and handling data
formats.
- Added support for new dependencies related to Google Cloud Dataproc.
- Introduced unit tests for GCS format functionality.
- **Bug Fixes**
- Improved error handling for data source formats and table operations.
- Streamlined data pointer operations for better format compatibility.
- **Refactor**
- Simplified data loading and schema retrieval methods.
- Consolidated format handling logic in data source operations.
- Enhanced organization and clarity in data pointer handling.
- Cleaned up dependency declarations and project settings in build
configuration.
- Improved error handling and control flow in join computation
processes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- av pr metadata
This information is embedded by the av CLI when creating PRs to track
the status of stacks when using Aviator. Please do not delete or edit
this section of the PR.
```
{"parent":"main","parentHead":"","trunk":"main"}
```
-->
---------
Co-authored-by: Thomas Chow <[email protected]>
```diff
+libraryDependencies += "com.google.cloud.bigdataoss" % "gcsio" % "3.0.3", // need it for https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageFileSystem.java
+libraryDependencies += "com.google.cloud.bigdataoss" % "util-hadoop" % "3.0.0", // need it for https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/util-hadoop/src/main/java/com/google/cloud/hadoop/util/HadoopConfigurationProperty.java
```
```diff
+        else throw new IllegalStateException(s"Cannot support table of type: ${table.getDefinition}")
+      })
+      .getOrElse(Hive)
 
   /**
-  Using federation
-  val tableIdentifier = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
-  val tableMeta = sparkSession.sessionState.catalog.getTableRawMetadata(tableIdentifier)
-  val storageProvider = tableMeta.provider
-  storageProvider match {
-    case Some("com.google.cloud.spark.bigquery") => {
-      val tableProperties = tableMeta.properties
-      val project = tableProperties
-        .get("FEDERATION_BIGQUERY_TABLE_PROPERTY")
-        .map(BigQueryUtil.parseTableId)
-        .map(_.getProject)
-        .getOrElse(throw new IllegalStateException("bigquery project required!"))
-      val bigQueryTableType = tableProperties.get("federation.bigquery.table.type")
-      bigQueryTableType.map(_.toUpperCase) match {
-        case Some("EXTERNAL") => GCS(project)
-        case Some("MANAGED") => BQuery(project)
-        case None => throw new IllegalStateException("Dataproc federation service must be available.")
-      }
-    }
-
-    case Some("hive") | None => Hive
-  }
+   * Using federation
+   * val tableIdentifier = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
+   * val tableMeta = sparkSession.sessionState.catalog.getTableRawMetadata(tableIdentifier)
+   * val storageProvider = tableMeta.provider
+   * storageProvider match {
+   *   case Some("com.google.cloud.spark.bigquery") => {
+   *     val tableProperties = tableMeta.properties
+   *     val project = tableProperties
+   *       .get("FEDERATION_BIGQUERY_TABLE_PROPERTY")
+   *       .map(BigQueryUtil.parseTableId)
+   *       .map(_.getProject)
+   *       .getOrElse(throw new IllegalStateException("bigquery project required!"))
+   *     val bigQueryTableType = tableProperties.get("federation.bigquery.table.type")
+   *     bigQueryTableType.map(_.toUpperCase) match {
+   *       case Some("EXTERNAL") => GCS(project)
+   *       case Some("MANAGED") => BQuery(project)
+   *       case None => throw new IllegalStateException("Dataproc federation service must be available.")
+   *     }
+   *   }
+   *
+   *   case Some("hive") | None => Hive
+   * }
   * */
-
 }
-
-  // For now, fix to BigQuery. We'll clean this up.
-  def writeFormat(tableName: String): Format = ???
 }
 
 case class BQuery(project: String) extends Format {
```
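The BigQuery-client-based classification that replaces the federation lookup
above can be sketched roughly as follows. The `Format` types here are
simplified stand-ins for the PR's hierarchy, and the function is illustrative
rather than the PR's exact code; it also shows the Java-vs-Scala matching
point from the summary, since the client's definition classes have no
extractors and must be matched by runtime type:

```scala
import com.google.cloud.bigquery.{ExternalTableDefinition, StandardTableDefinition, TableDefinition}

// Simplified stand-ins for the Format hierarchy in this PR (illustrative only).
sealed trait Format
final case class BQuery(project: String) extends Format
final case class GCS(project: String) extends Format

object FormatClassifier {
  // The BigQuery client returns plain Java classes, so we use type patterns
  // (instanceof-style tests) instead of case-class destructuring.
  def formatOf(project: String, definition: TableDefinition): Format =
    definition match {
      case _: StandardTableDefinition => BQuery(project) // managed table
      case _: ExternalTableDefinition => GCS(project)    // external (e.g. GCS-backed) table
      case other =>
        throw new IllegalStateException(s"Cannot support table of type: $other")
    }
}
```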
```diff
@@ -120,6 +151,13 @@ case class BQuery(project: String) extends Format {
       .option("project", project)
       .option("query", partValsSql)
       .load()
+      .select(
+        to_date(col("partition_id"),
+          "yyyyMMdd"
+        ) // Note: this "yyyyMMdd" format is hardcoded but we need to change it to be something else.
+          .as("partition_id"))
+      .na // Should filter out '__NULL__' and '__UNPARTITIONED__'. See: https://cloud.google.com/bigquery/docs/partitioned-tables#date_timestamp_partitioned_tables
```
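The hardcoded `"yyyyMMdd"` partition format and the sentinel filtering noted
in the comments above can be illustrated in isolation. This is a plain-Scala
sketch, not the PR's Spark code, and the object name is hypothetical:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object PartitionIds {
  // BigQuery day-partitioned tables expose partition_id as "yyyyMMdd", plus
  // the sentinels "__NULL__" and "__UNPARTITIONED__", which are not dates
  // and must be dropped before parsing.
  private val DayFormat  = DateTimeFormatter.ofPattern("yyyyMMdd")
  private val Sentinels  = Set("__NULL__", "__UNPARTITIONED__")

  def parse(ids: Seq[String]): Seq[LocalDate] =
    ids.filterNot(Sentinels).map(LocalDate.parse(_, DayFormat))
}
```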