When trying to load an entire BigQuery table that contains hidden partitions, the partition pseudo columns are returned as part of the dataframe schema, but the data cannot be loaded due to an array index error. It seems the connector can only return as many columns as there are non-hidden columns. For instance, if a table has 4 columns (Id, col1, col2, col3) plus the partitioning pseudo columns (_PARTITIONTIME, _PARTITIONDATE) for a total of 6 columns, I can only select up to 4 columns. It doesn't matter which 4, just that it is not more than 4.
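The options dict used in the snippets below isn't important for the repro; it's roughly the following, with placeholder values rather than my real ones, plus the usual col import:
>> from pyspark.sql.functions import col
>> options = {
>>     "parentProject": "my_project",           # billing project (placeholder)
>>     "credentialsFile": "/path/to/key.json",  # service account key (placeholder)
>> }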
Failing to read the dataframe returned by a default load:
>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table")
>> df.schema
StructType([
StructField('Id', LongType(), True),
StructField('col1', StringType(), True),
StructField('col2', StringType(), True),
StructField('col3', StringType(), True),
StructField('_PARTITIONTIME', TimestampType(), True),
StructField('_PARTITIONDATE', DateType(), True)
])
>> df.collect()
Py4JJavaError: An error occurred while calling o1389.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 190) (10.96.84.117 executor 1): java.lang.ArrayIndexOutOfBoundsException
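Successfully only reading the non-hidden columns (a sketch of the working call; row contents elided):
>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").select(col("Id"), col("col1"), col("col2"), col("col3"))
>> df.collect()
[Row(Id=..., col1=..., col2=..., col3=...), ...]
Successfully reading a mix of some hidden and non-hidden columns, but still only up to 4 (again a sketch, row contents elided):
>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").select(col("Id"), col("col1"), col("_PARTITIONTIME"), col("_PARTITIONDATE"))
>> df.collect()
[Row(Id=..., col1=..., _PARTITIONTIME=..., _PARTITIONDATE=...), ...]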
Failing to read when all six columns are explicitly selected:
>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").select(col("Id"), col("col1"), col("col2"), col("col3"), col("_PARTITIONTIME"), col("_PARTITIONDATE"))
>> df.schema
StructType([
StructField('Id', LongType(), True),
StructField('col1', StringType(), True),
StructField('col2', StringType(), True),
StructField('col3', StringType(), True),
StructField('_PARTITIONTIME', TimestampType(), True),
StructField('_PARTITIONDATE', DateType(), True)
])
>> df.collect()
Py4JJavaError: An error occurred while calling o1389.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 190) (10.96.84.117 executor 1): java.lang.ArrayIndexOutOfBoundsException
I am still able to reference all columns (probably because of predicate pushdown), so I can do something like filtering on any of the 6 columns, so long as the query only returns 4 or fewer columns.
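For example, a filter on a pseudo column combined with a 4-column select works (a sketch; the partition date is a placeholder):
>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").filter(col("_PARTITIONDATE") == "2024-01-01").select(col("Id"), col("col1"), col("col2"), col("col3"))
>> df.collect()
[Row(Id=..., col1=..., col2=..., col3=...), ...]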
It seems the correct full schema is returned by .load(); however, when the read is actually executed, the connector appears to be working with an array of only the 4 non-hidden columns, so projecting all 6 columns causes the array index error.
For reference, the following query works in BigQuery:
SELECT Id, col1, col2, col3, _PARTITIONTIME, _PARTITIONDATE
FROM `my_project.my_dataset.my_table`
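If I'm reading the connector docs correctly, the same SQL can be pushed through the connector's query option as a workaround (this needs viewsEnabled and a materializationDataset; the dataset name below is a placeholder), though that doesn't address the bug itself:
>> query = "SELECT Id, col1, col2, col3, _PARTITIONTIME, _PARTITIONDATE FROM `my_project.my_dataset.my_table`"
>> df = spark.read.format("bigquery").options(**options).option("viewsEnabled", "true").option("materializationDataset", "my_dataset").option("query", query).load()
>> df.collect()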
Details:
I'm running this on Databricks Runtime Version "16.3 ML (includes Apache Spark 3.5.2, Scala 2.12)".
Expected Behavior: spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").collect() should return data, not an index error. Ideally, the result would contain the pseudo columns in addition to the normal columns, but it definitely shouldn't return an error.