Reading table with pseudo columns has ArrayIndexOutOfBoundsException #1361

wkeithvan opened this issue Apr 28, 2025 · 0 comments

When loading an entire BigQuery table that has hidden partition pseudo columns, the pseudo columns are returned as part of the dataframe schema, but the data cannot be read because of an array index error. It seems the connector can only return as many columns as the table has non-hidden columns. For instance, if a table has 4 columns (id, col1, col2, col3) plus pseudo columns as partitions (_PARTITIONTIME, _PARTITIONDATE) for a total of 6 columns, I can only select up to 4 columns. It doesn't matter which 4, just that it is not more than 4.
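For reference, the snippets below assume a setup roughly like the following; the connector option values shown here are placeholders for my environment, not the exact ones I use:

from datetime import date
from pyspark.sql.functions import col

# Placeholder spark-bigquery-connector options; real project/credential values omitted
options = {
    "parentProject": "my_project",
    "credentialsFile": "/path/to/credentials.json",
}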

Failing to read the default dataframe returned by .load():

>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table")
>> df.schema
StructType([
    StructField('Id', LongType(), True),
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
    StructField('_PARTITIONTIME', TimestampType(), True),
    StructField('_PARTITIONDATE', DateType(), True)
])
>> df.collect()
Py4JJavaError: An error occurred while calling o1389.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 190) (10.96.84.117 executor 1): java.lang.ArrayIndexOutOfBoundsException

Successfully reading only the non-hidden columns:

>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").select(col("Id"), col("col1"), col("col2"), col("col3"))
>> df.schema
StructType([
    StructField('Id', LongType(), True),
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
])
>> df.collect()
[
    Row(Id=1, col1='foo', col2='bar', col3='quux'),
    ...
]

Successfully reading a mix of hidden and non-hidden columns, but still only up to 4:

>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").select(col("Id"), col("col1"), col("_PARTITIONTIME"), col("_PARTITIONDATE"))
>> df.schema
StructType([
    StructField('Id', LongType(), True),
    StructField('col1', StringType(), True),
    StructField('_PARTITIONTIME', TimestampType(), True),
    StructField('_PARTITIONDATE', DateType(), True),
])
>> df.collect()
[
    Row(Id=1, col1='foo', _PARTITIONTIME=datetime.datetime(2025, 1, 1, 0, 0), _PARTITIONDATE=datetime.date(2025, 1, 1)),
    ...
]

Failing to read all six columns when they are explicitly selected:

>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").select(col("Id"), col("col1"), col("col2"), col("col3"), col("_PARTITIONTIME"), col("_PARTITIONDATE"))
>> df.schema
StructType([
    StructField('Id', LongType(), True),
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
    StructField('_PARTITIONTIME', TimestampType(), True),
    StructField('_PARTITIONDATE', DateType(), True)
])
>> df.collect()
Py4JJavaError: An error occurred while calling o1389.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 190) (10.96.84.117 executor 1): java.lang.ArrayIndexOutOfBoundsException

I am still able to reference all columns (probably because of predicate pushdown), so I can do something like filtering on any of the 6 columns, as long as the query returns 4 or fewer columns:

>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").filter(col("_PARTITIONDATE") == date(2025, 1, 1)).select(col("Id"), col("col1"), col("col2"), col("col3"))
>> df.schema
StructType([
    StructField('Id', LongType(), True),
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
])
>> df.collect()
[
    Row(Id=1, col1='foo', col2='bar', col3='quux'),
    ...
]

It seems the correct full schema is returned by .load(); however, when the query actually runs, the Spark connector appears to hold an array sized for the 4 non-hidden columns, and returning the 6 columns causes the array index error.

For reference, the following query works in BigQuery:

SELECT Id, col1, col2, col3, _PARTITIONTIME, _PARTITIONDATE
FROM `my_project.my_dataset.my_table`
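
A possible workaround (untested on my side, and assuming the connector's documented query, viewsEnabled, and materializationDataset options behave as described in its README) would be to push the same SQL down to BigQuery instead of loading the table directly:

>> spark.conf.set("viewsEnabled", "true")
>> spark.conf.set("materializationDataset", "my_dataset")
>> sql = "SELECT Id, col1, col2, col3, _PARTITIONTIME, _PARTITIONDATE FROM `my_project.my_dataset.my_table`"
>> df = spark.read.format("bigquery").options(**options).option("query", sql).load()

I haven't confirmed whether the materialized result keeps the pseudo column values, but since BigQuery executes the SQL itself, it may sidestep the column-count problem in the connector.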

Details:
I'm running this in Databricks on Databricks Runtime Version "16.3 ML (includes Apache Spark 3.5.2, Scala 2.12)"

Expected Behavior:
spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").collect() should return data, not an index error. Ideally, the result would contain the pseudo columns in addition to the normal columns, but it definitely shouldn't raise an error.
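
In other words, with illustrative values, df.collect() on the default dataframe would ideally return rows like:

[
    Row(Id=1, col1='foo', col2='bar', col3='quux', _PARTITIONTIME=datetime.datetime(2025, 1, 1, 0, 0), _PARTITIONDATE=datetime.date(2025, 1, 1)),
    ...
]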
