Reading table with pseudo columns has ArrayIndexOutOfBoundsException #1361

wkeithvan opened this issue Apr 28, 2025 · 0 comments

When loading an entire BigQuery table that has hidden partition pseudo columns, the pseudo columns are returned as part of the dataframe schema, but the data cannot be read because of an array index error. It seems the connector can only return as many columns as the table has non-hidden columns. For instance, if a table has 4 columns (id, col1, col2, col3) plus pseudo columns as partitions (_PARTITIONTIME, _PARTITIONDATE) for a total of 6 columns, I can only select up to 4 columns. It doesn't matter which 4, just that it is not more than 4.
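For reference, the snippets below assume a setup roughly like the following; the connector option values shown here are placeholders for my environment, not the exact ones I use:

from datetime import date
from pyspark.sql.functions import col

# Placeholder spark-bigquery-connector options; real project/credential values omitted
options = {
    "parentProject": "my_project",
    "credentialsFile": "/path/to/credentials.json",
}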

Failing to read the default dataframe returned by .load():

>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table")
>> df.schema
StructType([
    StructField('Id', LongType(), True),
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
    StructField('_PARTITIONTIME', TimestampType(), True),
    StructField('_PARTITIONDATE', DateType(), True)
])
>> df.collect()
Py4JJavaError: An error occurred while calling o1389.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 190) (10.96.84.117 executor 1): java.lang.ArrayIndexOutOfBoundsException

Successfully reading only the non-hidden columns:

>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").select(col("Id"), col("col1"), col("col2"), col("col3"))
>> df.schema
StructType([
    StructField('Id', LongType(), True),
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
])
>> df.collect()
[
    Row(Id=1, col1='foo', col2='bar', col3='quux'),
    ...
]

Successfully reading a mix of hidden and non-hidden columns, but still only up to 4:

>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").select(col("Id"), col("col1"), col("_PARTITIONTIME"), col("_PARTITIONDATE"))
>> df.schema
StructType([
    StructField('Id', LongType(), True),
    StructField('col1', StringType(), True),
    StructField('_PARTITIONTIME', TimestampType(), True),
    StructField('_PARTITIONDATE', DateType(), True),
])
>> df.collect()
[
    Row(Id=1, col1='foo', _PARTITIONTIME=datetime.datetime(2025, 1, 1, 0, 0), _PARTITIONDATE=datetime.date(2025, 1, 1)),
    ...
]

Failing to read all six columns when they are explicitly selected:

>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").select(col("Id"), col("col1"), col("col2"), col("col3"), col("_PARTITIONTIME"), col("_PARTITIONDATE"))
>> df.schema
StructType([
    StructField('Id', LongType(), True),
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
    StructField('_PARTITIONTIME', TimestampType(), True),
    StructField('_PARTITIONDATE', DateType(), True)
])
>> df.collect()
Py4JJavaError: An error occurred while calling o1389.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 190) (10.96.84.117 executor 1): java.lang.ArrayIndexOutOfBoundsException

I am still able to reference all columns (probably because of predicate pushdown), so I can do something like filtering on any of the 6 columns, as long as the query returns 4 or fewer columns:

>> df = spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").filter(col("_PARTITIONDATE") == date(2025, 1, 1)).select(col("Id"), col("col1"), col("col2"), col("col3"))
>> df.schema
StructType([
    StructField('Id', LongType(), True),
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
])
>> df.collect()
[
    Row(Id=1, col1='foo', col2='bar', col3='quux'),
    ...
]

It seems the correct full schema is returned by .load(); however, when the query actually runs, the Spark connector appears to hold an array sized for the 4 non-hidden columns, and returning the 6 columns causes the array index error.

For reference, the following query works in BigQuery:

SELECT Id, col1, col2, col3, _PARTITIONTIME, _PARTITIONDATE
FROM `my_project.my_dataset.my_table`
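
A possible workaround (untested on my side, and assuming the connector's documented query, viewsEnabled, and materializationDataset options behave as described in its README) would be to push the same SQL down to BigQuery instead of loading the table directly:

>> spark.conf.set("viewsEnabled", "true")
>> spark.conf.set("materializationDataset", "my_dataset")
>> sql = "SELECT Id, col1, col2, col3, _PARTITIONTIME, _PARTITIONDATE FROM `my_project.my_dataset.my_table`"
>> df = spark.read.format("bigquery").options(**options).option("query", sql).load()

I haven't confirmed whether the materialized result keeps the pseudo column values, but since BigQuery executes the SQL itself, it may sidestep the column-count problem in the connector.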

Details:
I'm running this in Databricks on Databricks Runtime Version "16.3 ML (includes Apache Spark 3.5.2, Scala 2.12)"

Expected Behavior:
spark.read.format("bigquery").options(**options).load("my_project.my_dataset.my_table").collect() should return data, not an index error. Ideally, the result would contain the pseudo columns in addition to the normal columns, but it definitely shouldn't raise an error.
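
In other words, with illustrative values, df.collect() on the default dataframe would ideally return rows like:

[
    Row(Id=1, col1='foo', col2='bar', col3='quux', _PARTITIONTIME=datetime.datetime(2025, 1, 1, 0, 0), _PARTITIONDATE=datetime.date(2025, 1, 1)),
    ...
]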
