Querying large amounts of data without creating a destination table #290


Open
Mikerah opened this issue Sep 30, 2019 · 11 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments


Mikerah commented Sep 30, 2019

I'm trying to load data from BigQuery into a data frame. The result set is large, which initially required me to enable allowLargeResults. However, every time I query, I end up using resources that I shouldn't have to use. Is there a way to query a large amount of data from BigQuery without having to create a destination table every time, and just store the result in a data frame?

I am using a Jupyter notebook on an AWS instance, if that matters.

@tswast tswast added the type: question Request for information or clarification. Not an issue. label Oct 8, 2019
@tswast tswast self-assigned this Oct 8, 2019

tswast commented Oct 8, 2019

Hi @Mikerah, I recommend you try out the use_bqstorage_api=True option, which uses the BigQuery Storage API to download results (currently in Beta). Creating a destination table shouldn't be required.

See: https://pandas-gbq.readthedocs.io/en/latest/reading.html#improving-download-performance for the full instructions.
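
For reference, a minimal sketch of what that looks like (the project ID and query here are placeholders):

import pandas_gbq

# Placeholder project and query; substitute your own.
df = pandas_gbq.read_gbq(
    "SELECT * FROM `my-project.my_dataset.my_table`",
    project_id="my-project",
    use_bqstorage_api=True,  # download results via the BigQuery Storage API
)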

Closing for now, but let me know if you have any questions or this doesn't work.

@tswast tswast closed this as completed Oct 8, 2019

tswast commented Oct 9, 2019

Are you using legacy SQL? https://cloud.google.com/bigquery/docs/writing-results#limitations

The workarounds I suggest are:

  • Use standard SQL dialect (recommended)
  • OR create a dataset with a default table expiration of a few hours. That way, when you write destination tables to it, they are only kept around for a limited time (see the sketch below).
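
For the second option, a minimal sketch using the google-cloud-bigquery client (the project and dataset names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Placeholder dataset intended only for short-lived query results.
dataset = bigquery.Dataset("my-project.scratch_results")
dataset.default_table_expiration_ms = 6 * 60 * 60 * 1000  # tables auto-delete after 6 hours
client.create_dataset(dataset, exists_ok=True)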


jlynchMicron commented Jul 12, 2022

Hi @tswast,

I am executing a query that should return ~150GB of data, and I am getting the following error even when using use_bqstorage_api=True in the pandas read_gbq function.

Error: "google.api_core.exceptions.Forbidden: 403 Response too large to return. Consider specifying a destination table in your job configuration. For more details, see https://cloud.google.com/bigquery/troubleshooting-errors"

I believe this is failing in the query stage since I am not specifying a destination table to save the data to. From this issue ticket, it seems like I should not have to specify a destination table if I am using the BQ storage API. Is that true?

If this method does not work, how would you suggest executing Python/pandas queries that can regularly expect large amounts of data returned? Is there a way to change BigQuery settings to allow large temporary tables to be created? My automated workflow returns results of varying size, so always creating tables to read data from seems like a lot of extra work if I'm unsure the result size requires it.

Thanks!

Edit:
I'm actually surprised I ran into this issue, because I believe some of the other code I have written before should have already exceeded this limit. Maybe it's possible there was a lot of repeated data in that query response and the total compressed size was under the 10GB limit? I am not sure.
See this for reference: googleapis/python-bigquery#1252


tswast commented Jul 13, 2022

@jlynchMicron I think you are encountering a different error from the one described in googleapis/python-bigquery#1252

I'm not seeing a relevant limit listed at https://cloud.google.com/bigquery/quotas#query_jobs, but you might be returning too many results to write to an anonymous results table (see https://cloud.google.com/bigquery/docs/cached-results).

Are you doing a SELECT * query? If so, you can work around this by reading from the table directly.

import pandas_gbq

# Read the table directly (no query job); project and table IDs are placeholders.
data_frame = pandas_gbq.read_gbq(
    'data-project-id.dataset_id.table_id',
    project_id='billing-project-id',
)

If not, as a workaround, you can specify a destination table with the configuration argument. https://pandas-gbq.readthedocs.io/en/latest/reading.html#advanced-configuration

configuration = {
    'query': {  # https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationquery
        'destinationTable': {
            # https://cloud.google.com/bigquery/docs/reference/rest/v2/TableReference
            'projectId': 'destination-project-id',
            'datasetId': 'destination_dataset_id',
            'tableId': 'destination_table_id',
        }
    }
}
data_frame = pandas_gbq.read_gbq(
    'SELECT * FROM `test_dataset.test_table` WHERE stuff > 3',
    project_id='billing-project-id',
    configuration=configuration)

@jlynchMicron

Hi @tswast

You are right, it is a different error; I was just pointing to a past issue ticket as an example of the amount of data I can expect to receive.

As for your suggestion, is there a way for this Python code to run a test query beforehand to know whether a destination table will need to be created? The way I have my query code set up, it is basically a SQLAlchemy query wrapper, so it can have varying result sizes, from very small amounts of data to the entire table being returned (1TB+). Is there a way to dry-run the query to see how much data it is expected to return, to decide whether a destination table needs to be written?

Thanks!


tswast commented Jul 13, 2022

@jlynchMicron I don't know of a good way to know the result size without actually running the query.

In the google-cloud-bigquery library, you can run the query and wait for it to finish without downloading the results. At that point, you can check the result size before downloading it, but that still doesn't help you know in advance whether you need to create a destination table.
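
A rough sketch of that pattern with google-cloud-bigquery (the project and query are placeholders); note that a dry run only estimates bytes scanned, not the size of the result set, so it can't answer the "do I need a destination table?" question either:

from google.cloud import bigquery

client = bigquery.Client(project="billing-project-id")  # placeholder
sql = "SELECT * FROM `data-project-id.dataset_id.table_id` WHERE stuff > 3"  # placeholder

# Start the query and block until it finishes; rows are only downloaded
# once the returned iterator is actually consumed.
query_job = client.query(sql)
query_job.result()

# Inspect the size of the (possibly anonymous) results table before
# deciding how to download it.
results_table = client.get_table(query_job.destination)
print(results_table.num_bytes, "bytes in", results_table.num_rows, "rows")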

For your use case, I might suggest always creating a destination table but also setting an expiration time so that the result tables don't stay around for much longer than they are needed.

@jlynchMicron

Hi @tswast,

I tried your first code suggestion as a test, but it fails to work:

data_frame = pandas_gbq.read_gbq(
    'data-project-id.dataset_id.table_id',
    project_id='billing-project-id',
)

Error: "Reason: 400 Syntax error: Expected end of input but got identifier" ... it goes on with my table information.

Looking at the query in BigQuery, it thinks I'm trying to execute that table identifier as a SQL query and does not know what to do with it. Based on this, I don't think you can download a table just by putting in its table identifier string.

This seems like a problem, especially if I want to do a "SELECT *" on an existing table. Does that mean my only course of action is to save the result of a table "SELECT *" to another table, which the query will then download from?


tswast commented Jul 14, 2022

Hi @jlynchMicron, the feature to read directly from a table was added in pandas-gbq version 0.17.0: https://github.com/googleapis/python-bigquery-pandas/blob/main/CHANGELOG.md#0170-2022-01-19. That may explain why you are encountering this error.
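
A quick way to check which version is installed (pandas-gbq exposes its version as pandas_gbq.__version__):

import pandas_gbq

# Anything older than 0.17.0 predates reading a table by its ID alone;
# upgrade with `pip install --upgrade pandas-gbq`.
print(pandas_gbq.__version__)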

@jlynchMicron

Hi @tswast, ah looks like I am using version 0.16.0. Thanks!

@jlynchMicron

Hi @tswast,

If I know my query will exceed the "max query response size" and I will have to specify a destination table to write the results to, do you know if there is any way to specify that the result table should be a temporary table?

(I mean the kind of temporary table described in the BigQuery documentation.)

If this were possible, do you think this process could be automated into the read_gbq function as an argument flag? I am just trying to think of a nice way to download very large results for a Python/pandas data scientist who does not want to deal with extra table management just to execute a query on a big dataset.


tswast commented Aug 24, 2022

It's not possible to manually create a temporary table, but you can emulate one by creating a table with an expiration time.
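
For example, a minimal sketch of setting an expiration on an existing result table with google-cloud-bigquery (the table ID is a placeholder):

from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder

# Placeholder result table; BigQuery deletes it automatically once the
# expiration time passes, which emulates a temporary table.
table = client.get_table("my-project.my_dataset.query_results")
table.expires = datetime.now(timezone.utc) + timedelta(hours=6)
client.update_table(table, ["expires"])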

If this were possible, do you think this process could be automated into the read_gbq function as an argument flag? I am just trying to think of a nice way to download very large results for a Python/pandas data scientist who does not want to deal with extra table management just to execute a query on a big dataset.

I'd like to learn a little more about the cases in which this error can occur, as I didn't realize it could happen when using Standard SQL syntax. Perhaps one of my teammates can investigate as a feature request.

If we were to implement this, I'd expect us to create a _pandas_gbq_results dataset or similar with a default table expiration for use in emulating temporary tables.
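
In the meantime, a rough sketch of how that could be emulated in user code today; the dataset name, table naming scheme, and the read_gbq_large helper are all hypothetical, not part of pandas-gbq:

import uuid

import pandas_gbq
from google.cloud import bigquery

def read_gbq_large(sql, project_id, scratch_dataset="pandas_gbq_scratch"):
    """Run a query into a short-lived destination table, then download it."""
    client = bigquery.Client(project=project_id)

    # Hypothetical scratch dataset; tables written to it auto-expire after
    # 6 hours. Assumes the default dataset location matches your data.
    dataset = bigquery.Dataset(f"{project_id}.{scratch_dataset}")
    dataset.default_table_expiration_ms = 6 * 60 * 60 * 1000
    client.create_dataset(dataset, exists_ok=True)

    configuration = {
        "query": {
            "destinationTable": {
                "projectId": project_id,
                "datasetId": scratch_dataset,
                "tableId": f"result_{uuid.uuid4().hex}",
            }
        }
    }
    return pandas_gbq.read_gbq(sql, project_id=project_id, configuration=configuration)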

@tswast tswast reopened this Aug 24, 2022
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Aug 24, 2022
@tswast tswast added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: question Request for information or clarification. Not an issue. api: bigquery Issues related to the googleapis/python-bigquery-pandas API. labels Aug 24, 2022
@tswast tswast removed their assignment Aug 24, 2022
@meredithslota meredithslota added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Aug 23, 2023