Commit bb9d374
✨ [source-google-sheets] add row_batch_size as an input parameter with higher increase (#35404)
Co-authored-by: Marcos Marx <[email protected]>
Co-authored-by: marcosmarxm <[email protected]>
1 parent b7819d9 commit bb9d374

File tree

7 files changed (+36 −9 lines)

airbyte-integrations/connectors/source-google-sheets/metadata.yaml (+1 −1)

@@ -10,7 +10,7 @@ data:
   connectorSubtype: file
   connectorType: source
   definitionId: 71607ba1-c0ac-4799-8049-7f4b90dd50f7
-  dockerImageTag: 0.5.0
+  dockerImageTag: 0.5.1
   dockerRepository: airbyte/source-google-sheets
   documentationUrl: https://docs.airbyte.com/integrations/sources/google-sheets
   githubIssueLabel: source-google-sheets

airbyte-integrations/connectors/source-google-sheets/pyproject.toml (+1 −1)

@@ -3,7 +3,7 @@ requires = [ "poetry-core>=1.0.0",]
 build-backend = "poetry.core.masonry.api"

 [tool.poetry]
-version = "0.5.0"
+version = "0.5.1"
 name = "source-google-sheets"
 description = "Source implementation for Google Sheets."
 authors = [ "Airbyte <[email protected]>",]

airbyte-integrations/connectors/source-google-sheets/source_google_sheets/client.py (+1 −1)

@@ -21,7 +21,7 @@ class Backoff:
     @classmethod
     def increase_row_batch_size(cls, details):
         if details["exception"].status_code == status_codes.TOO_MANY_REQUESTS and cls.row_batch_size < 1000:
-            cls.row_batch_size = cls.row_batch_size + 10
+            cls.row_batch_size = cls.row_batch_size + 100
             logger.info(f"Increasing number of records fetching due to rate limits. Current value: {cls.row_batch_size}")

     @staticmethod
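The change above grows the batch by 100 rows (previously 10) on each rate-limit error, capped at 1000. A minimal, self-contained sketch of that behavior, with stand-ins for the connector's `status_codes` constant, `logger`, and the HTTP error object (the real class is wired into the `backoff` retry machinery):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

TOO_MANY_REQUESTS = 429  # stand-in for status_codes.TOO_MANY_REQUESTS


class Backoff:
    row_batch_size = 200  # default; now seeded from the new batch_size config option

    @classmethod
    def increase_row_batch_size(cls, details):
        # On a 429 rate-limit error, grow the batch by 100 rows, capped at 1000.
        if details["exception"].status_code == TOO_MANY_REQUESTS and cls.row_batch_size < 1000:
            cls.row_batch_size = cls.row_batch_size + 100
            logger.info(f"Increasing number of records fetching due to rate limits. Current value: {cls.row_batch_size}")


class FakeRateLimitError(Exception):
    # Hypothetical stand-in for requests.HTTPError carrying a status_code, as in the unit tests.
    status_code = 429


Backoff.increase_row_batch_size({"exception": FakeRateLimitError()})
print(Backoff.row_batch_size)  # 300

Backoff.row_batch_size = 1000
Backoff.increase_row_batch_size({"exception": FakeRateLimitError()})
print(Backoff.row_batch_size)  # still 1000: the < 1000 guard stops further growth
```

Because `row_batch_size` is a class attribute, the increase persists across retries within the process, which is why the unit tests below reset it explicitly.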

airbyte-integrations/connectors/source-google-sheets/source_google_sheets/source.py (+1)

@@ -149,6 +149,7 @@ def _read(
         catalog: ConfiguredAirbyteCatalog,
     ) -> Generator[AirbyteMessage, None, None]:
         client = GoogleSheetsClient(self.get_credentials(config))
+        client.Backoff.row_batch_size = config.get("batch_size", 200)

         sheet_to_column_name = Helpers.parse_sheet_and_column_names_from_catalog(catalog)
         stream_name_to_stream = {stream.stream.name: stream for stream in catalog.streams}
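The added line seeds the class-level batch size from the user-supplied config before the read loop starts. A small sketch of the `dict.get` fallback it relies on (the config arrives as a plain mapping; `resolve_batch_size` is a hypothetical helper for illustration):

```python
def resolve_batch_size(config: dict) -> int:
    # Fall back to the 200-row default when the optional batch_size key is absent.
    return config.get("batch_size", 200)


print(resolve_batch_size({}))                    # 200
print(resolve_batch_size({"batch_size": 500}))   # 500
```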

airbyte-integrations/connectors/source-google-sheets/source_google_sheets/spec.yaml (+15)

@@ -8,6 +8,21 @@ connectionSpecification:
     - credentials
   additionalProperties: true
   properties:
+    batch_size:
+      type: integer
+      title: Row Batch Size
+      description: >-
+        Default value is 200.
+        An integer representing row batch size for each sent request to Google Sheets API.
+        Row batch size means how many rows are processed from the google sheet, for example default value 200
+        would process rows 1-201, then 201-401 and so on.
+        Based on <a href='https://developers.google.com/sheets/api/limits'>Google Sheets API limits documentation</a>,
+        it is possible to send up to 300 requests per minute, but each individual request has to be processed under 180 seconds,
+        otherwise the request returns a timeout error. In regards to this information, consider network speed and
+        number of columns of the google sheet when deciding a batch_size value.
+        Default value should cover most of the cases, but if a google sheet has over 100,000 records or more,
+        consider increasing batch_size value.
+      default: 200
     spreadsheet_id:
       type: string
       title: Spreadsheet Link
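The batch size ultimately determines the A1-style row range each API request asks for. A sketch mirroring the `_create_range` behavior visible in the unit tests below (`"spreadsheet_id!0:300"` for a cursor of 0 and a batch of 300); the function name and standalone form are illustrative, since the real method lives on the client class:

```python
def create_range(sheet: str, row_cursor: int, row_batch_size: int) -> str:
    # One request covers rows [row_cursor, row_cursor + row_batch_size]
    # in the "<sheet>!<start>:<end>" range notation used by the Sheets API.
    return f"{sheet}!{row_cursor}:{row_cursor + row_batch_size}"


print(create_range("Sheet1", 0, 200))    # Sheet1!0:200
print(create_range("Sheet1", 200, 200))  # Sheet1!200:400
```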

airbyte-integrations/connectors/source-google-sheets/unit_tests/test_client.py (+4 −4)

@@ -24,8 +24,8 @@ def test_backoff_increase_row_batch_size():
     e = requests.HTTPError("error")
     e.status_code = 429
     client.Backoff.increase_row_batch_size({"exception": e})
-    assert client.Backoff.row_batch_size == 210
-    assert client._create_range("spreadsheet_id", 0) == "spreadsheet_id!0:210"
+    assert client.Backoff.row_batch_size == 300
+    assert client._create_range("spreadsheet_id", 0) == "spreadsheet_id!0:300"
     client.Backoff.row_batch_size = 1000
     client.Backoff.increase_row_batch_size({"exception": e})
     assert client.Backoff.row_batch_size == 1000

@@ -57,12 +57,12 @@ def test_client_get_values_on_backoff(caplog):
     e = requests.HTTPError("error")
     e.status_code = 429
     client_google_sheets.Backoff.increase_row_batch_size({"exception": e})
-    assert client_google_sheets.Backoff.row_batch_size == 220
+    assert client_google_sheets.Backoff.row_batch_size == 310
     client_google_sheets.get_values(
         sheet="sheet",
         row_cursor=0,
         spreadsheetId="spreadsheet_id",
         majorDimension="ROWS",
     )

-    assert "Fetching range sheet!0:220" in caplog.text
+    assert "Fetching range sheet!0:310" in caplog.text

docs/integrations/sources/google-sheets.md (+13 −2)

@@ -97,8 +97,18 @@ If your spreadsheet is viewable by anyone with its link, no further action is ne
 - To authenticate your Google account via OAuth, select **Authenticate via Google (OAuth)** from the dropdown and enter your Google application's client ID, client secret, and refresh token.
 <!-- /env:oss -->
 6. For **Spreadsheet Link**, enter the link to the Google spreadsheet. To get the link, go to the Google spreadsheet you want to sync, click **Share** in the top right corner, and click **Copy Link**.
-7. (Optional) You may enable the option to **Convert Column Names to SQL-Compliant Format**. Enabling this option will allow the connector to convert column names to a standardized, SQL-friendly format. For example, a column name of `Café Earnings 2022` will be converted to `cafe_earnings_2022`. We recommend enabling this option if your target destination is SQL-based (ie Postgres, MySQL). Set to false by default.
-8. Click **Set up source** and wait for the tests to complete.
+7. For **Batch Size**, enter an integer which represents batch size when processing a Google Sheet. Default value is 200.
+   Batch size is an integer representing row batch size for each sent request to Google Sheets API.
+   Row batch size means how many rows are processed from the google sheet, for example default value 200
+   would process rows 1-201, then 201-401 and so on.
+   Based on [Google Sheets API limits documentation](https://developers.google.com/sheets/api/limits),
+   it is possible to send up to 300 requests per minute, but each individual request has to be processed under 180 seconds,
+   otherwise the request returns a timeout error. In regards to this information, consider network speed and
+   number of columns of the google sheet when deciding a batch_size value.
+   Default value should cover most of the cases, but if a google sheet has over 100,000 records or more,
+   consider increasing batch_size value.
+8. (Optional) You may enable the option to **Convert Column Names to SQL-Compliant Format**. Enabling this option will allow the connector to convert column names to a standardized, SQL-friendly format. For example, a column name of `Café Earnings 2022` will be converted to `cafe_earnings_2022`. We recommend enabling this option if your target destination is SQL-based (ie Postgres, MySQL). Set to false by default.
+9. Click **Set up source** and wait for the tests to complete.

@@ -151,6 +161,7 @@ Airbyte batches requests to the API in order to efficiently pull data and respec

 | Version | Date | Pull Request | Subject |
 |---------|------------|----------------------------------------------------------|-----------------------------------------------------------------------------------|
+| 0.5.1 | 2024-04-11 | [35404](https://github.com/airbytehq/airbyte/pull/35404) | Add `row_batch_size` parameter more granular control read records |
 | 0.5.0 | 2024-03-26 | [36515](https://github.com/airbytehq/airbyte/pull/36515) | Resolve poetry dependency conflict, add record counts to state messages |
 | 0.4.0 | 2024-03-19 | [36267](https://github.com/airbytehq/airbyte/pull/36267) | Pin airbyte-cdk version to `^0` |
 | 0.3.17 | 2024-02-29 | [35722](https://github.com/airbytehq/airbyte/pull/35722) | Add logic to emit stream statuses |
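The documentation's sizing advice can be made concrete with simple arithmetic: each batch is one Sheets API request, so the request count for a sheet is roughly the row count divided by the batch size, which is why larger batches help sheets with 100,000+ rows stay under the 300-requests-per-minute limit. An illustrative calculation (row counts here are hypothetical examples, not from the source):

```python
import math


def estimated_requests(total_rows: int, batch_size: int) -> int:
    # One Sheets API request is issued per batch of rows.
    return math.ceil(total_rows / batch_size)


print(estimated_requests(100_000, 200))    # 500 requests at the default batch size
print(estimated_requests(100_000, 1000))   # 100 requests at the maximum batch size
```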
