Commit 1d3a17a

Phlair, jzhuan-icims, and davinchia authored
🎉 Source S3 - memory & performance optimisations + advanced CSV options (#6615)
* memory & performance optimisations
* address comments
* version bump
* added advanced_options for reading csv without header, and more custom pyarrow ReadOptions
* updated to use the latest airbyte-cdk
* updated docs
* bump source-s3 to 0.1.6
* remove unneeded lines
* Use the all dep ami for python builds.
* ec2-instance-id should be ec2-image-id
* ec2-instance-id should be ec2-image-id

Co-authored-by: Jingkun Zhuang <[email protected]>
Co-authored-by: Davin Chia <[email protected]>
1 parent 25110c1 commit 1d3a17a

File tree

15 files changed (+168, -70 lines)

.github/workflows/publish-command.yml

Lines changed: 1 addition & 0 deletions

@@ -34,6 +34,7 @@ jobs:
           aws-access-key-id: ${{ secrets.SELF_RUNNER_AWS_ACCESS_KEY_ID }}
           aws-secret-access-key: ${{ secrets.SELF_RUNNER_AWS_SECRET_ACCESS_KEY }}
           github-token: ${{ secrets.SELF_RUNNER_GITHUB_ACCESS_TOKEN }}
+          ec2-image-id: ami-0d648081937c75a73
   publish-image:
     needs: start-publish-image-runner
     runs-on: ${{ needs.start-publish-image-runner.outputs.label }}

.github/workflows/test-command.yml

Lines changed: 1 addition & 0 deletions

@@ -33,6 +33,7 @@ jobs:
           aws-access-key-id: ${{ secrets.SELF_RUNNER_AWS_ACCESS_KEY_ID }}
           aws-secret-access-key: ${{ secrets.SELF_RUNNER_AWS_SECRET_ACCESS_KEY }}
           github-token: ${{ secrets.SELF_RUNNER_GITHUB_ACCESS_TOKEN }}
+          ec2-image-id: ami-0d648081937c75a73
   integration-test:
     timeout-minutes: 240
     needs: start-test-runner

airbyte-config/init/src/main/resources/config/STANDARD_SOURCE_DEFINITION/69589781-7828-43c5-9f63-8925b1c1ccc2.json

Lines changed: 1 addition & 1 deletion

@@ -2,6 +2,6 @@
   "sourceDefinitionId": "69589781-7828-43c5-9f63-8925b1c1ccc2",
   "name": "S3",
   "dockerRepository": "airbyte/source-s3",
-  "dockerImageTag": "0.1.5",
+  "dockerImageTag": "0.1.6",
   "documentationUrl": "https://docs.airbyte.io/integrations/sources/s3"
 }

airbyte-config/init/src/main/resources/seed/source_definitions.yaml

Lines changed: 1 addition & 1 deletion

@@ -85,7 +85,7 @@
 - sourceDefinitionId: 69589781-7828-43c5-9f63-8925b1c1ccc2
   name: S3
   dockerRepository: airbyte/source-s3
-  dockerImageTag: 0.1.5
+  dockerImageTag: 0.1.6
   documentationUrl: https://docs.airbyte.io/integrations/sources/s3
   sourceType: file
 - sourceDefinitionId: fbb5fbe2-16ad-4cf4-af7d-ff9d9c316c87

airbyte-integrations/connectors/source-s3/Dockerfile

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@ COPY source_s3 ./source_s3
 ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
 ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]

-LABEL io.airbyte.version=0.1.5
+LABEL io.airbyte.version=0.1.6
 LABEL io.airbyte.name=airbyte/source-s3

airbyte-integrations/connectors/source-s3/integration_tests/spec.json

Lines changed: 9 additions & 0 deletions

@@ -93,6 +93,15 @@
             "{\"timestamp_parsers\": [\"%m/%d/%Y %H:%M\", \"%Y/%m/%d %H:%M\"], \"strings_can_be_null\": true, \"null_values\": [\"NA\", \"NULL\"]}"
           ],
           "type": "string"
+        },
+        "advanced_options": {
+          "title": "Advanced Options",
+          "description": "Optionally add a valid JSON string here to provide additional <a href=\"https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions\" target=\"_blank\">Pyarrow ReadOptions</a>. Specify 'column_names' here if your CSV doesn't have header, or if you want to use custom column names. 'block_size' and 'encoding' are already used above, specify them again here will override the values above.",
+          "default": "{}",
+          "examples": [
+            "{\"column_names\": [\"column1\", \"column2\"]}"
+          ],
+          "type": "string"
         }
       }
     },

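The new `advanced_options` property above is a JSON string that is parsed and forwarded to pyarrow's CSV ReadOptions. A minimal sketch of a value a user might supply, with illustrative column names and options that are not part of this commit:

```python
import json

# Hypothetical "advanced_options" value for a headerless CSV; any keyword accepted
# by pyarrow.csv.ReadOptions can appear here. The connector parses this string with
# json.loads, so it must be valid JSON.
advanced_options = json.dumps(
    {
        "column_names": ["id", "name", "created_at"],  # name the columns yourself when there is no header row
        "skip_rows": 1,                                # e.g. skip a leading junk row
    }
)

# Invalid JSON would only surface when the connector runs, so a quick local
# sanity check before saving the config is cheap insurance.
json.loads(advanced_options)
```

Per the description, `block_size` and `encoding` supplied here take precedence over the dedicated fields above.
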
airbyte-integrations/connectors/source-s3/setup.py

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@
 from setuptools import find_packages, setup

 MAIN_REQUIREMENTS = [
-    "airbyte-cdk~=0.1.7",
+    "airbyte-cdk~=0.1.28",
     "pyarrow==4.0.1",
     "smart-open[s3]==5.1.0",
     "wcmatch==8.2",

airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/formats/csv_parser.py

Lines changed: 4 additions & 1 deletion

@@ -28,7 +28,10 @@ def _read_options(self):
         https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html
         build ReadOptions object like: pa.csv.ReadOptions(**self._read_options())
         """
-        return {"block_size": self._format.get("block_size", 10000), "encoding": self._format.get("encoding", "utf8")}
+        return {
+            **{"block_size": self._format.get("block_size", 10000), "encoding": self._format.get("encoding", "utf8")},
+            **json.loads(self._format.get("advanced_options", "{}")),
+        }

     def _parse_options(self):
         """

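For reference, a self-contained sketch of what the updated `_read_options` produces and how it feeds pyarrow, assuming a format config shaped like the connector's (the values below are illustrative):

```python
import json

from pyarrow import csv as pa_csv

# Illustrative format config as the connector would receive it; "advanced_options"
# is a JSON string, mirroring the spec shown earlier.
format_config = {
    "encoding": "utf8",
    "advanced_options": json.dumps({"column_names": ["id", "name"], "block_size": 20000}),
}

read_options_kwargs = {
    # defaults first ...
    **{"block_size": format_config.get("block_size", 10000), "encoding": format_config.get("encoding", "utf8")},
    # ... then the parsed advanced_options, so a duplicated key (block_size here) overrides the default.
    **json.loads(format_config.get("advanced_options", "{}")),
}

read_options = pa_csv.ReadOptions(**read_options_kwargs)
# Passing column_names tells pyarrow the file has no header row:
# table = pa_csv.read_csv("data_without_header.csv", read_options=read_options)
```
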
airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/formats/csv_spec.py

Lines changed: 5 additions & 0 deletions

@@ -50,3 +50,8 @@ class Config:
             '{"timestamp_parsers": ["%m/%d/%Y %H:%M", "%Y/%m/%d %H:%M"], "strings_can_be_null": true, "null_values": ["NA", "NULL"]}'
         ],
     )
+    advanced_options: str = Field(
+        default="{}",
+        description="Optionally add a valid JSON string here to provide additional <a href=\"https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions\" target=\"_blank\">Pyarrow ReadOptions</a>. Specify 'column_names' here if your CSV doesn't have header, or if you want to use custom column names. 'block_size' and 'encoding' are already used above, specify them again here will override the values above.",
+        examples=["{\"column_names\": [\"column1\", \"column2\"]}"],
+    )

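The spec.json change earlier in this commit is generated from this pydantic Field. A rough sketch of that relationship, assuming pydantic v1 as used by the airbyte-cdk at the time, where extra Field kwargs such as `examples` are copied into the generated JSON schema; the class name and shortened description below are hypothetical:

```python
from pydantic import BaseModel, Field


class DemoCsvFormat(BaseModel):
    # Mirrors the advanced_options field above; pydantic v1 derives the
    # "Advanced Options" title from the field name and copies extra kwargs
    # like `examples` into the JSON schema it generates.
    advanced_options: str = Field(
        default="{}",
        description="Optionally add a valid JSON string here to provide additional Pyarrow ReadOptions.",
        examples=['{"column_names": ["column1", "column2"]}'],
    )


# Printing the schema shows default, description and examples, which is how the
# connector's spec.json picks them up.
print(DemoCsvFormat.schema_json(indent=2))
```
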
airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/stream.py

Lines changed: 96 additions & 58 deletions
Large diffs are not rendered by default.
