Commit e09b001

mrhydejc authored and lazebnyi committed
✨ Source File: support ZIP file (airbytehq#32354)
Co-authored-by: Serhii Lazebnyi <[email protected]>
1 parent 83a3fe2 commit e09b001

5 files changed: +21 -14 lines changed

airbyte-integrations/connectors/source-file/integration_tests/client_storage_providers_test.py

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ def check_read(config, expected_columns=10, expected_rows=42):
         ("ssh", "files/test.csv", "csv"),
         ("scp", "files/test.csv", "csv"),
         ("sftp", "files/test.csv", "csv"),
+        ("ssh", "files/test.csv.zip", "csv"),
         ("ssh", "files/test.csv.gz", "csv"), # text in binary
         ("ssh", "files/test.pkl", "pickle"), # binary
         ("sftp", "files/test.pkl.gz", "pickle"), # binary in binary

airbyte-integrations/connectors/source-file/metadata.yaml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ data:
   connectorSubtype: file
   connectorType: source
   definitionId: 778daa7c-feaf-4db6-96f3-70fd645acc77
-  dockerImageTag: 0.3.17
+  dockerImageTag: 0.4.0
   dockerRepository: airbyte/source-file
   documentationUrl: https://docs.airbyte.com/integrations/sources/file
   githubIssueLabel: source-file

airbyte-integrations/connectors/source-file/unit_tests/test_client.py

Lines changed: 5 additions & 0 deletions
@@ -143,6 +143,11 @@ def test_cache_stream(client, absolute_path, test_files):
     f = f"{absolute_path}/{test_files}/test.csv"
     with open(f, mode="rb") as file:
         assert client._cache_stream(file)
+
+def test_unzip_stream(client, absolute_path, test_files):
+    f = f"{absolute_path}/{test_files}/test.csv.zip"
+    with open(f, mode="rb") as file:
+        assert client._unzip(file)


 def test_open_aws_url():
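
The new unit test only asserts that `client._unzip(file)` returns something truthy; the body of `_unzip` is not shown in this diff. As a rough illustration of what unzipping a binary stream can look like, here is a minimal sketch using only the standard library. The function name `unzip_first_member` is hypothetical, and the choice to read only the first file in the archive is an assumption, not the connector's confirmed behavior.

```python
import io
import shutil
import tempfile
import zipfile


def unzip_first_member(binary_stream):
    """Copy the first file inside a ZIP archive into a temporary file.

    Hypothetical helper for illustration only; it is NOT the connector's
    `_unzip` method. `binary_stream` must be a seekable binary file object.
    """
    with zipfile.ZipFile(binary_stream) as archive:
        # Skip directory entries, which end with "/".
        names = [name for name in archive.namelist() if not name.endswith("/")]
        if not names:
            raise ValueError("ZIP archive contains no files")
        extracted = tempfile.TemporaryFile(mode="w+b")
        with archive.open(names[0]) as member:
            shutil.copyfileobj(member, extracted)
    # Rewind so the caller can read from the beginning.
    extracted.seek(0)
    return extracted


# Usage: build a small archive in memory and read the CSV back out.
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as zf:
    zf.writestr("test.csv", "a,b\n1,2\n")
archive_bytes.seek(0)

with unzip_first_member(archive_bytes) as csv_file:
    content = csv_file.read()
```

Writing to a temporary file rather than holding the member in memory keeps the sketch usable for large archives, at the cost of disk I/O.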

docs/integrations/sources/file.md

Lines changed: 14 additions & 13 deletions
@@ -126,7 +126,7 @@ This connector does not support syncing unstructured data files such as raw text
 ## Supported sync modes

 | Feature | Supported? |
-| ---------------------------------------- | ---------- |
+|------------------------------------------|------------|
 | Full Refresh Sync | Yes |
 | Incremental Sync | No |
 | Replicate Incremental Deletes | No |
@@ -140,9 +140,9 @@ This source produces a single table for the target file as it replicates only on
 ## File / Stream Compression

 | Compression | Supported? |
-| ----------- | ---------- |
+|-------------|------------|
 | Gzip | Yes |
-| Zip | No |
+| Zip | Yes |
 | Bzip2 | No |
 | Lzma | No |
 | Xz | No |
@@ -151,7 +151,7 @@ This source produces a single table for the target file as it replicates only on
 ## Storage Providers

 | Storage Providers | Supported? |
-| ---------------------- | ----------------------------------------------- |
+|------------------------|-------------------------------------------------|
 | HTTPS | Yes |
 | Google Cloud Storage | Yes |
 | Amazon Web Services S3 | Yes |
@@ -162,7 +162,7 @@ This source produces a single table for the target file as it replicates only on
 ### File Formats

 | Format | Supported? |
-| --------------------- | ---------- |
+|-----------------------|------------|
 | CSV | Yes |
 | JSON/JSONL | Yes |
 | HTML | No |
@@ -183,24 +183,24 @@ Normally, Airbyte tries to infer the data type from the source, but you can use

 Here are a list of examples of possible file inputs:

-| Dataset Name | Storage | URL | Reader Impl | Service Account | Description |
-| ----------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | -------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| epidemiology | HTTPS | [https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv](https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv) | | | [COVID-19 Public dataset](https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-public-data-program?filter=solution-type:dataset&id=7d6cc408-53c8-4485-a187-b8cb9a5c0b56) on BigQuery |
+| Dataset Name | Storage | URL | Reader Impl | Service Account | Description |
+|-------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| epidemiology | HTTPS | [https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv](https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv) | | | [COVID-19 Public dataset](https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-public-data-program?filter=solution-type:dataset&id=7d6cc408-53c8-4485-a187-b8cb9a5c0b56) on BigQuery |
 | hr_and_financials | GCS | gs://airbyte-vault/financial.csv | smart_open or gcfs | `{"type": "service_account", "private_key_id": "XXXXXXXX", ...}` | data from a private bucket, a service account is necessary |
-| landsat_index | GCS | gcp-public-data-landsat/index.csv.gz | smart_open | | Using smart_open, we don't need to specify the compression (note the gs:// is optional too, same for other providers) |
+| landsat_index | GCS | gcp-public-data-landsat/index.csv.gz | smart_open | | Using smart_open, we don't need to specify the compression (note the gs:// is optional too, same for other providers) |

 Examples with reader options:

-| Dataset Name | Storage | URL | Reader Impl | Reader Options | Description |
-| ------------- | ------- | ----------------------------------------------- | ----------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Dataset Name | Storage | URL | Reader Impl | Reader Options | Description |
+|---------------|---------|-------------------------------------------------|-------------|---------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
 | landsat_index | GCS | gs://gcp-public-data-landsat/index.csv.gz | GCFS | `{"compression": "gzip"}` | Additional reader options to specify a compression option to `read_csv` |
 | GDELT | S3 | s3://gdelt-open-data/events/20190914.export.csv | | `{"sep": "\t", "header": null}` | Here is TSV data separated by tabs without header row from [AWS Open Data](https://registry.opendata.aws/gdelt/) |
 | server_logs | local | /local/logs.log | | `{"sep": ";"}` | After making sure a local text file exists at `/tmp/airbyte_local/logs.log` with logs file from some server that are delimited by ';' delimiters |

 Example for SFTP:

-| Dataset Name | Storage | User | Password | Host | URL | Reader Options | Description |
-| ------------ | ------- | ---- | -------- | --------------- | ----------------------- | ----------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
+| Dataset Name | Storage | User | Password | Host | URL | Reader Options | Description |
+|--------------|---------|------|----------|-----------------|-------------------------|---------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|
 | Test Rebext | SFTP | demo | password | test.rebext.net | /pub/example/readme.txt | `{"sep": "\r\n", "header": null, "names": \["text"], "engine": "python"}` | We use `python` engine for `read_csv` in order to handle delimiter of more than 1 character while providing our own column names. |

 Please see (or add) more at `airbyte-integrations/connectors/source-file/integration_tests/integration_source_test.py` for further usages examples.
@@ -217,6 +217,7 @@ In order to read large files from a remote location, this connector uses the [sm

 | Version | Date | Pull Request | Subject |
 |:--------|:-----------|:---------------------------------------------------------|:--------------------------------------------------------------------------------------------------------|
+| 0.4.0 | 2024-02-15 | [32354](https://github.com/airbytehq/airbyte/pull/32354) | Add Zip File Support |
 | 0.3.17 | 2024-02-13 | [34678](https://github.com/airbytehq/airbyte/pull/34678) | Add Fixed-Width File Support |
 | 0.3.16 | 2024-02-12 | [35186](https://github.com/airbytehq/airbyte/pull/35186) | Manage dependencies with Poetry |
 | 0.3.15 | 2023-10-19 | [31599](https://github.com/airbytehq/airbyte/pull/31599) | Upgrade to airbyte/python-connector-base:1.0.1 |
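
The reader options in the docs tables above are written as JSON strings, which the docs describe as options for `read_csv`. As a small sketch of how such a string decodes in Python, the GDELT row's options can be parsed with the standard `json` module. The variable names are illustrative, and how the connector itself parses and forwards these options is not shown in this diff.

```python
import json

# Reader options exactly as written in the GDELT row of the docs table.
# The raw string keeps the backslash so that JSON, not Python, decodes "\t".
reader_options_json = r'{"sep": "\t", "header": null}'

reader_options = json.loads(reader_options_json)

# JSON "\t" decodes to a real tab character and JSON null decodes to None,
# so the result is a plain dict of keyword-style arguments.
assert reader_options == {"sep": "\t", "header": None}
```

Decoded this way, the options form an ordinary dictionary that could be unpacked as keyword arguments into a reader function.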
