`docs/integrations/sources/azure-blob-storage.md`
This page contains the setup guide and reference information for the Azure Blob Storage source connector.

:::info
Cloud storage may incur egress costs. Egress refers to data that is transferred out of the cloud storage system, such as when you download files or access them from a different location. For more information, see the [Azure Blob Storage pricing guide](https://azure.microsoft.com/en-us/pricing/details/storage/blobs/).
:::
## Prerequisites
- Tenant ID of the Microsoft Azure Application user
- Azure Blob Storage account name
- Azure Blob Storage container (bucket) name
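If you want to confirm that these values grant read access before configuring the connector, you can run a quick check with the Azure SDK for Python. This is a minimal sketch, assuming an app registration with a client ID and secret; all placeholder values are hypothetical:

```python
# Minimal read-access check using the azure-identity and azure-storage-blob
# packages. Placeholder values are hypothetical; substitute the Tenant ID,
# account name, and container name from the prerequisites above.
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<application-client-id>",
    client_secret="<client-secret>",
)
service = BlobServiceClient(
    account_url="https://<account-name>.blob.core.windows.net",
    credential=credential,
)
container = service.get_container_client("<container-name>")

# Listing a few blobs confirms the application can read the container.
for blob in list(container.list_blobs())[:5]:
    print(blob.name)
```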
<details>
<summary>
Minimum permissions (role [Storage Blob Data Reader](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/storage#storage-blob-data-reader)):
<details>
<summary>
Follow these steps to set up an IAM role:
</summary>
1. In the Azure portal, select the storage account (or container) you'd like to sync from and go to **Access Control (IAM)** -> **Role Assignment**.
2. In the left navigation bar, click **Sources**. In the top-right corner, click **+ New source**.
3. Find and select **Azure Blob Storage** from the list of available sources.
4. Enter the name of your Azure **Account**.
5. Enter your Tenant ID and click **Authenticate your Azure Blob Storage account**.
6. Log in and authorize the Azure Blob Storage account.
7. Enter the name of the **Container** containing your files to replicate.
8. Add a stream:
1. Enter the **File Type**.
2. In the **Format** box, use the dropdown menu to select the format of the files you'd like to replicate. The supported formats are **CSV**, **Parquet**, **Avro** and **JSONL**. Toggling the **Optional fields** button within the **Format** box will allow you to enter additional configurations based on the selected format. For a detailed breakdown of these settings, refer to the [File Format section](#file-format-settings) below.
3. Give the stream a **Name**.
4. (Optional) If you want to enforce a specific schema, you can enter an **Input schema**. By default, this value is set to `{}`, and the schema will be automatically inferred from the file\(s\) you are replicating. For details on providing a custom schema, refer to the [User Schema section](#user-schema).
5. (Optional) Enter the **Globs** pattern dictating which files to sync. This glob pattern allows Airbyte to match the specific files to replicate. If you are replicating all the files within your bucket, use `**` as the pattern. For more precise pattern matching options, refer to the [Path Patterns section](#path-patterns) below.
9. (Optional) Enter the endpoint to use for the data replication.
10. (Optional) Enter the desired start date from which to begin replicating data.
## Supported Streams
The Azure Blob Storage source connector supports the following [sync modes](https://docs.airbyte.com/cloud/core-concepts#connection-sync-modes):
Please let us know of any specific compression formats you'd like to see supported next!
### Path Patterns
\(tl;dr -> path pattern syntax using [wcmatch.glob](https://facelessuser.github.io/wcmatch/glob/). GLOBSTAR and SPLIT flags are enabled.\)
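If you'd like to preview which files a pattern will match before saving a stream, you can test it locally with the same library. A short sketch, using hypothetical file paths:

```python
# Sketch of pattern matching with wcmatch, using the GLOBSTAR and SPLIT flags
# mentioned above. SPLIT lets one pattern string hold several alternatives
# separated by "|".
from wcmatch import glob

FLAGS = glob.GLOBSTAR | glob.SPLIT
pattern = "myFolder/**/*.csv|myFolder/*.parquet"

print(glob.globmatch("myFolder/sub/part1.csv", pattern, flags=FLAGS))  # True
print(glob.globmatch("myFolder/data.parquet", pattern, flags=FLAGS))   # True
print(glob.globmatch("otherFolder/part1.csv", pattern, flags=FLAGS))   # False
```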
As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
### User Schema
Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the bucket matching your path pattern and suffix. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead.
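For illustration, the input schema is provided as a JSON map from column name to type, so a value such as `{"id": "integer", "name": "string"}` (hypothetical column names) would enforce exactly those two columns and types instead of relying on inference.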
#### CSV

Since CSV files are effectively plain text, providing specific reader options is often required to parse the files correctly. These settings are applied when a CSV is created or exported, so please ensure that this process happens consistently over time.
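As a quick illustration of why these options matter (this is not the connector's internal code), the same raw line parses very differently depending on the delimiter setting:

```python
# Illustration only: one raw line yields different columns depending on the
# reader options, which is why they must match how the CSV was exported.
import csv
import io

raw = 'id;name;"notes; with delimiter"\n'

default_reader = csv.reader(io.StringIO(raw))                  # delimiter=","
semicolon_reader = csv.reader(io.StringIO(raw), delimiter=";")

print(next(default_reader))    # one column: ['id;name;"notes; with delimiter"']
print(next(semicolon_reader))  # three columns: ['id', 'name', 'notes; with delimiter']
```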
- **True Values**: A set of case-sensitive strings that should be interpreted as true values.
#### Parquet
Apache Parquet is a column-oriented data storage format of the Apache Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. At the moment, partitioned Parquet datasets are unsupported. The following settings are available:
- **Convert Decimal Fields to Floats**: Whether to convert decimal fields to floats. There is a loss of precision when converting decimals to floats, so this is not recommended.
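The precision caveat is easy to demonstrate with Python's built-in `decimal` module (the values below are illustrative):

```python
# Why converting decimals to floats is discouraged: a float cannot represent
# every decimal value exactly, so high-precision digits are silently rounded.
from decimal import Decimal

exact = Decimal("1234567.8901234567890123456789")
as_float = float(exact)

print(exact)     # all 22 fractional digits preserved
print(as_float)  # rounded to roughly 15-17 significant digits
```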
#### Avro
The Avro parser uses the [Fastavro library](https://fastavro.readthedocs.io/en/latest/). The following settings are available:
- **Convert Double Fields to Strings**: Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision, because there can be a loss of precision when handling floating-point numbers.
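For reference, this is roughly what reading records with Fastavro looks like outside the connector; a minimal sketch with a hypothetical file name:

```python
# Minimal Fastavro usage sketch: iterate over the records of an Avro file.
# "example.avro" is a hypothetical local file.
from fastavro import reader

with open("example.avro", "rb") as fo:
    for record in reader(fo):
        print(record)  # each record is a plain Python dict
```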
#### Document File Type Format (Experimental)

The Document File Type Format is currently an experimental feature and not subject to SLAs. Use at your own risk.
## Performance considerations
The Azure Blob Storage connector should not encounter any [Microsoft API limitations](https://learn.microsoft.com/en-us/azure/storage/blobs/scalability-targets#scale-targets-for-blob-storage) under normal usage.