Commit a05d84f

Docs: add Unstructured.io blurb to S3 and Google Drive source connectors (#32413)
1 parent 6269b7f commit a05d84f

3 files changed: +12 -2 lines changed


docs/integrations/sources/azure-blob-storage.md

+4 -1

@@ -207,7 +207,10 @@ The Document File Type Format is a special format that allows you to extract tex
 
 One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination as each destination has different limitations for string fields.
 
-To perform the text extraction from PDF and Docx files, the connector uses the [Unstructured](https://pypi.org/project/unstructured/) Python library.
+#### Parsing via Unstructured.io Python Library
+
+This connector utilizes the open source [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#product-offerings) library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the [Unstructured docs](https://unstructured-io.github.io/unstructured/core/partition.html) and you can learn about other Unstructured tools and services at [www.unstructured.io](https://www.unstructured.io).
+
 </FieldAnchor>
 
 ## Changelog

docs/integrations/sources/google-drive.md

+4

@@ -243,6 +243,10 @@ One record will be emitted for each document. Keep in mind that large files can
 
 Before parsing each document, the connector exports Google Document files to Docx format internally. Google Sheets, Google Slides, and drawings are internally exported and parsed by the connector as PDFs.
 
+#### Parsing via Unstructured.io Python Library
+
+This connector utilizes the open source [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#product-offerings) library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the [Unstructured docs](https://unstructured-io.github.io/unstructured/core/partition.html) and you can learn about other Unstructured tools and services at [www.unstructured.io](https://www.unstructured.io).
+
 ## Changelog
 
 | Version | Date | Pull Request | Subject |

docs/integrations/sources/s3.md

+4 -1

@@ -318,7 +318,10 @@ The Document File Type Format is a special format that allows you to extract tex
 
 One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination as each destination has different limitations for string fields.
 
-To perform the text extraction from PDF and Docx files, the connector uses the [Unstructured](https://pypi.org/project/unstructured/) Python library.
+#### Parsing via Unstructured.io Python Library
+
+This connector utilizes the open source [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#product-offerings) library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the [Unstructured docs](https://unstructured-io.github.io/unstructured/core/partition.html) and you can learn about other Unstructured tools and services at [www.unstructured.io](https://www.unstructured.io).
+
 </FieldAnchor>
 
 ## Changelog
