Commit d40262d

Merge pull request #5 from whyhow-ai/unstructured-integration
2 parents f601bfe + 7967a4e commit d40262d

11 files changed

+359
-75
lines changed

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+3.10

README.md

Lines changed: 61 additions & 0 deletions
@@ -6,6 +6,7 @@ Our goal is to provide a familiar, spreadsheet-like interface for business users
 
 [![MIT License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
 [![GitHub issues](https://img.shields.io/github/issues/whyhow-ai/knowledge-table)](https://github.com/whyhow-ai/knowledge-table/issues)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 
 For a limited demo, check out the [Knowledge Table Demo](https://knowledge-table-demo.whyhow.ai/).
 
@@ -98,10 +99,26 @@ The frontend can be accessed at `http://localhost:3000`, and the backend can be
 
 4. **Install the dependencies:**
 
+For basic installation:
 ```sh
 pip install .
 ```
 
+For installation with Unstructured support:
+```sh
+pip install .[unstructured]
+```
+
+For installation with development tools:
+```sh
+pip install .[dev]
+```
+
+For full installation (including Unstructured and dev tools):
+```sh
+pip install .[full]
+```
+
 5. **Start the backend:**
 
 ```sh
@@ -150,6 +167,27 @@ The frontend can be accessed at `http://localhost:3000`, and the backend can be
 
 ---
 
+## Development
+
+To set up the project for development:
+
+1. Clone the repository
+2. Install the project with development dependencies:
+```sh
+pip install .[dev]
+```
+3. Run tests:
+```sh
+pytest
+```
+4. Run linters:
+```sh
+flake8
+black .
+isort .
+```
+---
+
 ## Features
 
 ### Available in this repo:
@@ -205,6 +243,29 @@ Knowledge Table is built to be flexible and customizable, allowing you to extend
 
 ---
 
+## Optional Integrations
+
+### Unstructured API
+
+Knowledge Table offers optional integration with the Unstructured API for enhanced document processing capabilities. This integration allows for more advanced parsing and extraction from various document types.
+
+To use the Unstructured API integration:
+
+1. Sign up for an API key at [Unstructured.io](https://www.unstructured.io/).
+2. Set the `UNSTRUCTURED_API_KEY` environment variable in the `.env` file, or with your API key:
+```
+export UNSTRUCTURED_API_KEY=your_api_key_here
+```
+3. Install the project with Unstructured support:
+```
+pip install .[unstructured]
+```
+
+When the `UNSTRUCTURED_API_KEY` is set, Knowledge Table will automatically use the Unstructured API for document processing. If the key is not set or if there's an issue with the Unstructured API, the system will fall back to the default document loaders.
+
+Note: Usage of the Unstructured API may incur costs based on your plan with Unstructured.io.
+---
+
 ## Roadmap
 
 - [ ] Support for more LLMs
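
Note on the fallback behavior described in the README hunk above: the commit page does not show the loader code itself, so the following is only a minimal sketch of how "use the Unstructured API when `UNSTRUCTURED_API_KEY` is set, otherwise fall back to the default loaders" could be structured. The names `default_loader`, `remote_loader`, and `load_document` are hypothetical and are not taken from the repository; the remote call is a stand-in that always fails, so the fallback path can be seen without network access.

```python
import os
from typing import Callable, Mapping, Optional


def default_loader(content: bytes) -> str:
    """Hypothetical stand-in for the project's default document loaders."""
    return content.decode("utf-8", errors="replace")


def remote_loader(content: bytes, api_key: str) -> str:
    """Hypothetical stand-in for a call to the Unstructured API.

    It always fails here so that the fallback path is exercised.
    """
    raise ConnectionError("Unstructured API is not reachable in this sketch")


def load_document(
    content: bytes,
    env: Optional[Mapping[str, str]] = None,
    remote: Callable[[bytes, str], str] = remote_loader,
) -> str:
    """Prefer the remote loader when a key is configured, else fall back."""
    env = os.environ if env is None else env
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if api_key:
        try:
            return remote(content, api_key)
        except Exception:
            # Any problem with the remote service falls through to the default.
            pass
    return default_loader(content)
```

Passing `env` and `remote` explicitly keeps the selection logic testable without touching real environment variables or the network; a real integration would more likely read the key once at startup and log the fallback.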

backend/.env.sample

Lines changed: 6 additions & 1 deletion
@@ -17,4 +17,9 @@ MILVUS_DB_PASSWORD={your-milvus-password}
 # -------------------------
 # QUERY CONFIG
 # -------------------------
-QUERY_TYPE=hybrid
+QUERY_TYPE=hybrid
+
+# -------------------------
+# UNSTRUCTURED CONFIG
+# -------------------------
+UNSTRUCTURED_API_KEY={your-unstructured-api-key}

backend/CHANGELOG.md

Lines changed: 15 additions & 2 deletions
@@ -5,18 +5,31 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [Unreleased]
+## [0.1.1] - 2024-10-08
 
 ### Added
 
 - Added git workflows
 - Added issue templates
+- Integrated Unstructured API for enhanced document processing
+- Optional dependency groups in pyproject.toml for flexible installation
+- New `unstructured_loader` function for processing documents with Unstructured API
+- Error handling for Unstructured API import and usage
+
 
 ### Changed
 
+- Updated `upload_document` function to use Unstructured API when available
+- Modified project structure to support optional Unstructured integration
+- Updated installation instructions in README to reflect new dependency options
+
+
+### Fixed
+
 - Fixed issues for mypy, flake8, isort, black
+- Improved error handling in document processing pipeline
 
-## [v0.1.0]
+## [0.1.0]
 
 ### Added
 
backend/pyproject.toml

Lines changed: 23 additions & 2 deletions
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
 name = "knowledge-table-api"
 authors = [{name = "Foo Bar"}]
 description = "A Python API for the Knowledge Table service"
-keywords = ["knowlege", "api"]
+keywords = ["knowledge", "api"]
 classifiers = ["Programming Language :: Python :: 3"]
 requires-python = ">=3.10"
 dependencies = [
@@ -93,15 +93,20 @@ dependencies = [
 "typing_extensions==4.12.2",
 "tzdata==2024.1",
 "ujson==5.10.0",
+"unstructured",
 "urllib3==2.2.3",
 "uuid==1.30",
 "uvicorn==0.30.6",
-"whyhow==0.1.7",
+"whyhow",
 "yarl==1.11.1",
 ]
 dynamic = ["version"]
 
 [project.optional-dependencies]
+unstructured = [
+"unstructured[all-docs]",
+]
+
 dev = [
 "bandit[toml]",
 "black",
@@ -117,6 +122,22 @@ dev = [
 "pytest-html",
 ]
 
+full = [
+"unstructured[all-docs]",
+"bandit[toml]",
+"black",
+"flake8",
+"flake8-docstrings",
+"isort",
+"mypy",
+"pydocstyle[toml]",
+"pytest-asyncio",
+"pytest-cov",
+"pytest-httpx",
+"pytest",
+"pytest-html",
+]
+
 [project.scripts]
 knowledge-table-locate = "knowledge_table_api.main:locate"
 

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 """Knowledge Table API package."""
 
-__version__ = "v0.1.0"
+__version__ = "v0.1.1"

backend/src/knowledge_table_api/routers/document.py

Lines changed: 75 additions & 17 deletions
@@ -1,6 +1,8 @@
 """Document router."""
 
-from fastapi import APIRouter, File, UploadFile
+from typing import Dict
+
+from fastapi import APIRouter, File, HTTPException, UploadFile, status
 
 from knowledge_table_api.models.document import Document
 from knowledge_table_api.services.document import upload_document
@@ -9,35 +11,91 @@
 router = APIRouter(tags=["Document"], prefix="/document")
 
 
-@router.post("", response_model=Document)
+@router.post("", response_model=Document, status_code=status.HTTP_201_CREATED)
 async def upload_document_endpoint(
     file: UploadFile = File(...),
-) -> Document | dict[str, str]:
-    """Upload a document and process it."""
+) -> Document:
+    """
+    Upload a document and process it.
+
+    Parameters
+    ----------
+    file : UploadFile
+        The file to be uploaded and processed.
+
+    Returns
+    -------
+    Document
+        The processed document information.
+
+    Raises
+    ------
+    HTTPException
+        If the file name is missing or if an error occurs during processing.
+    """
     if file.filename is None:
-        return {"message": "File name is missing"}
-    document_id = await upload_document(
-        file.content_type, file.filename, await file.read()
-    )
+        raise HTTPException(
+            status_code=status.HTTP_400_BAD_REQUEST,
+            detail="File name is missing",
+        )
+
+    try:
+        document_id = await upload_document(
+            file.content_type, file.filename, await file.read()
+        )
+    except ValueError as ve:
+        raise HTTPException(
+            status_code=status.HTTP_400_BAD_REQUEST, detail=str(ve)
+        )
+    except Exception as e:
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=str(e)
+        )
+
     if document_id is None:
-        return {"message": "An error occurred while processing the document"}
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail="An error occurred while processing the document",
+        )
 
+    # TODO: Fetch actual document details from a database
     document = Document(
         id=document_id,
         name=file.filename,
-        author="author_name",
-        tag="document_tag",
-        page_count=10,
+        author="author_name",  # TODO: Determine this dynamically
+        tag="document_tag",  # TODO: Determine this dynamically
+        page_count=10,  # TODO: Determine this dynamically
     )
     return document
 
 
-@router.delete("/{document_id}", response_model=dict)
-async def delete_document_endpoint(document_id: str) -> dict[str, str]:
-    """Delete a document."""
-    delete_document_response = await delete_document(document_id)
+@router.delete("/{document_id}", response_model=Dict[str, str])
+async def delete_document_endpoint(document_id: str) -> Dict[str, str]:
+    """
+    Delete a document.
+
+    Parameters
+    ----------
+    document_id : str
+        The ID of the document to be deleted.
+
+    Returns
+    -------
+    Dict[str, str]
+        A dictionary containing the deletion status and message.
+
+    Raises
+    ------
+    HTTPException
+        If an error occurs during the deletion process.
+    """
+    try:
+        delete_document_response = await delete_document(document_id)
+    except Exception as e:
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=str(e)
+        )
 
-    # Delete the document
     return {
         "id": document_id,
         "status": delete_document_response["status"],

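Note on the error handling introduced in `upload_document_endpoint`: the order of the checks determines which status code a client sees. The following framework-free sketch restates that order using only the standard library, so it can be run without FastAPI. The function name `classify_upload_result` and its parameters are hypothetical and exist only for illustration; the status codes and the two fixed messages are the ones that appear in the diff above.

```python
from http import HTTPStatus
from typing import Optional, Tuple


def classify_upload_result(
    filename: Optional[str],
    error: Optional[BaseException] = None,
    document_id: Optional[str] = None,
) -> Tuple[HTTPStatus, str]:
    """Return the (status, detail) pair the endpoint would produce.

    `error` models an exception raised by the upload service, and
    `document_id` models its return value when no exception occurs.
    """
    if filename is None:
        return HTTPStatus.BAD_REQUEST, "File name is missing"
    if isinstance(error, ValueError):
        # Validation problems are reported as client errors.
        return HTTPStatus.BAD_REQUEST, str(error)
    if error is not None:
        # Anything else is treated as a server-side failure.
        return HTTPStatus.INTERNAL_SERVER_ERROR, str(error)
    if document_id is None:
        return (
            HTTPStatus.INTERNAL_SERVER_ERROR,
            "An error occurred while processing the document",
        )
    return HTTPStatus.CREATED, "created"
```

One consequence visible in this form: because the generic branch forwards `str(error)` as the detail, internal exception messages reach the client, which may or may not be intended.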