
Commit a897e8b

Merge pull request #56 from whyhow-ai/55-feature-add-replace-rule
Add resolve entity rules
2 parents 4edb682 + 7fb0077 commit a897e8b

File tree

19 files changed

+750
-70
lines changed


README.md

Lines changed: 16 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -10,10 +10,8 @@ Our goal is to provide a familiar, spreadsheet-like interface for business users
1010

1111
For a limited demo, check out the [Knowledge Table Demo](https://knowledge-table-demo.whyhow.ai/).
1212

13-
1413
https://github.com/user-attachments/assets/8e0e5cc6-6468-4bb5-888c-6b552e15b58a
1514

16-
1715
To learn more about WhyHow and our projects, visit our [website](https://whyhow.ai/).
1816

1917
## Table of Contents
@@ -102,11 +100,13 @@ The frontend can be accessed at `http://localhost:3000`, and the backend can be
102100
4. **Install the dependencies:**
103101

104102
For basic installation:
103+
105104
```sh
106105
pip install .
107106
```
108107

109108
For installation with development tools:
109+
110110
```sh
111111
pip install .[dev]
112112
```
@@ -180,6 +180,7 @@ To set up the project for development:
180180
black .
181181
isort .
182182
```
183+
183184
---
184185

185186
## Features
@@ -189,12 +190,12 @@ To set up the project for development:
189190
- **Chunk Linking** - Link raw source text chunks to the answers for traceability and provenance.
190191
- **Extract with natural language** - Use natural language queries to extract structured data from unstructured documents.
191192
- **Customizable extraction rules** - Define rules to guide the extraction process and ensure data quality.
192-
- **Custom formatting** - Control the output format of your extracted data.
193+
- **Custom formatting** - Control the output format of your extracted data. Knowledge Table currently supports text, list of text, number, list of numbers, and boolean formats.
193194
- **Filtering** - Filter documents based on metadata or extracted data.
194195
- **Exporting as CSV or Triples** - Download extracted data as CSV or graph triples.
195196
- **Chained extraction** - Reference previous columns in your extraction questions using @ i.e. "What are the treatments for `@disease`?".
196197
- **Split Cell Into Rows** - Split a List of Numbers or List of Values cell into individual rows to support more complex Chained Extraction.
197-
198+
198199
---
199200

200201
## Concepts
@@ -211,6 +212,15 @@ Each **document** is an unstructured data source (e.g., a contract, article, or
211212

212213
A **Question** is the core mechanism for guiding extraction. It defines what data you want to extract from a document.
213214

215+
### Rule
216+
217+
A **Rule** guides the extraction from the LLM. You can add rules on a column level or on a global level. Currently, the following rule types are supported:
218+
219+
- **May Return** rules give the LLM examples of answers that can be used to guide the extraction. This is a great way to give more guidance for the LLM on the type of things it should keep an eye out for.
220+
- **Must Return** rules give the LLM an exhaustive list of answers that are allowed to be returned. This is a great way to give guardrails for the LLM to ensure only certain terms are returned.
221+
- **Allowed # of Responses** rules provide guardrails when there may be a range of potential 'grey-area' answers and you want to restrict the output to a fixed number of the top responses.
222+
- **Resolve Entity** rules allow you to resolve values to a specific entity. This is useful for ensuring output conforms to a specific entity type. For example, you can write rules that ensure "blackrock", "Blackrock, Inc.", and "Blackrock Corporation" all resolve to the same entity - "Blackrock".
223+
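The Resolve Entity behaviour described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes each rule option is an `original:resolved` string, and the helper names `parse_resolve_options` and `resolve_entities` are hypothetical. Unlike the service code in this commit, it trims whitespace, tries longer keys first, and uses look-arounds instead of `\b` so that keys ending in punctuation still match.

```python
import re
from typing import Dict, List


def parse_resolve_options(options: List[str]) -> Dict[str, str]:
    """Turn 'original:resolved' strings into a lookup table (assumed option format)."""
    replacements: Dict[str, str] = {}
    for option in options:
        # Split on the first colon only; skip options that have no colon.
        original, sep, resolved = option.partition(":")
        if not sep:
            continue
        replacements[original.strip()] = resolved.strip()
    return replacements


def resolve_entities(text: str, replacements: Dict[str, str]) -> str:
    """Replace every known variant in `text` with its resolved entity name."""
    if not replacements:
        return text
    # Longest keys first so "Blackrock Corporation" wins over shorter variants.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = "|".join(re.escape(key) for key in keys)
    # Look-arounds act as word boundaries that also work next to punctuation.
    regex = re.compile(rf"(?<!\w)({pattern})(?!\w)")
    return regex.sub(lambda match: replacements[match.group(1)], text)


options = [
    "blackrock:Blackrock",
    "Blackrock, Inc.:Blackrock",
    "Blackrock Corporation:Blackrock",
]
table = parse_resolve_options(options)
variants = ["blackrock", "Blackrock, Inc.", "Blackrock Corporation"]
resolved = [resolve_entities(variant, table) for variant in variants]
```

With these options, all three variants in `variants` resolve to the same string, `"Blackrock"`.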
214224
---
215225

216226
## Practical Usage
@@ -225,6 +235,7 @@ Once you've set up your questions, rules, and documents, the Knowledge Table pro
225235
- **Metadata Generation**: Classify and tag information about your documents and files by running targeted questions against the files (i.e. "What project is this email thread about?")
226236

227237
---
238+
228239
## Export to Triples
229240

230241
To create the Schema for the Triples, we use an LLM to consider the Entity Type of the Column, the question that was used to generate the cells, and the values themselves, to create the schema and the triples. The document name is inserted as a node property. The vector chunk ids are also included in the JSON file of the triples, and tied to the triples created.
@@ -263,8 +274,7 @@ To use the Unstructured API integration:
263274

264275
When the `UNSTRUCTURED_API_KEY` is set, Knowledge Table will automatically use the Unstructured API for document processing. If the key is not set or if there's an issue with the Unstructured API, the system will fall back to the default document loaders.
265276

266-
Note: Usage of the Unstructured API may incur costs based on your plan with Unstructured.io.
267-
---
277+
## Note: Usage of the Unstructured API may incur costs based on your plan with Unstructured.io.
268278

269279
## Roadmap
270280

backend/CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1212
- Added support for queries without source data in vector database
1313
- Graceful failure of triple export when no chunks are found
1414
- Tested Qdrant vector database service
15+
- Added resolve entity rule
1516

1617
### Changed
1718

backend/src/app/api/v1/endpoints/query.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -128,8 +128,11 @@ async def run_query(
128128
answer=query_response.answer,
129129
type=request.prompt.type,
130130
)
131+
# Include resolved_entities in the response
131132
response_data = QueryAnswerResponse(
132-
answer=answer, chunks=query_response.chunks
133+
answer=answer,
134+
chunks=query_response.chunks,
135+
resolved_entities=query_response.resolved_entities,
133136
)
134137

135138
return response_data

backend/src/app/models/query_core.py

Lines changed: 24 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,10 +5,33 @@
55
from pydantic import BaseModel
66

77

8+
class EntitySource(BaseModel):
9+
"""Entity source model."""
10+
11+
type: Literal["column", "global"]
12+
id: str
13+
14+
15+
class ResolvedEntity(BaseModel):
16+
"""Resolved entity model."""
17+
18+
original: Union[str, List[str]]
19+
resolved: Union[str, List[str]]
20+
source: EntitySource
21+
entityType: str
22+
23+
24+
class TransformationDict(BaseModel):
25+
"""Transformation dictionary model."""
26+
27+
original: Union[str, List[str]]
28+
resolved: Union[str, List[str]]
29+
30+
831
class Rule(BaseModel):
932
"""Rule model."""
1033

11-
type: Literal["must_return", "may_return", "max_length"]
34+
type: Literal["must_return", "may_return", "max_length", "resolve_entity"]
1235
options: Optional[List[str]] = None
1336
length: Optional[int] = None
1437

backend/src/app/schemas/query_api.py

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,15 @@
77
from app.models.query_core import Chunk, FormatType, Rule
88

99

10+
class ResolvedEntitySchema(BaseModel):
11+
"""Schema for resolved entity transformations."""
12+
13+
original: Union[str, List[str]]
14+
resolved: Union[str, List[str]]
15+
source: dict[str, str]
16+
entityType: str
17+
18+
1019
class QueryPromptSchema(BaseModel):
1120
"""Schema for the prompt part of the query request."""
1221

@@ -39,6 +48,7 @@ class QueryResult(BaseModel):
3948

4049
answer: Any
4150
chunks: List[Chunk]
51+
resolved_entities: Optional[List[ResolvedEntitySchema]] = None
4252

4353

4454
class QueryResponseSchema(BaseModel):
@@ -50,6 +60,7 @@ class QueryResponseSchema(BaseModel):
5060
answer: Optional[Any] = None
5161
chunks: List[Chunk]
5262
type: str
63+
resolved_entities: Optional[List[ResolvedEntitySchema]] = None
5364

5465

5566
class QueryAnswer(BaseModel):
@@ -67,6 +78,7 @@ class QueryAnswerResponse(BaseModel):
6778

6879
answer: QueryAnswer
6980
chunks: List[Chunk]
81+
resolved_entities: Optional[List[ResolvedEntitySchema]] = None
7082

7183

7284
# Type for search responses (used in service layer)
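
The `resolved_entities` field added to the response schemas above could carry a payload shaped roughly like the one built below. This is a sketch under assumptions: `ResolvedEntityRecord` is a hypothetical dataclass stand-in that mirrors the field names of `ResolvedEntitySchema` without pydantic validation, the `build_resolved_entities` helper is invented for illustration, and the `col-1` and `Company` values are made up. The sketch returns `None` when nothing changed, which is a design choice of this example and not necessarily what the service does.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional, Union

TextOrList = Union[str, List[str]]


@dataclass
class ResolvedEntityRecord:
    """Stand-in for ResolvedEntitySchema: same field names, no validation."""

    original: TextOrList
    resolved: TextOrList
    source: Dict[str, str]
    entityType: str


def build_resolved_entities(
    original: TextOrList,
    resolved: TextOrList,
    source: Dict[str, str],
    entity_type: str,
) -> Optional[List[dict]]:
    """Return a one-item list when the answer changed, otherwise None."""
    if original == resolved:
        return None
    record = ResolvedEntityRecord(original, resolved, source, entity_type)
    return [asdict(record)]


payload = {
    "answer": ["Blackrock", "Vanguard"],
    "resolved_entities": build_resolved_entities(
        original=["Blackrock, Inc.", "Vanguard"],
        resolved=["Blackrock", "Vanguard"],
        source={"type": "column", "id": "col-1"},  # illustrative values
        entity_type="Company",  # illustrative value
    ),
}
```

Keeping both `original` and `resolved` in the payload lets a client show what was changed and by which rule source.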

backend/src/app/services/query_service.py

Lines changed: 147 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,15 @@
11
"""Query service."""
22

33
import logging
4-
from typing import Any, Awaitable, Callable, List
4+
import re
5+
from typing import Any, Awaitable, Callable, Dict, List, Union
56

67
from app.models.query_core import Chunk, FormatType, QueryType, Rule
7-
from app.schemas.query_api import QueryResult, SearchResponse
8+
from app.schemas.query_api import (
9+
QueryResult,
10+
ResolvedEntitySchema,
11+
SearchResponse,
12+
)
813
from app.services.llm_service import (
914
CompletionService,
1015
generate_inferred_response,
@@ -38,6 +43,65 @@ def extract_chunks(search_response: SearchResponse) -> List[Chunk]:
3843
)
3944

4045

46+
def replace_keywords(
47+
text: Union[str, List[str]], keyword_replacements: Dict[str, str]
48+
) -> tuple[
49+
Union[str, List[str]], Dict[str, Union[str, List[str]]]
50+
]: # Changed return type
51+
"""Replace keywords in text and return both the modified text and transformation details."""
52+
if not text or not keyword_replacements:
53+
return text, {
54+
"original": text,
55+
"resolved": text,
56+
} # Return dict instead of TransformationDict
57+
58+
# Handle list of strings
59+
if isinstance(text, list):
60+
original_text = text.copy()
61+
result = []
62+
modified = False
63+
64+
# Create a single regex pattern for all keywords
65+
pattern = "|".join(map(re.escape, keyword_replacements.keys()))
66+
regex = re.compile(f"\\b({pattern})\\b")
67+
68+
for item in text:
69+
# Single pass replacement for all keywords
70+
new_item = regex.sub(
71+
lambda m: keyword_replacements[m.group()], item
72+
)
73+
result.append(new_item)
74+
if new_item != item:
75+
modified = True
76+
77+
if modified:
78+
return result, {"original": original_text, "resolved": result}
79+
return result, {"original": original_text, "resolved": result}
80+
81+
# Handle single string
82+
return replace_keywords_in_string(text, keyword_replacements)
83+
84+
85+
def replace_keywords_in_string(
86+
text: str, keyword_replacements: Dict[str, str]
87+
) -> tuple[str, Dict[str, Union[str, List[str]]]]: # Changed return type
88+
"""Replace keywords in a single string and return the transformation details."""
89+
if not text:
90+
return text, {"original": text, "resolved": text}
91+
92+
# Create a single regex pattern for all keywords
93+
pattern = "|".join(map(re.escape, keyword_replacements.keys()))
94+
regex = re.compile(f"\\b({pattern})\\b")
95+
96+
# Single pass replacement
97+
result = regex.sub(lambda m: keyword_replacements[m.group()], text)
98+
99+
# Only return transformation if something changed
100+
if result != text:
101+
return result, {"original": text, "resolved": result}
102+
return text, {"original": text, "resolved": text}
103+
104+
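The single-pass pattern used by the two helpers above has a few matching properties worth seeing in isolation. The sketch below rebuilds the same regular expression with an illustrative replacement table (the keys and sample strings are made up) and shows that matching is case-sensitive, respects word boundaries, and does not fire for a key ending in punctuation when it sits at the end of the text, because `\b` needs a word character on one side.

```python
import re

# Illustrative replacement table; not taken from the project.
replacements = {"NYC": "New York City", "Inc.": "Incorporated"}

# Same construction as in the service code: escape keys, join, wrap in \b.
pattern = "|".join(map(re.escape, replacements.keys()))
regex = re.compile(f"\\b({pattern})\\b")


def apply(text: str) -> str:
    """Replace every matched keyword in one pass over the text."""
    return regex.sub(lambda m: replacements[m.group()], text)


replaced = apply("Offices in NYC and Boston")  # whole-word match is replaced
lowercase = apply("nyc office")                # unchanged: match is case-sensitive
embedded = apply("NYCFC kit")                  # unchanged: no boundary inside a word
trailing_dot = apply("Acme Inc.")              # unchanged: no \b after the final "."
```

If keys such as `"Blackrock, Inc."` need to match, one option is to replace the `\b` anchors with `(?<!\w)` and `(?!\w)` look-arounds; whether that fits the project's intent is a judgment call.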
41105
async def process_query(
42106
query_type: QueryType,
43107
query: str,
@@ -59,14 +123,69 @@ async def process_query(
59123
)
60124
answer_value = answer["answer"]
61125

62-
result_chunks = (
63-
[]
64-
if answer_value in ("not found", None)
65-
and query_type != "decomposition"
66-
else chunks
67-
)
126+
transformations: Dict[str, Union[str, List[str]]] = {
127+
"original": "",
128+
"resolved": "",
129+
}
130+
131+
result_chunks = []
132+
133+
if format in ["str", "str_array"]:
134+
135+
# Extract and apply keyword replacements from all resolve_entity rules
136+
resolve_entity_rules = [
137+
rule for rule in rules if rule.type == "resolve_entity"
138+
]
139+
140+
result_chunks = (
141+
[]
142+
if answer_value in ("not found", None)
143+
and query_type != "decomposition"
144+
else chunks
145+
)
68146

69-
return QueryResult(answer=answer_value, chunks=result_chunks[:10])
147+
# First populate the replacements dictionary
148+
replacements: Dict[str, str] = {}
149+
if resolve_entity_rules and answer_value:
150+
for rule in resolve_entity_rules:
151+
if rule.options:
152+
rule_replacements = dict(
153+
option.split(":", 1) for option in rule.options
154+
)
155+
replacements.update(rule_replacements)
156+
157+
# Then apply the replacements if we have any
158+
if replacements:
159+
print(f"Resolving entities in answer: {answer_value}")
160+
if isinstance(answer_value, list):
161+
transformed_list, transform_dict = replace_keywords(
162+
answer_value, replacements
163+
)
164+
transformations = transform_dict
165+
answer_value = transformed_list
166+
else:
167+
transformed_value, transform_dict = replace_keywords(
168+
answer_value, replacements
169+
)
170+
transformations = transform_dict
171+
answer_value = transformed_value
172+
173+
return QueryResult(
174+
answer=answer_value,
175+
chunks=result_chunks[:10],
176+
resolved_entities=(
177+
[
178+
ResolvedEntitySchema(
179+
original=transformations["original"],
180+
resolved=transformations["resolved"],
181+
source={"type": "column", "id": "some-id"},
182+
entityType="some-type",
183+
)
184+
]
185+
if transformations["original"] or transformations["resolved"]
186+
else None
187+
),
188+
)
70189

71190

72191
# Convenience functions for specific query types
@@ -145,4 +264,23 @@ async def inference_query(
145264
)
146265
answer_value = answer["answer"]
147266

267+
# Extract and apply keyword replacements from all resolve_entity rules
268+
resolve_entity_rules = [
269+
rule for rule in rules if rule.type == "resolve_entity"
270+
]
271+
272+
if resolve_entity_rules and answer_value:
273+
# Combine all replacements from all resolve_entity rules
274+
replacements = {}
275+
for rule in resolve_entity_rules:
276+
if rule.options:
277+
rule_replacements = dict(
278+
option.split(":", 1) for option in rule.options
279+
)
280+
replacements.update(rule_replacements)
281+
282+
if replacements:
283+
print(f"Resolving entities in answer: {answer_value}")
284+
answer_value, _ = replace_keywords(answer_value, replacements)
285+
148286
return QueryResult(answer=answer_value, chunks=[])
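
Both query paths above build one replacement table from every `resolve_entity` rule. The sketch below shows that merging step on its own, assuming options are `original:resolved` strings. The helper `merge_resolve_options` is hypothetical. It differs from the diff in that it tolerates options with extra colons or no colon, whereas `dict(option.split(":") ...)` raises `ValueError` for them.

```python
from typing import Dict, List, Optional


def merge_resolve_options(
    rule_options: List[Optional[List[str]]],
) -> Dict[str, str]:
    """Merge options from several rules; later rules override earlier ones."""
    merged: Dict[str, str] = {}
    for options in rule_options:
        if not options:
            continue  # a rule without options contributes nothing
        for option in options:
            # Split on the first colon only, so the resolved value may contain ":".
            original, sep, resolved = option.partition(":")
            if sep:
                merged[original] = resolved
    return merged


merged = merge_resolve_options(
    [["NYC:New York", "LA:Los Angeles"], None, ["NYC:New York City"]]
)
with_colon = merge_resolve_options([["ratio:1:2"]])

# The unbounded split yields three parts here, which dict() rejects.
try:
    dict(option.split(":") for option in ["ratio:1:2"])
    strict_split_failed = False
except ValueError:
    strict_split_failed = True
```

Because later entries overwrite earlier ones, rule order determines which resolution wins when two rules map the same original value.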
