whyhow-ai
diff --git a/README.md b/README.md
Lines changed: 16 additions & 6 deletions
diff --git a/backend/CHANGELOG.md b/backend/CHANGELOG.md
Lines changed: 1 addition & 0 deletions
diff --git a/backend/src/app/api/v1/endpoints/query.py b/backend/src/app/api/v1/endpoints/query.py
Lines changed: 4 additions & 1 deletion
diff --git a/backend/src/app/models/query_core.py b/backend/src/app/models/query_core.py
Lines changed: 24 additions & 1 deletion
diff --git a/backend/src/app/schemas/query_api.py b/backend/src/app/schemas/query_api.py
Lines changed: 12 additions & 0 deletions
diff --git a/backend/src/app/services/query_service.py b/backend/src/app/services/query_service.py
Lines changed: 147 additions & 9 deletions
@@ -10,10 +10,8 @@ Our goal is to provide a familiar, spreadsheet-like interface for business users
 
 For a limited demo, check out the [Knowledge Table Demo](https://knowledge-table-demo.whyhow.ai/).
 
-
 https://github.com/user-attachments/assets/8e0e5cc6-6468-4bb5-888c-6b552e15b58a
 
-
 To learn more about WhyHow and our projects, visit our [website](https://whyhow.ai/).
 
 ## Table of Contents
@@ -102,11 +100,13 @@ The frontend can be accessed at `http://localhost:3000`, and the backend can be
 4. **Install the dependencies:**
 
    For basic installation:
+
    ```sh
    pip install .
    ```
 
    For installation with development tools:
+
    ```sh
    pip install .[dev]
    ```
@@ -180,6 +180,7 @@ To set up the project for development:
    black .
    isort .
    ```
+
 ---
 
 ## Features
@@ -189,12 +190,12 @@ To set up the project for development:
 - **Chunk Linking** - Link raw source text chunks to the answers for traceability and provenance.
 - **Extract with natural language** - Use natural language queries to extract structured data from unstructured documents.
 - **Customizable extraction rules** - Define rules to guide the extraction process and ensure data quality.
-- **Custom formatting** - Control the output format of your extracted data.
+- **Custom formatting** - Control the output format of your extracted data. Knowledge Table currently supports text, list of text, number, list of numbers, and boolean formats.
 - **Filtering** - Filter documents based on metadata or extracted data.
 - **Exporting as CSV or Triples** - Download extracted data as CSV or graph triples.
 - **Chained extraction** - Reference previous columns in your extraction questions using @ i.e. "What are the treatments for `@disease`?".
 - **Split Cell Into Rows** - Turn outputs within a single cell from List of Numbers or List of Values and split it into individual rows to do more complex Chained Extraction
- 
+
 ---
 
 ## Concepts
@@ -211,6 +212,15 @@ Each **document** is an unstructured data source (e.g., a contract, article, or
 
 A **Question** is the core mechanism for guiding extraction. It defines what data you want to extract from a document.
 
+### Rule
+
+A **Rule** guides the LLM's extraction. You can add rules at the column level or at the global level. Currently, the following rule types are supported:
+
+- **May Return** rules give the LLM examples of possible answers to guide the extraction. Use them to hint at the kinds of values the LLM should look for.
+- **Must Return** rules give the LLM an exhaustive list of the answers it is allowed to return. Use them as guardrails so that only certain terms are returned.
+- **Allowed # of Responses** rules cap the number of answers returned. Use them when there may be a range of "grey-area" answers and you want only a fixed number of the top responses.
+- **Resolve Entity** rules resolve values to a specific entity, so that different spellings of the same name are output consistently. For example, you can write rules that ensure "blackrock", "Blackrock, Inc.", and "Blackrock Corporation" all resolve to the same entity, "Blackrock".
+
 ---
 
 ## Practical Usage
@@ -225,6 +235,7 @@ Once you've set up your questions, rules, and documents, the Knowledge Table pro
 - **Metadata Generation**: Classify and tag information about your documents and files by running targeted questions against the files (i.e. "What project is this email thread about?")
 
 ---
+
 ## Export to Triples
 
 To create the Schema for the Triples, we use an LLM to consider the Entity Type of the Column, the question that was used to generate the cells, and the values themselves, to create the schema and the triples. The document name is inserted as a node property. The vector chunk ids are also included in the JSON file of the triples, and tied to the triples created.
@@ -263,8 +274,7 @@ To use the Unstructured API integration:
 
 When the `UNSTRUCTURED_API_KEY` is set, Knowledge Table will automatically use the Unstructured API for document processing. If the key is not set or if there's an issue with the Unstructured API, the system will fall back to the default document loaders.
 
-Note: Usage of the Unstructured API may incur costs based on your plan with Unstructured.io.
----
+## Note: Usage of the Unstructured API may incur costs based on your plan with Unstructured.io.
 
 ## Roadmap
 
 
@@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Added support for queries without source data in vector database
 - Graceful failure of triple export when no chunks are found
 - Tested Qdrant vector database service
+- Added resolve entity rule
 
 ### Changed
 
 
@@ -128,8 +128,11 @@ async def run_query(
             answer=query_response.answer,
             type=request.prompt.type,
         )
+        # Include resolved_entities in the response
         response_data = QueryAnswerResponse(
-            answer=answer, chunks=query_response.chunks
+            answer=answer,
+            chunks=query_response.chunks,
+            resolved_entities=query_response.resolved_entities,
         )
 
         return response_data
 
@@ -5,10 +5,33 @@
 from pydantic import BaseModel
 
 
+class EntitySource(BaseModel):
+    """Entity source model."""
+
+    type: Literal["column", "global"]
+    id: str
+
+
+class ResolvedEntity(BaseModel):
+    """Resolved entity model."""
+
+    original: Union[str, List[str]]
+    resolved: Union[str, List[str]]
+    source: EntitySource
+    entityType: str
+
+
+class TransformationDict(BaseModel):
+    """Transformation dictionary model."""
+
+    original: Union[str, List[str]]
+    resolved: Union[str, List[str]]
+
+
 class Rule(BaseModel):
     """Rule model."""
 
-    type: Literal["must_return", "may_return", "max_length"]
+    type: Literal["must_return", "may_return", "max_length", "resolve_entity"]
     options: Optional[List[str]] = None
     length: Optional[int] = None
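
Review note: the `Rule` model above has no dedicated field for entity mappings, and the service code in this PR reads each `resolve_entity` option as an `alias:canonical` string. The sketch below shows one defensive way to turn those options into a replacements dictionary. It is an illustration, not the PR's implementation: the `Rule` dataclass is a stand-in for the Pydantic model, and `build_replacements` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rule:
    """Stand-in for the Pydantic Rule model (illustration only)."""

    type: str
    options: Optional[List[str]] = None
    length: Optional[int] = None


def build_replacements(rules: List[Rule]) -> Dict[str, str]:
    """Collect alias -> canonical pairs from all resolve_entity rules."""
    replacements: Dict[str, str] = {}
    for rule in rules:
        if rule.type != "resolve_entity" or not rule.options:
            continue
        for option in rule.options:
            # Split on the first colon only, so a canonical name may contain ":".
            alias, sep, canonical = option.partition(":")
            alias, canonical = alias.strip(), canonical.strip()
            if not sep or not alias or not canonical:
                continue  # skip malformed entries instead of raising
            replacements[alias] = canonical
    return replacements


rules = [
    Rule(type="resolve_entity", options=["blackrock:Blackrock", "no-colon"]),
    Rule(type="must_return", options=["a:b"]),
]
print(build_replacements(rules))
```

Using `str.partition` avoids the `ValueError` that `dict(...)` raises when an option contains no colon or more than one. Whether malformed options should be skipped silently or reported to the caller is a design choice left open here.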
 
 
@@ -7,6 +7,15 @@
 from app.models.query_core import Chunk, FormatType, Rule
 
 
+class ResolvedEntitySchema(BaseModel):
+    """Schema for resolved entity transformations."""
+
+    original: Union[str, List[str]]
+    resolved: Union[str, List[str]]
+    source: dict[str, str]
+    entityType: str
+
+
 class QueryPromptSchema(BaseModel):
     """Schema for the prompt part of the query request."""
 
@@ -39,6 +48,7 @@ class QueryResult(BaseModel):
 
     answer: Any
     chunks: List[Chunk]
+    resolved_entities: Optional[List[ResolvedEntitySchema]] = None
 
 
 class QueryResponseSchema(BaseModel):
@@ -50,6 +60,7 @@ class QueryResponseSchema(BaseModel):
     answer: Optional[Any] = None
     chunks: List[Chunk]
     type: str
+    resolved_entities: Optional[List[ResolvedEntitySchema]] = None
 
 
 class QueryAnswer(BaseModel):
@@ -67,6 +78,7 @@ class QueryAnswerResponse(BaseModel):
 
     answer: QueryAnswer
     chunks: List[Chunk]
+    resolved_entities: Optional[List[ResolvedEntitySchema]] = None
 
 
 # Type for search responses (used in service layer)
 
@@ -1,10 +1,15 @@
 """Query service."""
 
 import logging
-from typing import Any, Awaitable, Callable, List
+import re
+from typing import Any, Awaitable, Callable, Dict, List, Union
 
 from app.models.query_core import Chunk, FormatType, QueryType, Rule
-from app.schemas.query_api import QueryResult, SearchResponse
+from app.schemas.query_api import (
+    QueryResult,
+    ResolvedEntitySchema,
+    SearchResponse,
+)
 from app.services.llm_service import (
     CompletionService,
     generate_inferred_response,
@@ -38,6 +43,65 @@ def extract_chunks(search_response: SearchResponse) -> List[Chunk]:
     )
 
 
+def replace_keywords(
+    text: Union[str, List[str]], keyword_replacements: Dict[str, str]
+) -> tuple[
+    Union[str, List[str]], Dict[str, Union[str, List[str]]]
+]:  # Changed return type
+    """Replace keywords in text and return both the modified text and transformation details."""
+    if not text or not keyword_replacements:
+        return text, {
+            "original": text,
+            "resolved": text,
+        }  # Return dict instead of TransformationDict
+
+    # Handle list of strings
+    if isinstance(text, list):
+        original_text = text.copy()
+        result = []
+        modified = False
+
+        # Create a single regex pattern for all keywords
+        pattern = "|".join(map(re.escape, keyword_replacements.keys()))
+        regex = re.compile(f"\\b({pattern})\\b")
+
+        for item in text:
+            # Single pass replacement for all keywords
+            new_item = regex.sub(
+                lambda m: keyword_replacements[m.group()], item
+            )
+            result.append(new_item)
+            if new_item != item:
+                modified = True
+
+        if modified:
+            return result, {"original": original_text, "resolved": result}
+        return result, {"original": original_text, "resolved": result}
+
+    # Handle single string
+    return replace_keywords_in_string(text, keyword_replacements)
+
+
+def replace_keywords_in_string(
+    text: str, keyword_replacements: Dict[str, str]
+) -> tuple[str, Dict[str, Union[str, List[str]]]]:
+    """Replace keywords in a single string and return the transformation details."""
+    if not text:
+        return text, {"original": text, "resolved": text}
+
+    # Create a single regex pattern for all keywords
+    pattern = "|".join(map(re.escape, keyword_replacements.keys()))
+    regex = re.compile(f"\\b({pattern})\\b")
+
+    # Single pass replacement
+    result = regex.sub(lambda m: keyword_replacements[m.group()], text)
+
+    # Only return transformation if something changed
+    if result != text:
+        return result, {"original": text, "resolved": result}
+    return text, {"original": text, "resolved": text}
+
+
 async def process_query(
     query_type: QueryType,
     query: str,
@@ -59,14 +123,69 @@ async def process_query(
     )
     answer_value = answer["answer"]
 
-    result_chunks = (
-        []
-        if answer_value in ("not found", None)
-        and query_type != "decomposition"
-        else chunks
-    )
+    transformations: Dict[str, Union[str, List[str]]] = {
+        "original": "",
+        "resolved": "",
+    }
+
+    result_chunks = []
+
+    if format in ["str", "str_array"]:
+
+        # Extract and apply keyword replacements from all resolve_entity rules
+        resolve_entity_rules = [
+            rule for rule in rules if rule.type == "resolve_entity"
+        ]
+
+        result_chunks = (
+            []
+            if answer_value in ("not found", None)
+            and query_type != "decomposition"
+            else chunks
+        )
 
-    return QueryResult(answer=answer_value, chunks=result_chunks[:10])
+        # First populate the replacements dictionary
+        replacements: Dict[str, str] = {}
+        if resolve_entity_rules and answer_value:
+            for rule in resolve_entity_rules:
+                if rule.options:
+                    rule_replacements = dict(
+                        option.split(":", 1) for option in rule.options
+                    )
+                    replacements.update(rule_replacements)
+
+            # Then apply the replacements if we have any
+            if replacements:
+                print(f"Resolving entities in answer: {answer_value}")
+                if isinstance(answer_value, list):
+                    transformed_list, transform_dict = replace_keywords(
+                        answer_value, replacements
+                    )
+                    transformations = transform_dict
+                    answer_value = transformed_list
+                else:
+                    transformed_value, transform_dict = replace_keywords(
+                        answer_value, replacements
+                    )
+                    transformations = transform_dict
+                    answer_value = transformed_value
+
+    return QueryResult(
+        answer=answer_value,
+        chunks=result_chunks[:10],
+        resolved_entities=(
+            [
+                ResolvedEntitySchema(
+                    original=transformations["original"],
+                    resolved=transformations["resolved"],
+                    source={"type": "column", "id": "some-id"},
+                    entityType="some-type",
+                )
+            ]
+            if transformations["original"] or transformations["resolved"]
+            else None
+        ),
+    )
 
 
 # Convenience functions for specific query types
@@ -145,4 +264,23 @@ async def inference_query(
     )
     answer_value = answer["answer"]
 
+    # Extract and apply keyword replacements from all resolve_entity rules
+    resolve_entity_rules = [
+        rule for rule in rules if rule.type == "resolve_entity"
+    ]
+
+    if resolve_entity_rules and answer_value:
+        # Combine all replacements from all resolve_entity rules
+        replacements = {}
+        for rule in resolve_entity_rules:
+            if rule.options:
+                rule_replacements = dict(
+                    option.split(":", 1) for option in rule.options
+                )
+                replacements.update(rule_replacements)
+
+        if replacements:
+            print(f"Resolving entities in answer: {answer_value}")
+            answer_value, _ = replace_keywords(answer_value, replacements)
+
     return QueryResult(answer=answer_value, chunks=[])
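
Review note: `replace_keywords` wraps the alternation in `\b...\b`. A `\b` only matches between a word character and a non-word character, so an alias that ends in punctuation, such as the README's "Blackrock, Inc.", does not match when it is followed by a space or the end of the string. The sketch below shows one possible alternative using lookarounds, with longer aliases tried first. `resolve_aliases` is a hypothetical name, and the function returns only the replaced text rather than the `(text, transformation)` tuple used in this PR.

```python
import re
from typing import Dict


def resolve_aliases(text: str, replacements: Dict[str, str]) -> str:
    """Replace whole-word aliases with their canonical names in one pass."""
    if not text or not replacements:
        return text
    # Try longer aliases first so "Blackrock, Inc." wins over a shorter one.
    aliases = sorted(replacements, key=len, reverse=True)
    pattern = "|".join(re.escape(alias) for alias in aliases)
    # Lookarounds instead of \b: they still work when an alias starts or
    # ends with punctuation, such as the trailing "." in "Inc.".
    regex = re.compile(rf"(?<!\w)(?:{pattern})(?!\w)")
    return regex.sub(lambda m: replacements[m.group()], text)


replacements = {
    "blackrock": "Blackrock",
    "Blackrock, Inc.": "Blackrock",
    "Blackrock Corporation": "Blackrock",
}
print(resolve_aliases("Shares of Blackrock, Inc. rose.", replacements))
# -> Shares of Blackrock rose.
```

Matching stays case-sensitive, as in the PR, so each spelling variant still has to be listed as its own alias.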