Split text entries by max tokens supported by ML models #105


Merged
7 commits merged into master from chunk-entries-by-max-token-size on Dec 26, 2022

Conversation

debanjum
Member

@debanjum debanjum commented Dec 26, 2022

Background

There is a limit on the maximum number of input tokens (roughly, words) that an ML model can encode into an embedding vector.
For the models used for text search in khoj, a max token size of 256 words is appropriate [1][2]

Issue

Until now, entries exceeding the max token size would silently get truncated during embedding generation.
The truncated portion of an entry would thus be ignored when matching queries against entries.
This degraded the quality of search results

Fix

  • e057c8e Add method to split entries by specified max tokens limit
  • Split entries by max tokens while converting Org, Markdown and Beancount entries to JSONL
  • b283650 Deduplicate results for user query by raw text before returning results
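The splitting step above can be sketched roughly as below. This is an illustrative sketch, not khoj's actual implementation: the function name and the word-based chunking (splitting on whitespace, which only approximates model tokenization) are assumptions.

```python
def split_entry_by_max_tokens(entry: str, max_tokens: int = 256) -> list[str]:
    """Split an entry into chunks of at most max_tokens words.

    Illustrative sketch: approximates tokens by whitespace-separated words,
    so each chunk fits within the embedding model's input limit.
    """
    words = entry.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

# Example: a 5-word entry split with a limit of 2 words per chunk
chunks = split_entry_by_max_tokens("one two three four five", max_tokens=2)
# chunks == ["one two", "three four", "five"]
```

Each chunk would then be converted to JSONL and embedded separately, so no part of a long entry is silently dropped.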

Results

  • The quality of the search results should improve
  • Relevant, long entries should show up in results more often

- Issue
   ML Models truncate entries exceeding some max token limit.
   This lowers the quality of search results

- Fix
  Split entries by max tokens before indexing.
  This should improve searching for content in longer entries.

- Miscellaneous
  - Test method to split entries by max tokens
  - Test usage of entry splitting by max tokens in text search
  - Fix logger message when converting org node to entries
  - Remove unused import from conftest
- Required because entries are now split by the max_word count supported
  by the ML models
- This can now result in potentially duplicate hits, i.e. the same entry
  being returned to the user more than once
- Deduplicate after ranking to get the top-ranked deduplicated results
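The deduplication described above can be sketched as follows; the function name and the shape of the hit dicts are hypothetical, assumed only for illustration.

```python
def deduplicate_by_raw_text(ranked_hits: list[dict]) -> list[dict]:
    """Keep only the highest-ranked hit for each distinct raw entry text.

    Assumes ranked_hits is already sorted best-first, so the first
    occurrence of a raw text is its top-ranked hit.
    """
    seen: set[str] = set()
    deduped = []
    for hit in ranked_hits:
        if hit["raw"] not in seen:
            seen.add(hit["raw"])
            deduped.append(hit)
    return deduped

# Two chunks of the same entry matched the query; keep the better-scored one
hits = [
    {"raw": "entry A", "score": 0.9},
    {"raw": "entry A", "score": 0.5},
    {"raw": "entry B", "score": 0.4},
]
top_hits = deduplicate_by_raw_text(hits)
```

Deduplicating after ranking (rather than before) is what guarantees the surviving hit for each entry is its best-scoring chunk.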
…ug logs

- Remove property drawer from test entry for max_words splitting test
  - Property drawer is not required for the test
  - Keep minimal test case to reduce chance for confusion
@debanjum debanjum force-pushed the chunk-entries-by-max-token-size branch from d5b340c to 24676f9 Compare December 26, 2022 01:38
@debanjum debanjum added the fix Fix something that isn't working as expected label Dec 26, 2022
@debanjum debanjum added this to the Release 0.3.0 milestone Dec 26, 2022
@debanjum debanjum added upgrade New feature or request and removed fix Fix something that isn't working as expected labels Dec 26, 2022
@debanjum debanjum changed the title Split entries by max tokens supported by ML models Split text entries by max tokens supported by ML models Dec 26, 2022
@debanjum debanjum merged commit 06c2568 into master Dec 26, 2022
@debanjum debanjum deleted the chunk-entries-by-max-token-size branch December 26, 2022 18:24