Split text entries by max tokens supported by ML models #105


Merged
7 commits merged into master from chunk-entries-by-max-token-size on Dec 26, 2022

Conversation

debanjum
Member

@debanjum debanjum commented Dec 26, 2022

Background

There is a limit on the maximum number of input tokens (roughly, words) that an ML model can encode into an embedding vector.
For the models used for text search in khoj, a max token size of 256 words is appropriate [1][2]

Issue

Until now, entries exceeding the max token size would silently get truncated during embedding generation.
The truncated portion of an entry would thus be ignored when matching queries against entries.
This degraded the quality of search results

Fix

  • e057c8e Add method to split entries by specified max tokens limit
  • Split entries by max tokens while converting Org, Markdown and Beancount entries to JSONL
  • b283650 Deduplicate results for user query by raw text before returning results
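The splitting step above can be sketched roughly as below. This is an illustrative sketch, not khoj's actual implementation: the function name and the word-based chunking (splitting on whitespace, which only approximates model tokenization) are assumptions.

```python
def split_entry_by_max_tokens(entry: str, max_tokens: int = 256) -> list[str]:
    """Split an entry into chunks of at most max_tokens words.

    Illustrative sketch: approximates tokens by whitespace-separated words,
    so each chunk fits within the embedding model's input limit.
    """
    words = entry.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

# Example: a 5-word entry split with a limit of 2 words per chunk
chunks = split_entry_by_max_tokens("one two three four five", max_tokens=2)
# chunks == ["one two", "three four", "five"]
```

Each chunk would then be converted to JSONL and embedded separately, so no part of a long entry is silently dropped.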

Results

  • The quality of the search results should improve
  • Relevant, long entries should show up in results more often

- Issue
   ML Models truncate entries exceeding some max token limit.
   This lowers the quality of search results

- Fix
  Split entries by max tokens before indexing.
  This should improve searching for content in longer entries.

- Miscellaneous
  - Test method to split entries by max tokens
  - Test usage of entry splitting by max tokens in text search
  - Fix logger message when converting org node to entries
  - Remove unused import from conftest
- Required because entries are now split by the max_word count supported
  by the ML models
- This can now result in potentially duplicate hits, i.e. the same entry
  being returned to the user more than once
- Deduplicate after ranking to get the top-ranked deduplicated results
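The deduplication described above can be sketched as follows; the function name and the shape of the hit dicts are hypothetical, assumed only for illustration.

```python
def deduplicate_by_raw_text(ranked_hits: list[dict]) -> list[dict]:
    """Keep only the highest-ranked hit for each distinct raw entry text.

    Assumes ranked_hits is already sorted best-first, so the first
    occurrence of a raw text is its top-ranked hit.
    """
    seen: set[str] = set()
    deduped = []
    for hit in ranked_hits:
        if hit["raw"] not in seen:
            seen.add(hit["raw"])
            deduped.append(hit)
    return deduped

# Two chunks of the same entry matched the query; keep the better-scored one
hits = [
    {"raw": "entry A", "score": 0.9},
    {"raw": "entry A", "score": 0.5},
    {"raw": "entry B", "score": 0.4},
]
top_hits = deduplicate_by_raw_text(hits)
```

Deduplicating after ranking (rather than before) is what guarantees the surviving hit for each entry is its best-scoring chunk.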
…ug logs

- Remove property drawer from test entry for max_words splitting test
  - Property drawer is not required for the test
  - Keep minimal test case to reduce chance for confusion
@debanjum debanjum force-pushed the chunk-entries-by-max-token-size branch from d5b340c to 24676f9 Compare December 26, 2022 01:38
@debanjum debanjum added the fix Fix something that isn't working as expected label Dec 26, 2022
@debanjum debanjum added this to the Release 0.3.0 milestone Dec 26, 2022
@debanjum debanjum added upgrade New feature or request and removed fix Fix something that isn't working as expected labels Dec 26, 2022
@debanjum debanjum changed the title Split entries by max tokens supported by ML models Split text entries by max tokens supported by ML models Dec 26, 2022
@debanjum debanjum merged commit 06c2568 into master Dec 26, 2022
@debanjum debanjum deleted the chunk-entries-by-max-token-size branch December 26, 2022 18:24