Enable Creating Content Plugins #173

Merged: 17 commits, Mar 1, 2023

@debanjum debanjum commented Feb 27, 2023

Goal

Use plugins to index and search text content that Khoj does not support by default

Usage

  1. Create a content plugin. It should render the source content into the standardized JSONL format expected by Khoj
    • JSONL is a list of JSON objects separated by newlines
    • For Khoj, each JSON line contains data about one entry to be indexed, e.g. a Markdown heading
    • Each JSON object needs raw, compiled and file keys
      • The raw value is rendered as the search result in the interfaces
      • The compiled value is used to generate embeddings for search
      • The file value can be used to jump to the result in its source file from the interfaces
  2. Configure khoj.yml to point to the plugin-generated JSONL files via the input-files and input-filter fields. See tests/data/config.yml. This is done on the user's machine, either via code or by the user manually updating their khoj.yml file
    @@ -6,6 +6,17 @@ content-type:
        embeddings-file: ".note_embeddings.pt"
        index-header-entries: true
    
    +  plugins:
    +    content_plugin_1:
    +      input-files: [ "content_plugin_1_new.jsonl" ]
    +      compressed-jsonl: "content_plugin_1.jsonl.gz"
    +      embeddings-file: "content_plugin_1_embeddings.pt"
    +
    +    content_plugin_2:
    +      input-filter: [ "*2_new.jsonl" ]
    +      compressed-jsonl: "content_plugin_2.jsonl.gz"
    +      embeddings-file: "content_plugin_2_embeddings.pt"
    
    search-type:
      asymmetric:
        encoder: "sentence-transformers/msmarco-MiniLM-L-6-v3"
  3. Index plugin content by triggering an update from the interfaces, (re-)starting the Khoj server or calling the /api/update?t=<plugin-name> API endpoint
  4. Set the search type to the plugin via the interface menu, then search via the Web or Emacs interface as usual
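
A minimal sketch of steps 1 and 3, assuming a plugin that indexes Markdown headings. The helper names (`render_entries`, `write_jsonl`, `update_url`) and the `http://localhost:8000` base URL are hypothetical and not part of this PR. Only the raw, compiled and file keys and the /api/update?t=<plugin-name> endpoint come from the description above.

```python
import json
from urllib.parse import urlencode


def render_entries(headings_by_file):
    """Turn {file path: [heading, ...]} into entry dicts with the three required keys."""
    entries = []
    for file, headings in headings_by_file.items():
        for heading in headings:
            entries.append({
                "raw": heading,                            # rendered as the search result
                "compiled": heading.lstrip("# ").strip(),  # text used to generate embeddings
                "file": file,                              # source file to jump to
            })
    return entries


def write_jsonl(entries, output_path):
    """Write one JSON object per line (JSONL)."""
    with open(output_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def update_url(plugin_name, base_url="http://localhost:8000"):
    """Build the update URL for a plugin. base_url is an assumed local server address."""
    return f"{base_url}/api/update?{urlencode({'t': plugin_name})}"
```

The output path passed to `write_jsonl` should match an input-files entry or an input-filter pattern for the plugin in khoj.yml, for example `content_plugin_1_new.jsonl` in the config diff above.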

Code Changes

  • fcbbe8c Configure content plugins to index using khoj.yml
  • Index content plugins from standardized JSONL format for ingestion
    • 55a032e Add jsonl processor to index plugin content
    • ab0d3a0 Index configured plugins on app start and via update API endpoint
  • Expose plugin content types for usage by interfaces
    • 47b58a2 Dynamically update available types on loading the Khoj server
    • Expose indexed types via API (9d38ead). Simplify getting enabled types in Web (f3f2438), Emacs (1e43f1a) interfaces
  • Search plugin content from the Web and Emacs Interfaces
    • d91c7e2 Search plugin content via the search API
    • Render plugin content on Web (88344f9) and Emacs (c2814fc) interfaces
      • The Web and Emacs interfaces are general interfaces that allow searching across all content types
      • The Obsidian interface is currently tuned for Markdown content only.
        It will be extended to render more content plugins later

Testing

  • fcbbe8c Add unit tests to test reading plugin config from khoj.yml
  • 55a032e Add unit tests for the JsonlToJsonl processor
  • 88a9ead Add unit tests to validate that search, incremental update and force update via the API work with plugin content types
  • b09350c Add unit test to validate that only configured search types are returned by the new /api/config/types API endpoint
  • Manually test the config read, indexing, search and update with local khoj

Configure external text content plugins via the Khoj YAML
Reuse existing TextContentConfig definition for external text content plugins
- Read, merge entries from input jsonl files and filters
- Mark new, modified entries for update
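
The two steps in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the JsonlToJsonl processor from this PR: the function names are hypothetical, and hashing the compiled text to detect new or modified entries is an assumed strategy.

```python
import glob
import hashlib
import json


def collect_jsonl_files(input_files=None, input_filter=None):
    """Merge explicitly listed files with files matching the glob filters."""
    files = set(input_files or [])
    for pattern in input_filter or []:
        files.update(glob.glob(pattern))
    return sorted(files)


def load_entries(jsonl_files):
    """Read every non-empty line of every file as one JSON entry."""
    entries = []
    for path in jsonl_files:
        with open(path, encoding="utf-8") as f:
            entries.extend(json.loads(line) for line in f if line.strip())
    return entries


def mark_entries(current, previous):
    """Pair each current entry with True if its compiled text was not indexed before."""
    def digest(entry):
        return hashlib.sha256(entry["compiled"].encode("utf-8")).hexdigest()

    seen = {digest(entry) for entry in previous}
    return [(digest(entry) not in seen, entry) for entry in current]
```

Only the entries flagged `True` would need fresh embeddings on an incremental update.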
@debanjum debanjum requested a review from sabaimran February 27, 2023 20:31
@debanjum debanjum added the upgrade (New feature or request) and coverage (Add content type to search and index) labels Feb 27, 2023
@debanjum debanjum force-pushed the enable-creating-content-plugins branch from 021cbb6 to 368f475 Compare March 1, 2023 00:19
The SearchType is now dynamically populated with core and configured
plugin types

Use the new dynamic SearchType enum from state.py across codebase
Configure app routes after configuring server.
Import API routers after search type is dynamically populated.
Allow API to recognize the dynamically populated plugin search types
as valid type query param.
Enable searching for plugin type content.
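
One way to populate an enum at runtime is Python's functional `Enum` API. The sketch below assumes that approach; the `build_search_type` helper and the example core type names are illustrative and not taken from state.py.

```python
from enum import Enum


def build_search_type(core_types, plugin_names):
    """Create a SearchType enum whose members are the core plus configured plugin types."""
    members = {name: name for name in [*core_types, *plugin_names]}
    return Enum("SearchType", members)


# Illustrative values only: core types assumed, plugin name taken from the config diff
SearchType = build_search_type(["org", "markdown"], ["content_plugin_1"])
```

Because the enum only exists after the config is read, anything that uses it for validation (such as a type query param) has to be imported or defined afterwards, which matches the ordering described in the commits above.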
Simplifies dynamically populating enabled content types for interfaces
Render only the entry from plugin search response instead of raw json
…face

Render only the entry from plugin search response instead of raw json
Use the results-ledger styling for results-plugin styling
It may have been disabled due to issues with image search earlier
@debanjum debanjum force-pushed the enable-creating-content-plugins branch from 79aba96 to ede6eb6 Compare March 1, 2023 02:26
@debanjum debanjum linked an issue Mar 1, 2023 that may be closed by this pull request

debanjum commented Mar 1, 2023

  • The khoj.el tests are failing because of a bug in an external workflow dependency. This can be ignored for now
  • The publish to test PyPI workflow is failing because of unrelated auth issues. This too can be ignored for now

I'll fix (or disable) these workflows separately

@sabaimran
Member

Looks great! Looking forward to migrating Panchayat to using the plugin format.

- Remove need for interfaces to downcase content types returned by API
  before using the type in search and other API endpoints
- Fix to check for search_type.name in plugin keys instead of value
- Previously, all core content types were returned even if they had not
  been set up
- Add test to validate only configured content types are returned by
  the api/config/types API endpoint
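
The intended behavior can be sketched as below, assuming the content config is available as a plain dict that mirrors the content-type section of khoj.yml. The `configured_types` function is hypothetical and is not the endpoint's actual implementation.

```python
def configured_types(content_config):
    """Return only the content types that are actually configured.

    Core types with no config are skipped, and a missing or empty
    plugins section is handled without raising.
    """
    content_config = content_config or {}
    types = [
        name
        for name, config in content_config.items()
        if name != "plugins" and config
    ]
    plugins = content_config.get("plugins") or {}
    types.extend(plugins.keys())
    return types
```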
@debanjum debanjum merged commit e77a5ff into master Mar 1, 2023
@debanjum debanjum deleted the enable-creating-content-plugins branch March 1, 2023 04:25
debanjum added a commit that referenced this pull request Mar 1, 2023
…gured

- Test /config/types API when no plugin configured, only plugin configured
  and no content configured scenarios
- Do not throw null reference exception while configuring search types
  when no plugin configured
- Do not throw null reference exception on calling /config/types API
  when no plugin configured

Resolves bug introduced by #173
Development

Successfully merging this pull request may close these issues.

Support Content Plugins in Khoj