| Input | 2 text files with 1 sentence per line |
| Output | top_k number of sentences that match input query |
This example uses Jina's neural search framework to add incremental indexing to our Wikipedia sentence search example.
- Run and understand our Wikipedia sentence search example
Briefly, incremental indexing lets a user re-use an existing index to add new data. It also automatically skips documents whose IDs are already in the index, so duplicates are not added. For a more in-depth explanation of how Jina handles this, refer to our documentation.
In order to adapt the Wikipedia sentence search example to support incremental indexing, we need to:
Change indexer entry in flows/index.yml and flows/query.yml to use DocCache as a filter (in the uses_before field). This ensures that we prevent duplicates.
In this example the DocCache is separated into its own .yml file, in pods/index_cache.yml:
!DocCache
with:
  index_path: cache.tmp
metas:
  name: cache
  workspace: $JINA_WORKSPACE
requests:
  on:
    [SearchRequest, TrainRequest, IndexRequest, ControlRequest]:
      - !RouteDriver {}
    IndexRequest:
      - !TaggingCacheDriver
        with:
          tags:
            is_indexed: true
      - !FilterQL
        with:
          lookups: {tags__is_indexed__neq: true}
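This might look complicated, but it simply checks the cache for matching doc IDs before a document reaches the indexer. On an IndexRequest, TaggingCacheDriver tags documents whose IDs are already cached with is_indexed: true, and FilterQL then drops every tagged document. All other requests are only routed through.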
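The tag-then-filter idea can be sketched in plain Python. This is an illustration only, not Jina's implementation: the function names `tag_cached` and `drop_indexed`, the dict-based documents, and the in-memory `set` used as a cache are all assumptions made for the example.

```python
def tag_cached(docs, cache):
    """Tag documents whose ID is already cached; remember the IDs of new ones."""
    for doc in docs:
        if doc["id"] in cache:
            # Cache hit: mark the document, mirroring `is_indexed: true`.
            doc.setdefault("tags", {})["is_indexed"] = True
        else:
            cache.add(doc["id"])
    return docs


def drop_indexed(docs):
    """Keep only documents that were not tagged as already indexed."""
    return [d for d in docs if d.get("tags", {}).get("is_indexed") is not True]


cache = set()  # hypothetical stand-in for the on-disk cache
first = drop_indexed(tag_cached([{"id": "a"}, {"id": "b"}], cache))
second = drop_indexed(tag_cached([{"id": "b"}, {"id": "c"}], cache))
print([d["id"] for d in first])   # ['a', 'b']
print([d["id"] for d in second])  # ['c']
```

The second batch repeats ID `b`, so only `c` passes the filter and would reach the indexer.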
We split the dataset into two files: input-1.txt and input-2.txt. This way we can index one, close the Flow object, and then index the other.
The environment variable JINA_DATA_FILE has also been split, with JINA_DATA_FILE_1 and JINA_DATA_FILE_2 pointing to the two files above.
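If you want to reproduce the split with your own data, a minimal sketch might look like the following. The helper `split_lines`, the source file name `input.txt`, and the temporary directory are assumptions for illustration; the example's own files are already split.

```python
import os
import tempfile
from pathlib import Path


def split_lines(source, first, second):
    """Write the first half of the source lines to `first`, the rest to `second`."""
    lines = Path(source).read_text(encoding="utf-8").splitlines(keepends=True)
    middle = (len(lines) + 1) // 2  # an odd count puts the extra line in the first file
    Path(first).write_text("".join(lines[:middle]), encoding="utf-8")
    Path(second).write_text("".join(lines[middle:]), encoding="utf-8")
    return middle, len(lines) - middle


workdir = Path(tempfile.mkdtemp())
source = workdir / "input.txt"  # hypothetical name for the unsplit dataset
source.write_text("one\ntwo\nthree\n", encoding="utf-8")

counts = split_lines(source, workdir / "input-1.txt", workdir / "input-2.txt")

# Point the two environment variables at the resulting files.
os.environ["JINA_DATA_FILE_1"] = str(workdir / "input-1.txt")
os.environ["JINA_DATA_FILE_2"] = str(workdir / "input-2.txt")
print(counts)  # (2, 1)
```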
Adapt app.py to index, close, and index again.
When running python app.py -t index we would usually only index one file. We now have the following:
import os

from jina.flow import Flow

f = Flow().load_config("flows/index.yml")

# first session: index the first file, then close the Flow
with f:
    print(f'Indexing file {os.environ["JINA_DATA_FILE_1"]}')
    f.index_lines(
        ...
    )

# we then re-use the same index to append new data
with f:
    print(f'Indexing file {os.environ["JINA_DATA_FILE_2"]}')
    f.index_lines(
        ...
    )
This indexes the first file, closes the flow, and then indexes a second file.
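To make the open, index, close, reopen cycle concrete without running Jina, here is a toy stand-in. `TinyIndex` is a hypothetical class written for this sketch (it is not part of Jina): it loads its saved state on entering the `with` block and saves it on exit, and it uses each line's text as the document ID.

```python
import json
import os
import tempfile


class TinyIndex:
    """Hypothetical stand-in for a Flow: loads saved docs on enter, saves on exit."""

    def __init__(self, path):
        self.path = path
        self.docs = {}

    def __enter__(self):
        # Re-opening picks up whatever an earlier session stored.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as fh:
                self.docs = json.load(fh)
        return self

    def __exit__(self, exc_type, exc, tb):
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.docs, fh)
        return False

    def index_lines(self, lines):
        """Add lines keyed by their own text; return how many were new."""
        added = 0
        for line in lines:
            doc_id = line.strip()
            if doc_id and doc_id not in self.docs:
                self.docs[doc_id] = doc_id
                added += 1
        return added


index = TinyIndex(os.path.join(tempfile.mkdtemp(), "index.json"))
with index:
    first = index.index_lines(["alpha", "beta"])
with index:  # same index, second session
    second = index.index_lines(["beta", "gamma"])
print(first, second, sorted(index.docs))  # 2 1 ['alpha', 'beta', 'gamma']
```

The second session adds only `gamma`, because `beta` was persisted when the first session closed.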
To test this example you can run a Docker image which will index both data files then enter query mode:
docker run -p 45678:45678 jinahub/app.example.wikipedia-sentences-incremental:0.1-0.9.24
You can then query by running:
curl --request POST -d '{"top_k": 10, "mode": "search", "data": ["text:hello world"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:45678/api/search'
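If you prefer Python over curl, the same request can be built with the standard library. This is a sketch that mirrors the curl command above; the helper name `build_search_request` is an assumption, and the shape of the response is not shown because it depends on the running service.

```python
import json
import urllib.request


def build_search_request(query, top_k=10, url="http://0.0.0.0:45678/api/search"):
    """Build the same POST request as the curl command, without sending it."""
    payload = {"top_k": top_k, "mode": "search", "data": [f"text:{query}"]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("hello world")
print(request.get_method(), request.full_url)

# To send it once the Docker container is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```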