
Add Incremental Indexing to Wikipedia Search

Input	2 text files with 1 sentence per line
Output	top_k number of sentences that match input query

This is an example of using Jina's neural search framework to add incremental indexing to our Wikipedia sentence search example.

Prerequisites

Run and understand our Wikipedia sentence search example

What is incremental indexing?

Briefly, incremental indexing lets a user re-use an existing index to add new data. It also automatically skips documents whose IDs are already in the index, so duplicates are not added. For a more in-depth explanation of how Jina handles this, refer to our documentation.
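To make the idea concrete, here is a minimal, framework-free sketch. It is not Jina's implementation: the `TinyIndex` class, its JSON file format, and the file name `index.json` are all hypothetical. It only illustrates the two properties described above, namely re-opening a stored index to append data and skipping IDs that are already present.

```python
import json
import os
import tempfile


class TinyIndex:
    """Hypothetical file-backed index that maps document ID -> text."""

    def __init__(self, path):
        self.path = path
        self.docs = {}
        # Re-use an existing index file if one is already on disk.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fp:
                self.docs = json.load(fp)

    def add(self, new_docs):
        """Add (doc_id, text) pairs and return how many were actually new."""
        added = 0
        for doc_id, text in new_docs:
            if doc_id in self.docs:
                continue  # duplicate ID: skip instead of indexing again
            self.docs[doc_id] = text
            added += 1
        return added

    def save(self):
        with open(self.path, "w", encoding="utf-8") as fp:
            json.dump(self.docs, fp)


index_path = os.path.join(tempfile.mkdtemp(), "index.json")

# First session: build the index from scratch.
first = TinyIndex(index_path)
added_first = first.add([("a", "Paris is in France."), ("b", "Rome is in Italy.")])
first.save()

# Second session: re-open the same index and append new data.
second = TinyIndex(index_path)
added_second = second.add([("b", "Rome is in Italy."), ("c", "Oslo is in Norway.")])
second.save()
```

In the second session only document `c` is added, because `b` is already stored.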

Configuration changes

In order to adapt the Wikipedia sentence search example to support incremental indexing, we need to:

Edit Flows

Change the indexer entry in flows/index.yml and flows/query.yml to use DocCache as a filter (in the uses_before field). This prevents duplicate documents from reaching the indexer.

In this example the DocCache is separated into its own .yml file, in pods/index_cache.yml:

!DocCache
with:
  index_path: cache.tmp
metas:
  name: cache
  workspace: $JINA_WORKSPACE
requests:
  on:
    [SearchRequest, TrainRequest, IndexRequest, ControlRequest]:
      - !RouteDriver {}
    IndexRequest:
      - !TaggingCacheDriver
        with:
          tags:
            is_indexed: true
      - !FilterQL
        with:
          lookups: {tags__is_indexed__neq: true}

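This might look complicated, but the logic is simple. For an IndexRequest, the cache is checked for matching doc IDs: documents that are already cached are tagged with is_indexed: true, and the FilterQL lookup then keeps only documents without that tag, so only new documents go on to be indexed. The other request types listed are handled by the RouteDriver.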
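The tag-then-filter idea in this config can be sketched in plain Python. This is an illustration only, assuming the two drivers behave as their names and parameters suggest; `tag_cached` and `filter_neq` are hypothetical helpers, not Jina functions, and documents are modelled as simple dicts.

```python
def tag_cached(docs, cache, tags):
    """Mimic the tagging step: mark docs whose ID is already in the cache."""
    for doc in docs:
        if doc["id"] in cache:
            doc.setdefault("tags", {}).update(tags)
        else:
            cache.add(doc["id"])  # remember the ID for later requests
    return docs


def filter_neq(docs, key, value):
    """Mimic a `tags__<key>__neq` lookup: keep docs whose tag differs from value."""
    return [doc for doc in docs if doc.get("tags", {}).get(key) != value]


cache = set()  # stands in for the cache stored at index_path
batch_1 = [{"id": "1", "text": "first"}, {"id": "2", "text": "second"}]
batch_2 = [{"id": "2", "text": "second"}, {"id": "3", "text": "third"}]

kept_1 = filter_neq(tag_cached(batch_1, cache, {"is_indexed": True}), "is_indexed", True)
kept_2 = filter_neq(tag_cached(batch_2, cache, {"is_indexed": True}), "is_indexed", True)
```

Here `kept_1` holds both documents of the first batch, while `kept_2` holds only the document with ID `3`, since ID `2` was cached by the first batch.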

Adapt Dataset

We split the dataset into two files: input-1.txt and input-2.txt. This way we can index one, close the Flow object, and then index the other.

The environment variable JINA_DATA_FILE has also been split, with JINA_DATA_FILE_1 and JINA_DATA_FILE_2 pointing to the two files above.
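The example ships with the two files already split. If you want to try the same setup with your own data, a split along these lines could work. This is a sketch: the source file name `input.txt`, the even split down the middle, and the `split_lines` helper are assumptions, not part of the example.

```python
import os
import tempfile


def split_lines(src_path, out_path_1, out_path_2):
    """Write the first half of the lines to out_path_1 and the rest to out_path_2."""
    with open(src_path, encoding="utf-8") as fp:
        lines = fp.readlines()
    middle = len(lines) // 2
    with open(out_path_1, "w", encoding="utf-8") as fp:
        fp.writelines(lines[:middle])
    with open(out_path_2, "w", encoding="utf-8") as fp:
        fp.writelines(lines[middle:])
    return middle, len(lines) - middle


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "input.txt")  # hypothetical unsplit file
with open(source, "w", encoding="utf-8") as fp:
    fp.write("sentence one\nsentence two\nsentence three\nsentence four\nsentence five\n")

part_1 = os.path.join(workdir, "input-1.txt")
part_2 = os.path.join(workdir, "input-2.txt")
counts = split_lines(source, part_1, part_2)

# Point the two environment variables at the two halves.
os.environ["JINA_DATA_FILE_1"] = part_1
os.environ["JINA_DATA_FILE_2"] = part_2
```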

Adapt `app.py`

Adapt app.py to index, close, and index again.

When running python app.py -t index we would usually only index one file. We now have the following:

import os

from jina.flow import Flow

f = Flow().load_config("flows/index.yml")

# first session: index the first file, then close the Flow
with f:
    print(f'Indexing file {os.environ["JINA_DATA_FILE_1"]}')
    f.index_lines(
        ...
    )

# we then re-use the same index to append new data
with f:
    print(f'Indexing file {os.environ["JINA_DATA_FILE_2"]}')
    f.index_lines(
        ...
    )

This indexes the first file, closes the flow, and then indexes a second file.

Run in Docker

To test this example you can run a Docker image which will index both data files then enter query mode:

docker run -p 45678:45678 jinahub/app.example.wikipedia-sentences-incremental:0.1-0.9.24

You can then query by running:

curl --request POST -d '{"top_k": 10, "mode": "search", "data": ["text:hello world"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:45678/api/search'
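If you would rather query from Python, the curl command above can be mirrored with the standard library. This is a sketch: `build_search_request` is a hypothetical helper, and the shape of the JSON response is not documented here, so the sending step is left commented out and simply prints whatever comes back.

```python
import json
import urllib.request


def build_search_request(query, top_k=10, host="0.0.0.0", port=45678):
    """Build the same POST request as the curl command, without sending it."""
    payload = {"top_k": top_k, "mode": "search", "data": [f"text:{query}"]}
    return urllib.request.Request(
        url=f"http://{host}:{port}/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("hello world")

# Sending the request requires the Docker container to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```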
