In this demo we will build a vector search engine with Jina: given an indexed dataset and a query vector, it finds the closest vectors in the index.
But that is not all. Suppose that, after indexing your data, you want to query it with a different type of indexer. Do you have to index the data again for every indexer type? No. We will see how to use different indexers as vector databases and how to use a ref_indexer as a base indexer, so that you index once and then query with whichever indexer type you wish.
Before moving forward, we highly suggest completing/reviewing our Jina 101 and Jina "Hello, World!"👋🌍 to make sure we are on the same page.
Table of Contents
- Requirements
- Prepare the data
- Define the Flows
- Run the Flows
- Dive into the FaissIndexer
- Evaluate the results
- Wrap up
- Next Steps
- Documentation
- Community
- License
For this example you can use one of two datasets, ANN_SIFT10K or ANN_SIFT1M, each of which comprises three vector sets:
- 10K or 1M vectors to index
- 100 or 10K vectors to query
- 25K or 100K vectors to train
We will work with ANN_SIFT10K (siftsmall), which contains SIFT descriptors of an image dataset. If you wish, you can use the larger dataset from the same source. A Docker image is also published in which ANN_SIFT1M (sift) has already been indexed using 4 shards.
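If you want to inspect the vectors yourself, the sketch below reads them with the standard library only. It assumes the downloaded archive stores the vectors in the .fvecs layout (a little-endian int32 dimension followed by that many float32 values per vector); `read_fvecs` and `write_fvecs` are hypothetical helper names, not part of this example's code.

```python
import struct
from pathlib import Path


def read_fvecs(path):
    """Read a .fvecs file into a list of float vectors.

    Assumed layout per record: int32 dimension d, then d float32 values,
    all little-endian.
    """
    data = Path(path).read_bytes()
    vectors = []
    offset = 0
    while offset < len(data):
        (dim,) = struct.unpack_from('<i', data, offset)
        offset += 4
        vectors.append(list(struct.unpack_from(f'<{dim}f', data, offset)))
        offset += 4 * dim
    return vectors


def write_fvecs(path, vectors):
    """Write vectors in the same layout (handy for building tiny test files)."""
    with open(path, 'wb') as f:
        for vec in vectors:
            f.write(struct.pack('<i', len(vec)))
            f.write(struct.pack(f'<{len(vec)}f', *vec))
```

You would call it as `read_fvecs(path_to_base_vectors)`, where the path depends on where get_data.sh places the files.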
Since we want to query with different indexers, we will use Faiss and Annoy.
It is also useful to compare the results of those indexers, so we will show how to evaluate the ranking results of Faiss and Annoy, and add a search with NumpyIndexer as a baseline (it uses exhaustive search, so its recall is close to 100%). We encourage you to try other indexers and other indexer options to see what gives the best results and performance.
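To make "exhaustive search" concrete, here is a minimal sketch of the idea in plain Python: every indexed vector is compared with the query and the closest ones are kept. This is only an illustration under the assumption of Euclidean distance; it is not the NumpyIndexer implementation, and the function names are made up for this sketch.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exhaustive_search(query, index, top_k):
    """Return the ids of the top_k closest vectors, scanning the whole index.

    index maps a document id to its vector.
    """
    scored = ((euclidean(query, vec), doc_id) for doc_id, vec in index.items())
    return [doc_id for _, doc_id in heapq.nsmallest(top_k, scored)]
```

Because nothing is skipped, the result is exact, which is why this kind of search serves as the reference when measuring the recall of approximate indexers. The price is that query time grows linearly with the size of the index.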
Let's start! First thing is to be sure we have all the requirements, so we can run:
pip install -r requirements.txt
To make things easy, we will use the Docker images for Annoy and Faiss, so make sure to pull them from the Jina Hub repository.
Now let's get some data. We have prepared a small script that will download it:
./get_data.sh siftsmall
Cool, we have the data now, but Faiss needs to learn some patterns of the data in order to build an efficient indexing scheme. In other words, we still need training data, which is a subset of vectors that is not necessarily part of the index. Run this script to generate a workspace folder where the training data will be stored:
./generate_training_data.sh
This workspace folder will contain the built index once the vectors are indexed and will be mapped to the docker image.
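Why does an index need training at all? For inverted-file (IVF) style Faiss indexes, training learns a set of centroids from the training vectors; each indexed vector is then stored in the cell of its nearest centroid, and a query only scans the closest cells. The sketch below illustrates that idea in plain Python. It is not Faiss code, it assumes an IVF-style index (the index type actually used is chosen through JINA_FAISS_INDEX_KEY), and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Position of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def train_centroids(train_vectors, n_cells, n_iter=10):
    """Plain k-means, seeded with the first n_cells vectors for determinism."""
    centroids = [list(v) for v in train_vectors[:n_cells]]
    for _ in range(n_iter):
        buckets = [[] for _ in range(n_cells)]
        for vec in train_vectors:
            buckets[nearest(vec, centroids)].append(vec)
        for i, bucket in enumerate(buckets):
            if bucket:  # an empty cell keeps its previous centroid
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def build_inverted_lists(vectors, centroids):
    """Store each vector (with its position as id) in its nearest cell."""
    lists = {i: [] for i in range(len(centroids))}
    for doc_id, vec in enumerate(vectors):
        lists[nearest(vec, centroids)].append((doc_id, vec))
    return lists


def search(query, centroids, lists, top_k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: sq_dist(query, centroids[i]))
    candidates = [
        (sq_dist(query, vec), doc_id)
        for cell in order[:nprobe]
        for doc_id, vec in lists[cell]
    ]
    return [doc_id for _, doc_id in sorted(candidates)[:top_k]]
```

Scanning fewer cells is faster but can miss the true nearest neighbour when it lives in a cell that was not probed. That is the speed versus recall trade-off behind a parameter such as JINA_FAISS_NPROBE.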
With all the prerequisites in place, we can now index our data!
To index the data we will define our Flow with a YAML file. In the Flow YAML file, we add Pods in sequence. In this demo, we have two Pods defined, encoder and indexer, as you can see here:
flow-index.yml

!Flow
version: '1'
pods:
  - name: encoder
    uses: yaml/encode.yml
    shards: $JINA_PARALLEL
  - name: indexer
    uses: yaml/indexer.yml
    shards: $JINA_SHARDS
    timeout_ready: 10000
    polling: any
As a side note, there is actually another Pod working in silence: the input to the very first Pod always comes from the Pod named gateway, aka the "Forgotten" Pod. Most of the time we can safely ignore the gateway, because it essentially does the dirty work of orchestrating the Flow.
Ok, that takes care of indexing. For querying we need to do a similar thing, which means we also need a Flow to process the request message during querying. The query Flow looks very similar to the index Flow, but with an extra Pod used to evaluate the results.
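The $JINA_PARALLEL and $JINA_SHARDS placeholders in flow-index.yml are filled from environment variables. As a quick way to preview the resolved file and to catch a missing variable early, here is a small sketch using Python's string.Template. This is an illustration only, not how Jina itself performs the substitution, and the values 1 and 4 are example settings.

```python
from string import Template

# The index Flow from above, with its two placeholders.
FLOW_INDEX_TEMPLATE = """\
!Flow
version: '1'
pods:
  - name: encoder
    uses: yaml/encode.yml
    shards: $JINA_PARALLEL
  - name: indexer
    uses: yaml/indexer.yml
    shards: $JINA_SHARDS
    timeout_ready: 10000
    polling: any
"""


def render_flow(template, env):
    """Fill $NAME placeholders from env; raises KeyError if one is missing."""
    return Template(template).substitute(env)


if __name__ == '__main__':
    # Example values only; pick whatever suits your machine.
    print(render_flow(FLOW_INDEX_TEMPLATE, {'JINA_PARALLEL': '1', 'JINA_SHARDS': '4'}))
```

Passing `os.environ` as `env` would preview the Flow with the variables currently set in your shell.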
flow-query.yml

!Flow
version: '1'
env:
  OMP_NUM_THREADS: ${{OMP_NUM_THREADS}}
with:
  read_only: true
pods:
  - name: encoder
    show_exc_info: true
    uses: yaml/encode.yml
    shards: $JINA_PARALLEL
  - name: indexer
    polling: all
    show_exc_info: true
    uses: $JINA_USES
    uses_internal: $JINA_USES_INTERNAL
    shards: $JINA_SHARDS
    timeout_ready: -1
    uses_after: yaml/merge-matches-sort.yml
    volumes: './workspace:/docker-workspace'
    remove_uses_ba: true
    docker_kwargs:
      environment:
        JINA_FAISS_INDEX_KEY: $JINA_FAISS_INDEX_KEY
        JINA_FAISS_DISTANCE: $JINA_FAISS_DISTANCE
        JINA_FAISS_NORMALIZE: $JINA_FAISS_NORMALIZE
        JINA_FAISS_NPROBE: $JINA_FAISS_NPROBE
        JINA_ANNOY_METRIC: $JINA_ANNOY_METRIC
        JINA_ANNOY_NTREES: $JINA_ANNOY_NTREES
        JINA_ANNOY_SEARCH_K: $JINA_ANNOY_SEARCH_K
        OMP_NUM_THREADS: ${{OMP_NUM_THREADS}}
  - name: evaluate
    show_exc_info: true
    uses: yaml/evaluate.yml
All the environment variables are added so that it is easy for the user to try out different configurations of the annoy or faiss indexers.
In this Flow, the faiss_indexer is the one that performs the nearest-neighbour search for the given chunk (in this case, since every document has one chunk, the chunk and the document are the same). It returns the top_k most similar documents in order of similarity. Later, doc_indexer retrieves the actual document value from the document ID.
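When the indexer runs on several shards with polling: all, each shard searches only its own slice of the data, and the partial results have to be combined into one ranked list before the documents are looked up. A rough sketch of that combination step follows. It assumes each shard returns (distance, doc_id) pairs sorted by ascending distance; the function names and the `doc_store` dictionary are hypothetical, and this is not the code behind merge-matches-sort.yml.

```python
import heapq
from itertools import islice


def merge_shard_matches(shard_results, top_k):
    """Merge per-shard match lists into one global top_k list.

    shard_results: one list per shard of (distance, doc_id) pairs,
    each already sorted by ascending distance.
    """
    return list(islice(heapq.merge(*shard_results), top_k))


def fetch_documents(matches, doc_store):
    """Look up the stored document value for every matched id, in rank order."""
    return [doc_store[doc_id] for _, doc_id in matches]
```

For example, merging `[(0.1, 'd3'), (0.7, 'd1')]` from one shard with `[(0.2, 'd8'), (0.3, 'd5')]` from another and keeping the top 3 yields d3, d8, d5: the second-best match of the first shard is pushed out by two better matches from the second.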
That was a lot of info, so let's actually run our Flows now. To index, just run the following command:
python app.py -t index
You could also change request_size if you want.
Now, to query you can choose between annoy, faiss, or numpy. Run the script like this:
python app.py -t query -i {index_type}
It is important to make sure that the environment variables in app.py are set to the Docker image tags that you want to test.
The results with the default parameters for the Numpy, Faiss and Annoy indexers are:
python app.py -t query -i numpy
Recall@100 => 99.47000050544739%
python app.py -t query -i faiss
Recall@100 => 47.16999990865588%
python app.py -t query -i annoy
Recall@100 => 77.69999986886978%
Feel free to experiment with different parameters to improve these results.
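To read the numbers above, it helps to see how a recall figure can be computed. The sketch below assumes that Recall@100 means the fraction of ground-truth neighbours found among the first 100 results, averaged over all queries; the evaluator configured in yaml/evaluate.yml may differ in details, and the function names are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant ids that appear in the first k retrieved ids."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_at_k(all_retrieved, all_relevant, k):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(r, g, k) for r, g in zip(all_retrieved, all_relevant)]
    return sum(scores) / len(scores)
```

With this definition, an approximate indexer that returns only half of the true neighbours for every query scores 50%, regardless of how well it orders the ones it did find.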
To make things a little easier, we have built and published a Docker image with the ANN_SIFT1M dataset already indexed. You can retrieve the Docker image using:
docker pull jinahub/app.example.advancedvectorsearch:0.0.2-0.9.20
You can pull the latest tag and run it. By default it runs the search with the faiss indexer.
To run it, do:
docker run jinahub/app.example.advancedvectorsearch:0.0.2-0.9.20
If you want to run the image with annoy as the search library, you can override the entrypoint:
docker run -it --entrypoint=/bin/bash jinahub/app.example.advancedvectorsearch:0.0.2-0.9.20 entrypoint.sh annoy
If you want to change the parameters of the Faiss or Annoy indexer, you can pass different environment variables to the docker run command, for instance:
docker run -e JINA_FAISS_INDEX_KEY='Flat' jinahub/app.example.advancedvectorsearch:0.0.2-0.9.20
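If you plan to try many configurations, it can be convenient to build the command from a dictionary of settings. The following is a small sketch of such a helper; `docker_run_command` is a hypothetical name, and it only assembles the argument list shown above without running anything.

```python
import shlex

IMAGE = 'jinahub/app.example.advancedvectorsearch:0.0.2-0.9.20'


def docker_run_command(env, image=IMAGE):
    """Build a `docker run` argument list with one -e flag per variable."""
    args = ['docker', 'run']
    for name, value in env.items():
        args += ['-e', f'{name}={value}']
    args.append(image)
    return args


if __name__ == '__main__':
    # Prints the command as a shell-safe string (shlex.join needs Python 3.8+).
    print(shlex.join(docker_run_command({'JINA_FAISS_INDEX_KEY': 'Flat'})))
```

The returned list can be handed to `subprocess.run` as is, which avoids quoting problems when a value contains special characters.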
An important parameter to set is JINA_DISTANCE_REVERSE, whose value depends on the type of distance or metric that is used. For instance, for inner_product distance, JINA_DISTANCE_REVERSE should be set to True. This is because the measure returned by Faiss is then a similarity and not a distance, which means that the results should be sorted in descending order to get what we would expect.
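The effect of the sort direction can be sketched in a few lines. This is an illustration of the idea only, with made-up function names, and not the code that reads JINA_DISTANCE_REVERSE.

```python
def inner_product(a, b):
    """Inner product of two vectors: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def sort_matches(matches, reverse):
    """Order (doc_id, score) pairs by score.

    reverse=False: score is a distance, smallest first.
    reverse=True: score is a similarity, largest first.
    """
    return sorted(matches, key=lambda match: match[1], reverse=reverse)
```

With scores of 0.9 for document a and 0.1 for document b, treating the scores as distances ranks b first, while treating them as inner-product similarities ranks a first. Using the wrong direction therefore returns the worst matches at the top.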
Another parameter that cannot be found in the init arguments of FaissIndexer or AnnoyIndexer is OMP_NUM_THREADS. This controls how many threads Faiss uses when querying. Since the image has been built with 4 shards (around 250K documents each), OMP_NUM_THREADS is set to 1 so that the example uses 4 CPUs. Feel free to tweak this parameter and check its effect on the quality and speed of the results.
In this example we have seen how to use different indexers as vector databases and how to use a ref_indexer as a base indexer. We have also seen how to use a Pod inside a Docker container in our index and query Flows, and how to use evaluators to assess the quality of our search system.
Where to go from here? You can always try:
- different kinds of inverted indices and options from FAISS or Annoy.
- other indexers.
- indexing larger datasets.
Finally, play around with different evaluation metrics.
Enjoy Coding with Jina!

The best way to learn Jina in depth is to read our documentation. Documentation is built on every push, merge, and release event of the master branch. You can find more details about the following topics in our documentation.
- Jina command line interface arguments explained
- Jina Python API interface
- Jina YAML syntax for executor, driver and flow
- Jina Protobuf schema
- Environment variables used in Jina
- ... and more
- Slack channel - a communication platform for developers to discuss Jina
- Community newsletter - subscribe to the latest update, release and event news of Jina
- LinkedIn - get to know Jina AI as a company and find job opportunities
- follow us and interact with us using hashtag #JinaSearch
- Company - know more about our company, we are fully committed to open-source!
Copyright (c) 2021 Jina AI Limited. All rights reserved.
Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.