About this example: | |
---|---|
Learnings | How to configure Jina for querying while indexing |
Used for indexing | Text data |
Used for querying | Text data |
Dataset used | Wikipedia dataset from Kaggle |
Model used | flair-text |
This is an example of using Jina to support both querying and indexing simultaneously in our Wikipedia sentence search example.
- Prerequisites
- What is querying while indexing?
- Configuration changes
- 🐍 Build the app with Python
- Flow diagrams
- 🔮 Overview of the files
- Troubleshooting
- ⏭️ Next steps
- 👩👩👧👦 Community
- 🦄 License
- Run and understand our Wikipedia sentence search example
Querying while indexing means you can still query your data while new data is being inserted (or updated, or deleted) at the same time. Jina achieves this with its dump-reload feature.
This feature requires you to split the Flow into two, one for indexing (including updates and deletes) and one for querying, and to run both at the same time. You also need to replace the indexers in both Flows. The Index Flow (also referred to as the Storage Flow) requires a Storage Indexer, while the Query Flow requires a Compound Searcher.
In our case we use:
- LMDBStorage, which uses LMDB, a disk-based key-value store, as its storage engine.
- FaissLMDBSearcher, which uses the faiss library to provide faster query results and LMDB to retrieve the metadata.
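To make the dump-reload idea concrete, here is a toy sketch in plain Python. It does not use Jina, LMDB, or faiss: the class names ToyStorage and ToySearcher, the JSON snapshot format, and the substring search are all assumptions made purely for illustration. It only shows that writes go to the storage side while the query side keeps serving from the last snapshot it loaded.

```python
import json
import os
import tempfile


class ToyStorage:
    """Write side (hypothetical): accepts index and delete operations."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, text):
        self._docs[doc_id] = text

    def delete(self, doc_id):
        self._docs.pop(doc_id, None)

    def dump(self, path):
        # Write a point-in-time snapshot that the query side can load.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self._docs, f)


class ToySearcher:
    """Read side (hypothetical): serves queries from the last loaded snapshot."""

    def __init__(self):
        self._docs = {}

    def reload(self, path):
        with open(path, encoding="utf-8") as f:
            self._docs = json.load(f)

    def search(self, term):
        # A naive substring match stands in for vector search.
        return sorted(i for i, t in self._docs.items() if term.lower() in t.lower())


def demo():
    storage, searcher = ToyStorage(), ToySearcher()
    with tempfile.TemporaryDirectory() as tmp:
        dump_path = os.path.join(tmp, "dump.json")
        storage.index("1", "Paris is the capital of France")
        storage.dump(dump_path)
        searcher.reload(dump_path)
        before = searcher.search("capital")

        # New data is indexed, but queries still see the old snapshot...
        storage.index("2", "Rome is the capital of Italy")
        during = searcher.search("capital")

        # ...until the next dump and reload.
        storage.dump(dump_path)
        searcher.reload(dump_path)
        after = searcher.search("capital")
    return before, during, after
```

Calling demo() returns the search results before the second Document is indexed, while it is indexed but not yet dumped, and after the reload, which makes the delay between indexing and visibility explicit.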
These instructions explain how to run the example yourself and deploy it with Python.
- Have a working Python 3.7 or 3.8 environment.
- We recommend creating a new Python virtual environment to have a clean installation of Jina and prevent dependency conflicts.
- Install Docker Engine.
- Have at least 5 GB of free space on your hard drive.
Begin by cloning the repo so you can get the required files and datasets. (If you already have the examples repository on your machine, make sure to fetch the most recent version.)
git clone https://github.com/jina-ai/examples
cd examples/wikipedia-sentences-query-while-indexing
Let's install jina and the other required libraries. For further information on installing jina, check out our documentation.
pip install -r requirements.txt
In order to run the example you will need to do the following:
The repo includes a small subset of the Wikipedia dataset, for quick testing. You can just use that.
If you want to use the entire dataset, run bash get_data.sh and then modify the DATA_FILE constant (in app.py) to point to that file.
In this example, we use JinaD to serve the two Flows (Index and Query) and listen to incoming requests.
- Start the JinaD server using the command below:
  docker run --add-host host.docker.internal:host-gateway \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /tmp/jinad:/tmp/jinad \
    -p 8000:8000 \
    --name jinad \
    -d jinaai/jina:2.1.0-daemon
- Run python app.py -t flows
This will create the two Flows, and then repeatedly do the following (which can also be done in any other REST client), every 10 seconds:
- Index 5 Documents.
- Send a DUMP request to the Storage (Index) Flow to dump its data to a specific location.
- Send a ROLLING_UPDATE request to the Query Flow to take down its Indexers and start them again, with the new data located at the respective path.
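The cycle described above can be sketched as a small control loop. This is not the code in app.py: the function names, the dump path naming scheme, and the three callables passed in (index, dump, rolling_update) are hypothetical placeholders for whatever actually sends the requests to the two Flows.

```python
import time
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def run_cycles(docs, index, dump, rolling_update,
               batch_size=5, interval=10.0, sleep=time.sleep):
    """Index a batch, dump the storage, reload the query side, then wait."""
    cycles = 0
    for batch in batches(docs, batch_size):
        index(batch)                  # hypothetical: send the batch to the Storage Flow
        dump_path = f"dump-{cycles}"  # hypothetical naming scheme for dump locations
        dump(dump_path)               # hypothetical: ask the Storage Flow to dump
        rolling_update(dump_path)     # hypothetical: reload the Query Flow from the dump
        cycles += 1
        sleep(interval)
    return cycles
```

Passing the three operations in as callables keeps the loop independent of the transport, so the same loop could be driven by any REST client, and the sleep argument can be replaced with a no-op when trying it out.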
Warning: If you want to use the entire Wikipedia dataset, run bash get_data.sh and then modify the DATA_FILE constant to point to that file.
Finally, in a second terminal, run python app.py -t client
This will prompt you for a query, send the query to the Query Flow, and then show you the results.
Since the Flows use the http protocol, you can query the REST API with the Client provided by jina, or with cURL, Postman, or the custom Swagger UI provided with jina.
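As a rough illustration of querying over HTTP from Python without any extra dependencies, the sketch below builds a POST request with the standard library. The /search path, the payload shape, and the top_k parameter are assumptions here, and the port is deliberately left as a required argument: check flows/query.yml and app.py for the values this example actually uses.

```python
import json
import urllib.request


def build_search_request(text, port, host="localhost", top_k=5):
    """Build (but do not send) a POST request for the Query Flow.

    Assumptions: the /search path, the payload shape and the top_k
    parameter are illustrative and may differ from the real gateway.
    """
    payload = {"data": [{"text": text}], "parameters": {"top_k": top_k}}
    return urllib.request.Request(
        url=f"http://{host}:{port}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before anything goes over the network.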
JinaD creates several containers during this process. To remove them, run the following after you are done with the example:
docker stop $(docker ps -a -q)
and
docker rm $(docker ps -a -q)
Note that these commands stop and remove every container on the machine, not only the ones created by JinaD.
Below you can see a graphical representation of the Flow pipeline:
Notice the following:
- the encoder has the same configuration in both Flows
- the Query Flow uses replicas. One replica continues to serve requests while the other is being reloaded.
- the Indexer in the Query Flow is actually made up of two Indexers: one for vectors, one for Document metadata. On the Storage Flow, this data is stored into one Storage Indexer.
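The role of replicas during a reload can be illustrated with a toy model. Nothing here is Jina code: the Replica class and the rolling_update function are hypothetical names, and the sketch only demonstrates that reloading replicas one at a time leaves at least one of them available to serve queries.

```python
class Replica:
    """One copy of the query-side indexer, holding a snapshot version."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.online = True


def serving(replicas):
    """Return the names of the replicas that can answer queries right now."""
    return [r.name for r in replicas if r.online]


def rolling_update(replicas, new_version):
    """Reload replicas one at a time and record who serves during each reload."""
    timeline = []
    for replica in replicas:
        replica.online = False              # take this replica down
        timeline.append(serving(replicas))  # the others keep serving
        replica.version = new_version       # load the new dump
        replica.online = True               # bring it back
    return timeline
```

With two replicas named r0 and r1, the returned timeline shows r1 serving while r0 reloads and then r0 serving while r1 reloads. With a single replica the timeline would contain an empty list, which is why the Query Flow needs more than one.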
File or folder | Contents |
---|---|
📂 data/ | Folder where the data files are stored |
📂 flows/ | Folder to store Flow configuration |
--- 📃 storage.yml | YAML file to configure Storage (Index) Flow |
--- 📃 query.yml | YAML file to configure Querying Flow |
🐍 app.py | Code file for the example |
Did you like this example and are you interested in building your own? For a detailed tutorial on how to build your Jina app, check out the How to Build Your First Jina App guide in our documentation.
If you have any issues following this guide, you can always get support from our Slack community.
- Slack channel - a communication platform for developers to discuss Jina.
- LinkedIn - get to know Jina AI as a company and find job opportunities.
- Twitter - follow us and interact with us using hashtag #JinaSearch.
- Company - know more about our company. We are fully committed to open-source!
Copyright (c) 2021 Jina AI Limited. All rights reserved.
Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.