A Prototype for a Wikidata Question-Answering System
This system allows users to query Wikidata using natural language questions. The responses contain links to sources. If Wikidata does not provide the information requested, the system refuses to answer.
The system is in an early proof-of-concept state.
To give it a try, use ➡️ this Google Colab Notebook or load AskWikidata_Quickstart.ipynb on your own infrastructure.
To answer questions based on Wikidata, the system uses retrieval-augmented generation (RAG). First, it transforms Wikidata items into text and generates embeddings for them. The user query is then embedded as well. Using nearest neighbor search, the most relevant Wikidata items are identified. A reranker model then selects only the best matches among these neighbors. Finally, these matches are incorporated into the LLM prompt so that the LLM can ground its answer in Wikidata knowledge.
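The retrieval steps described above can be illustrated with a small, dependency-free sketch. It is not the project's implementation: the bag-of-words `embed`, the exhaustive `retrieve`, the word-overlap `rerank`, and the prompt wording are all toy stand-ins chosen for illustration, and the chunks are hand-written sample text. The real system uses neural embedding and reranker models and an approximate nearest neighbor index instead.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a neural model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k):
    """Return the k chunks most similar to the query (exhaustive search, no index)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def rerank(query, candidates, k):
    """Toy reranker: score by the fraction of query words found in the candidate."""
    q_words = set(embed(query))

    def score(candidate):
        if not q_words:
            return 0.0
        return len(q_words & set(embed(candidate))) / len(q_words)

    return sorted(candidates, key=score, reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Place the selected chunks in front of the question (hypothetical wording)."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hand-written sample chunks standing in for text generated from Wikidata items.
chunks = [
    "Berlin head of government Kai Wegner",
    "Paris is the capital of France",
    "Hamburg head of government Peter Tschentscher",
]
query = "Who is the head of government of Berlin?"

neighbors = retrieve(query, chunks, k=2)  # wide candidate set
best = rerank(query, neighbors, k=1)      # narrow set that reaches the prompt
prompt = build_prompt(query, best)
print(prompt)
```

Retrieving a wider candidate set and then narrowing it with a second scoring pass mirrors the two-stage design described above: the first stage is cheap and approximate, the second is more selective.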
All models, including the LLM, can run on the local machine using pytorch and bitsandbytes quantization. For nearest neighbor search, an annoy index is used.
On Nix, the dev shell installs all required dependencies:
nix develop .
Alternatively, install the Python requirements using pip:
pip install -r requirements.txt
For faster execution, the results of some pre-computation steps are cached. In order to use those caches, unpack them:
bunzip2 --keep --force *.json.bz2
Generate text representations for Wikidata items. The list of items to use is currently hardcoded in text_representation.py.
python text_representation.py
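To make the idea of a text representation concrete, here is a minimal sketch of turning one item into plain sentences. The function `item_to_text`, its arguments, and the output format are hypothetical and only illustrate the concept; the actual format is whatever text_representation.py produces.

```python
def item_to_text(label, description, statements):
    """Render one item as plain sentences, one line per (property, value) pair.

    Hypothetical format for illustration only.
    """
    lines = [f"{label}: {description}."]
    for prop, value in statements:
        lines.append(f"{label} {prop}: {value}.")
    return "\n".join(lines)


# Hand-written sample data; a real run would load statements from Wikidata.
text = item_to_text(
    "Berlin",
    "capital city of Germany",
    [("country", "Germany"), ("instance of", "city")],
)
print(text)
```

Repeating the item label on every line keeps each sentence understandable on its own, which helps when the text is later split into chunks.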
The following Python code uses AskWikidata to answer a single question:
from askwikidata import AskWikidata
config = {
    "chunk_size": 1280,
    "chunk_overlap": 0,
    "index_trees": 1024,
    "retrieval_chunks": 16,
    "context_chunks": 5,
    "embedding_model_name": "BAAI/bge-small-en-v1.5",
    "reranker_model_name": "BAAI/bge-reranker-base",
    "qa_model_url": "Qwen/Qwen2.5-3B-Instruct",
}
askwikidata = AskWikidata(**config)
askwikidata.setup()
print(askwikidata.ask("Who is the current mayor of Berlin? And since when have they been serving?"))
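The `chunk_size` and `chunk_overlap` settings suggest that the text representations are split into pieces before embedding. The sketch below shows one plausible reading, a character-based sliding window. Both the helper `split_into_chunks` and the assumption that sizes are measured in characters are illustrative and may differ from what AskWikidata actually does.

```python
def split_into_chunks(text, chunk_size, chunk_overlap=0):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters. Assumption: sizes are
    counted in characters; the real project may count differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "x" * 3000
chunks = split_into_chunks(sample, chunk_size=1280, chunk_overlap=0)
print([len(c) for c in chunks])  # [1280, 1280, 440]
```

With an overlap of 0, as in the configuration above, the chunks partition the text exactly. A non-zero overlap repeats the tail of each chunk at the start of the next, which can help when a fact straddles a chunk boundary.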
A simple interactive read-eval-print loop (REPL) can be used to ask questions.
python repl.py
A script to evaluate the performance of different configurations is provided.
python eval.py
If you do not want to use a local LLM, AskWikidata can access the Hugging Face LLM API. Configure your Hugging Face API key in the HUGGINGFACE_API_KEY environment variable.
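As a rough illustration of how such a key is typically consumed, the sketch below reads HUGGINGFACE_API_KEY from the environment and builds a bearer-token header. The helper names `get_huggingface_api_key` and `auth_headers` are hypothetical and not part of AskWikidata, and the token value is a placeholder.

```python
import os


def get_huggingface_api_key(environ=os.environ):
    """Return the API key from the environment, failing early if it is missing."""
    key = environ.get("HUGGINGFACE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "Set the HUGGINGFACE_API_KEY environment variable to use the remote LLM."
        )
    return key


def auth_headers(api_key):
    """Build the bearer-token header for authenticated HTTP requests."""
    return {"Authorization": f"Bearer {api_key}"}


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"HUGGINGFACE_API_KEY": "hf_example_token"}  # placeholder value
headers = auth_headers(get_huggingface_api_key(demo_env))
print(sorted(headers))
```

Checking for the key at startup gives a clear error message up front, rather than an authentication failure on the first question.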
To execute the unit test suite, run:
python -m unittest
To get a coverage report, run:
coverage run -m unittest
coverage report --omit="test_*,/nix/*" --show-missing