An LLM-powered chatbot for natural language database queries with extensive observability
A web chatbot interface that lets you query a database with natural language questions through different interaction methods (RAG, TAG) and different LLM providers, with comprehensive observability and tracking built in.
- Multiple interaction methods (RAG, TAG)
- LLM provider selection (OpenAI, Claude)
- Intent classification (Details in Classifier README)
- Vector search with PGVector
- Langfuse Analytics
- Conversation memory (until browser refresh)
- Docker-based deployment
- Docker and Docker Compose
- Python 3.9+
- OpenAI API key
- Anthropic API key
- Langfuse account (optional)
- Clone the repository:
git clone https://github.com/garyzava/chat-to-database-chatbot.git
cd chat-to-database-chatbot
- Configure environment variables: Copy .env.example to .env and fill in your API keys and configurations.
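For illustration, a typical .env might contain entries like the ones below; treat the variable names and values as placeholders and use .env.example as the authoritative reference:

```
# Placeholder values only -- .env.example defines the exact variable names
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Optional: only needed when using Langfuse observability
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=http://localhost:3000
```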
- Build and start the Docker services (one-off):
make run
After the installation, simply run:
make up
- Or run the application in developer mode:
make dev
Developer mode runs the Streamlit app locally, while the databases still run in Docker.
- Shut down the application:
make down
When running in local mode (make dev), open a new terminal, make sure the virtual environment is activated, and go to the chat2dbchatbot directory:
cd chat2dbchatbot
Run the RAG utility:
python -m tools.rag "what is the track with the most revenue" --llm OpenAI --temperature 0.1
Or run the TAG utility:
python -m tools.tag "what is the track with the most revenue" --llm OpenAI --temperature 0.1
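If you want to drive these utilities from your own scripts, one option (a sketch, not part of the project) is to shell out to the same modules. The helper name and defaults below are purely illustrative:

```python
# Hypothetical helper (not part of the repo) that shells out to the CLI
# utilities shown above. Adjust paths and flags to match your setup.
import subprocess

def ask(question: str, method: str = "rag", llm: str = "OpenAI", temperature: float = 0.1) -> str:
    """Run tools.rag or tools.tag as a subprocess and return its stdout."""
    result = subprocess.run(
        ["python", "-m", f"tools.{method}", question,
         "--llm", llm, "--temperature", str(temperature)],
        cwd="chat2dbchatbot",  # same directory as the manual steps above
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(ask("what is the track with the most revenue"))
```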
- Go to http://localhost:8501 for the main chatbot interface
- Select your preferred interaction method (RAG, TAG)
- Choose an LLM provider (OpenAI or Claude)
- Start asking questions about your database
- Go to http://localhost:3000 for the Langfuse interface (when not running in dev mode)
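The chatbot records traces automatically. If you want to log your own test events from a script, a minimal sketch using the v2-style Langfuse Python client is shown below; the environment variable names mirror the .env placeholders above, and the trace contents are made up:

```python
# Illustrative only: logging a trace with the v2-style Langfuse Python SDK.
# The chatbot already records traces itself; this just shows the pattern.
import os
from langfuse import Langfuse

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_HOST", "http://localhost:3000"),
)

# Record a simple trace for one question/answer pair.
langfuse.trace(
    name="manual-test",
    input={"question": "what is the track with the most revenue"},
    output={"answer": "..."},
)
langfuse.flush()  # make sure the event is sent before the script exits
```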
- Frontend: Streamlit
- Document Parsing: Docling
- Vector Database: PostgreSQL with pgvector
- Observability: Langfuse
- LLM Framework: LlamaIndex
- Container Orchestration: Docker Compose
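For orientation only, the sketch below shows how LlamaIndex's pgvector integration is typically queried once documents have been ingested; the connection details, table name, and embedding dimension are assumptions, not the project's actual configuration:

```python
# Illustrative only: querying an existing pgvector-backed index with LlamaIndex.
# Assumes embeddings were already ingested and OPENAI_API_KEY is set for the
# default embedding/LLM models; all connection values are placeholders.
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

vector_store = PGVectorStore.from_params(
    database="chatdb",
    host="localhost",
    port="5432",
    user="postgres",
    password="postgres",
    table_name="db_schema_docs",
    embed_dim=1536,  # must match the embedding model used at ingestion time
)

index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine()
print(query_engine.query("Which track generated the most revenue?"))
```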
- RAG (Retrieval-Augmented Generation): Paper by Facebook AI
- TAG (Table-Augmented Generation): Paper by UC Berkeley & Stanford University
This project uses the Chinook database, a media store database, for development and testing purposes. However, it can be easily adapted for any enterprise or domain-specific use case.
- Chinook Database:
- Ownership: Maintained by lerocha
- Licenses and Use: The Chinook Database allows use, distribution, and modification without any warranty of any kind.
- Access: Available on GitHub at https://github.com/lerocha/chinook-database
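To give a sense of the SQL that a question like "what is the track with the most revenue" resolves to on Chinook, the snippet below runs an illustrative query over the Track and InvoiceLine tables; the connection parameters are placeholders, and identifier casing depends on which Chinook load script you used:

```python
# Illustrative only: the kind of SQL such a question maps to on Chinook.
# Connection parameters are placeholders; identifier quoting/casing may differ
# depending on which Chinook PostgreSQL script was used to load the data.
import psycopg2

SQL = """
SELECT t."Name", SUM(il."UnitPrice" * il."Quantity") AS revenue
FROM "Track" t
JOIN "InvoiceLine" il ON il."TrackId" = t."TrackId"
GROUP BY t."TrackId", t."Name"
ORDER BY revenue DESC
LIMIT 1;
"""

with psycopg2.connect(host="localhost", dbname="chinook",
                      user="postgres", password="postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(SQL)
        print(cur.fetchone())
```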
The intent classifier uses data from the following datasets; access to the data is subject to their respective terms:
- GretelAI Synthetic Text-to-SQL:
- Ownership: Gretel.ai
- Licenses and Use: Licensed under the Apache License 2.0, permitting use, distribution, and modification with proper attribution.
- Access: Available on Hugging Face at https://huggingface.co/datasets/gretelai/synthetic_text_to_sql
- Factoid WebQuestions Dataset:
- Ownership: WebQuestions (http://nlp.stanford.edu/software/sempre/ - Berant et al., 2013, CC-BY)
- Licenses and Use: Distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, allowing sharing and adaptation with appropriate credit.
- Access: Available on GitHub at https://github.com/brmson/dataset-factoid-webquestions
- The evaluation framework is located under the eval sub-folder
- See the Evaluation Framework README for details