The goal of this project is to develop an LLM-based method for translating natural language questions into executable SQL queries. The method's accuracy is evaluated on the validation split of the Spider dataset by comparing predictions against the expected SQL query, the execution results, and the execution time.
We implemented the following prompt structures for zero-shot (no examples) and few-shot (several examples) learning:
- Example organizations: Full Text, SQL-only, and DAIL-SQL.
- Example selection strategies: Random, Question Similarity (QTS), Masked Question Similarity (MQS), and Query Similarity Selection (QSS).
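Few-shot examples are chosen from the training set before being placed into the prompt. As a rough illustration of how a similarity-based strategy such as QTS can work, here is a hypothetical sketch using scikit-learn (which is in the dependency list); `select_examples` and the data layout are our own names, not the project's API:

```python
# Hypothetical sketch of Question Similarity (QTS) selection: pick the k
# training examples whose questions are closest to the target question
# under cosine similarity over TF-IDF vectors. MQS works the same way but
# masks database-specific tokens in the questions before comparing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_examples(question, pool, k=4):
    questions = [ex["question"] for ex in pool]
    vectors = TfidfVectorizer().fit_transform(questions + [question])
    scores = cosine_similarity(vectors[-1], vectors[:-1])[0]
    best = scores.argsort()[::-1][:k]
    return [pool[i] for i in best]
```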
The evaluation was done using exact-set-match accuracy (EM), execution accuracy (EX) and execution time.
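To make the EX metric concrete, here is a minimal sketch of how execution accuracy can be checked for a single example: the predicted query counts as correct if it returns the same rows as the gold query on the target SQLite database. The function name and the unordered-comparison simplification are ours, not necessarily the project's exact logic:

```python
import sqlite3
from collections import Counter

def execution_match(db_path, predicted_sql, gold_sql):
    """Return True if both queries yield the same rows on the database."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    finally:
        conn.close()
    # Compare as multisets: row order is ignored unless the gold query
    # enforces it, a common simplification in EX-style evaluation.
    return Counter(pred_rows) == Counter(gold_rows)
```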
In addition, we applied LoRA fine-tuning to the LLM to further improve accuracy.
We ran experiments on Ministral-8B-Instruct-2410 and Meta-Llama-3.1-8B-Instruct. The results are shown in the plot below:
LoRA fine-tuning of Meta-Llama-3.1-8B-Instruct improved performance for the majority of prompt organizations. The plot below compares the execution accuracy of the base Llama-3.1-8B model and its LoRA fine-tuned version.
To access Hugging Face models, create a `.env` file in the root folder of this project and paste your Hugging Face access token there:
```
# .env
HF_TOKEN={your access token}
```
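The scripts can then pick up the token at runtime, presumably via python-dotenv (listed in the dependencies below); the exact loading code in the project may differ:

```python
import os
from dotenv import load_dotenv

load_dotenv()                      # reads .env from the working directory
hf_token = os.environ["HF_TOKEN"]  # picked up by Hugging Face downloads
```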
Install the necessary dependencies:

```bash
pip install transformers bitsandbytes accelerate datasets outlines scikit-learn python-dotenv nltk gdown peft
```
To run the evaluation script, you first need to generate the prediction files. To do that, run:
```bash
python main.py predict
# or
python main.py predict --params_path params_llama.yaml
```
The files will be generated in the `results` folder.
The evaluation script is based on the one from https://github.com/taoyds/spider/tree/master.
Before running the script, please make sure to download the databases from the test suite and place them in the root directory of this project:
```bash
gdown 1mkCx2GOFIqNesD4y8TDAO1yX1QZORP5w
unzip testsuitedatabases.zip -d text2sql
```
Then run the evaluation:

```bash
python main.py evaluate
# or
python main.py evaluate --params_path params_llama.yaml
```
To fine-tune the LLM with QLoRA, add a `fine_tune` block to your `params.yaml` and run:
```bash
python main.py fine-tune
# or
python main.py fine-tune --params_path params.yaml
```
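For reference, this is roughly what a QLoRA setup implies under the hood, sketched with peft and bitsandbytes (both in the dependency list). The rank, alpha, and target modules below are illustrative placeholders, not the project's actual hyperparameters:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantized base weights: the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)

# Low-rank adapters on the attention projections; only these are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 8B weights
```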