Commit

1. refine the english readme 2. change requirementTXT to poetry 3. refine code style with black 4. ignore vscode file
qidanrui committed Nov 12, 2023
1 parent 6023459 commit b1b88ca
Showing 10 changed files with 4,192 additions and 128 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -100,12 +100,15 @@ target/
# Jupyter Notebook
.ipynb_checkpoints
# vscode ignore
# .vscode/
.vscode/

# IPython
profile_default/
ipython_config.py

# Poetry
.dist/

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
60 changes: 0 additions & 60 deletions .vscode/launch.json

This file was deleted.

3 changes: 0 additions & 3 deletions .vscode/settings.json

This file was deleted.

49 changes: 29 additions & 20 deletions README.md
@@ -9,9 +9,6 @@
<a href="https://github.com/eosphoros-ai/DB-GPT-Hub">
<img alt="forks" src="https://img.shields.io/github/forks/eosphoros-ai/db-gpt-hub?style=social" />
</a>
<a href="https://opensource.org/licenses/MIT">
<img alt="License: MIT" src="https://img.shields.io/badge/License-MIT-yellow.svg" />
</a>
<a href="https://opensource.org/licenses/MIT">
<img alt="License: MIT" src="https://img.shields.io/badge/License-MIT-yellow.svg" />
</a>
@@ -53,12 +50,11 @@

## 1. What is DB-GPT-Hub

DB-GPT-Hub is an experimental project utilizing LLMs (Large Language Models) to achieve Text-to-SQL parsing. The project primarily encompasses data collection, data preprocessing, model selection and building, and fine-tuning of weights. Through this series of processes, we aim to enhance Text-to-SQL capabilities while reducing the model training costs, allowing more developers to contribute to the improvement of Text-to-SQL accuracy. Our ultimate goal is to realize automated question-answering capabilities based on databases, enabling users to execute complex database queries through natural language descriptions.
DB-GPT-Hub is an experimental project that leverages Large Language Models (LLMs) to achieve Text-to-SQL parsing. The project encompasses various stages, including data collection, data preprocessing, model selection and construction, and fine-tuning of model weights. Through these processes, our aim is to enhance Text-to-SQL capabilities while reducing model training costs, thus enabling more developers to contribute to improving Text-to-SQL accuracy. Our ultimate goal is to realize automated question-answering capabilities based on databases, allowing users to execute complex database queries using natural language descriptions.

So far, we have successfully integrated multiple large models and established a complete workflow, including data processing, model SFT (Supervised Fine-Tuning) training, prediction output, and evaluation. The code is readily reusable within this project.
To date, we have successfully integrated multiple large models and established a comprehensive workflow that includes data processing, Supervised Fine-Tuning (SFT) model training, prediction output, and evaluation. The code developed for this project is easily reusable within the project itself.


As of 20231010, we used this project to fine-tune the open source 13B size model, combined with more relevant data, and under the zero-shot prompt, Spider-based [test-suite](https://github.com/taoyds/test-suite-sql-eval), the execution accuracy of the database (size-1.27G) can reach **0.764**, and the execution accuracy of the database (size-95M) pointed to by the Spider official [website](https://yale-lily.github.io/spider) is 0.825.
As of October 10, 2023, we have used this project to fine-tune the open-source 13B-sized model, incorporating more relevant data. Under zero-shot prompts and utilizing [the Spider-based test-suite](https://github.com/taoyds/test-suite-sql-eval), we have achieved an execution accuracy rate of 0.764 for a database with a size of 1.27G. Additionally, the execution accuracy for the database pointed to by [the Spider official website](https://yale-lily.github.io/spider), with a size of 95M, stands at 0.825.


## 2. Fine-tuning Text-to-SQL
@@ -78,7 +74,7 @@ Other text2sql datasets available:
- [BIRD-SQL:](https://bird-bench.github.io/) A large-scale cross-domain text-to-SQL benchmark in English, with a particular focus on large database content. The dataset contains 12,751 text-to-SQL data pairs and 95 databases with a total size of 33.4 GB across 37 occupational domains. The BIRD-SQL dataset bridges the gap between text-to-SQL research and real-world applications by exploring three additional challenges, namely dealing with large and messy database values, external knowledge inference and optimising SQL execution efficiency.
- [CoSQL:](https://yale-lily.github.io/cosql) A corpus for building cross-domain conversational text-to-SQL systems. It is a conversational version of the Spider and SParC tasks. CoSQL consists of 30k+ rounds and 10k+ annotated SQL queries from Wizard-of-Oz's collection of 3k conversations querying 200 complex databases across 138 domains. Each conversation simulates a realistic DB query scenario in which a staff member explores the database as a user and a SQL expert uses SQL to retrieve answers, clarify ambiguous questions, or otherwise inform.

- Following the processing template of [NSQL](https://github.com/NumbersStationAI/NSQL), the dataset underwent basic processing, yielding approximately [20K dataset](https://huggingface.co/datasets/Healthy13/Text2SQL/tree/main)
- Following the processing template of [NSQL](https://github.com/NumbersStationAI/NSQL), the dataset underwent basic processing, yielding approximately [20W dataset](https://huggingface.co/datasets/Healthy13/Text2SQL/tree/main)



@@ -115,8 +111,7 @@ git clone https://github.com/eosphoros-ai/DB-GPT-Hub.git
cd DB-GPT-Hub
conda create -n dbgpt_hub python=3.10
conda activate dbgpt_hub
pip install -r requirements.txt
mkdir model
poetry install
```

### 3.2. Data preparation
@@ -128,7 +123,7 @@ Download the [Spider dataset]((https://drive.google.com/uc?export=download&id=1T
For the data preprocessing part, simply **run the following script** :
```bash
## generate train and dev(eval) data
sh dbgpt_hub/scripts/gen_train_eval_data.sh
poetry run sh dbgpt_hub/scripts/gen_train_eval_data.sh
```

In the directory `dbgpt_hub/data/`, you will find the newly generated training file example_text2sql_train.json and testing file example_text2sql_dev.json, containing 8659 and 1034 entries respectively. To use this data in subsequent fine-tuning, set the `file_name` parameter in dbgpt_hub/data/dataset_info.json to the file name of the training set, such as example_text2sql_train.json.
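A quick sanity check on the generated files can be sketched as follows. This is a minimal sketch, assuming each generated file is a single JSON array of records; the helper name `count_entries` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def count_entries(path):
    """Return the number of top-level records in a JSON file holding a list."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the preprocessing script writes one JSON array per file.
    if not isinstance(data, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return len(data)


if __name__ == "__main__":
    data_dir = Path("dbgpt_hub/data")
    for name in ("example_text2sql_train.json", "example_text2sql_dev.json"):
        file_path = data_dir / name
        if file_path.exists():
            print(name, count_entries(file_path))
        else:
            print(name, "not found; run the preprocessing script first")
```

If the counts differ from the 8659 and 1034 entries quoted above, the Spider download or the preprocessing step is the first place to look.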
@@ -152,7 +147,7 @@ The model fine-tuning supports both LoRA and QLoRA methods. We can run the follo
Run the command:

```bash
sh dbgpt_hub/scripts/train_sft.sh
poetry run sh dbgpt_hub/scripts/train_sft.sh
```

After fine-tuning, the model weights will be saved by default in the adapter folder, specifically in the dbgpt_hub/output/adapter directory.
@@ -210,7 +205,7 @@ In the script, during fine-tuning, different models correspond to key parameters
The directory ./dbgpt_hub/output/pred/ under the project root is the default output location for model predictions (create it with `mkdir` if it does not exist).

```bash
sh ./dbgpt_hub/scripts/predict_sft.sh
poetry run sh ./dbgpt_hub/scripts/predict_sft.sh
```

By default, the script passes the `--quantization_bit` parameter and predicts using QLoRA. Removing the parameter switches to the LoRA prediction method.
@@ -225,7 +220,7 @@ You can find the second corresponding model weights from Huggingface [hg-eospho
If you need to merge the weights of the trained base model and the fine-tuned Peft module to export a complete model, execute the following model export script:

```bash
sh ./dbgpt_hub/scripts/export_merge.sh
poetry run sh ./dbgpt_hub/scripts/export_merge.sh
```

Be sure to replace the parameter path values in the script with the paths corresponding to your project.
@@ -234,7 +229,7 @@ Be sure to replace the parameter path values in the script with the paths corres
To evaluate model performance on a dataset (the default is the Spider dev dataset), run the following command:
```bash
python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file
poetry run python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file
```
You can find our latest evaluation results and some of our experiment results [here](docs/eval_llm_result.md).
**Note**: By default, the code points to the 95M database downloaded from the [Spider official website](https://yale-lily.github.io/spider). To use the Spider database (size 1.27G) from [test-suite](https://github.com/taoyds/test-suite-sql-eval), first download the database from that link to a directory of your choice, then run the evaluation command above with an additional argument such as `--db Your_download_db_path`.
@@ -244,8 +239,9 @@ You can find the results of our latest review and part of experiment results [he
We will divide the whole process into three stages:

* Stage 1:
* Set up the basic framework, enabling end-to-end workflow from data processing, model SFT training, prediction output to evaluation based on multiple large models. As of 20230804, the entire pipeline has been established.
now,we has supported as follows:
* Set up the foundational framework, enabling an end-to-end workflow that encompasses data processing, model SFT (Supervised Fine-Tuning) training, prediction output, and evaluation using multiple large language models (LLMs). As of August 4th, 2023, the entire pipeline has been successfully established.

Currently, we offer support for the following features:
- [x] CodeLlama
- [x] Baichuan2
- [x] LLaMa/LLaMa2
@@ -264,11 +260,24 @@ The whole process we will divide into three phases:
- [ ] Targeted optimization and improvement of business scenarios and Chinese effects
- [ ] Optimized based on more papers, such as RESDSQL and others. Combined with our community's sibling project [Awesome-Text2SQL](https://github.com/eosphoros-ai/Awesome-Text2SQL) for further enhancements.

**If our work is even a little help to you, please give us a star to let us know ,which would be more motivation for us to release more related work.**
**If our work has provided even a small measure of assistance to you, please consider giving us a star. Your feedback and support serve as motivation for us to continue releasing more related work and improving our efforts. Thank you!**

## 5. Contributions

We welcome more folks to participate and provide feedback in areas like datasets, model fine-tuning, performance evaluation, paper recommendations, code reproduction, etc. Feel free to open issues or PRs and we'll actively respond.Before submitting the code, please format it using the black style in command `black .`.
We warmly invite more individuals to join us and actively engage in various aspects of our project, such as datasets, model fine-tuning, performance evaluation, paper recommendations, and code reproduction. Please don't hesitate to open issues or pull requests (PRs), and we will be proactive in responding to your contributions.

Before submitting your code, please ensure that it is formatted according to the black style by using the following command:
```
poetry run black dbgpt_hub
```

If you have time for more detailed type checking and style checking of your code, please use the following commands:
```
poetry run pyright dbgpt_hub
poetry run pylint dbgpt_hub
```

If you have any questions or need further assistance, don't hesitate to reach out. We appreciate your involvement!

## 6. Acknowledgements

@@ -296,7 +305,7 @@ Thanks for all contributors !
The MIT License (MIT)

## 8. Contact Information
We are working together as a community, if you have any ideas about our community work , feel free to contact us. And you're interested in an in-depth experiment and optimization of the DB-GPT-Hub subproject, you can reach out to 'wangzai' in the WeChat group, we are welcome to make it better togther.
We are collaborating as a community, and if you have any ideas regarding our community work, please don't hesitate to get in touch with us. If you're interested in delving into an in-depth experiment and optimizing the DB-GPT-Hub subproject, you can reach out to 'wangzai' within the WeChat group. We wholeheartedly welcome your contributions to making it even better together!
[![](https://dcbadge.vercel.app/api/server/nASQyBjvY?compact=true&style=flat)](https://discord.gg/nASQyBjvY)

<p align="center">
31 changes: 22 additions & 9 deletions dbgpt_hub/data_process/sql_data_process.py
@@ -21,7 +21,9 @@ def __init__(self, train_file=None, dev_file=None) -> None:
self.train_file = train_file
self.dev_file = dev_file

def decode_json_file(self, data_file_list, table_file, db_id_name, is_multiple_turn=False):
def decode_json_file(
self, data_file_list, table_file, db_id_name, is_multiple_turn=False
):
"""
TO DO:
1. Move the relevant prompts into the config
@@ -87,22 +89,31 @@ def decode_json_file(self, data_file_list, table_file, db_id_name, is_multiple_t
res = []
for data in tqdm(datas):
if data[db_id_name] in db_dict.keys():
if is_multiple_turn: #multi-turn
if is_multiple_turn: # multi-turn
history = []
for interaction in data["interaction"]:
input = {
"db_id": data[db_id_name],
"instruction": INSTRUCTION_PROMPT.format(db_dict[data[db_id_name]]),
"instruction": INSTRUCTION_PROMPT.format(
db_dict[data[db_id_name]]
),
"input": INPUT_PROMPT.format(interaction["utterance"]),
"output": interaction["query"],
"history": history,
}
res.append(input)
history.append((INPUT_PROMPT.format(interaction["utterance"]), interaction["query"]))
else: # single-turn
history.append(
(
INPUT_PROMPT.format(interaction["utterance"]),
interaction["query"],
)
)
else: # single-turn
input = {
"db_id": data[db_id_name],
"instruction": INSTRUCTION_PROMPT.format(db_dict[data[db_id_name]]),
"instruction": INSTRUCTION_PROMPT.format(
db_dict[data[db_id_name]]
),
"input": INPUT_PROMPT.format(data["question"]),
"output": data["query"],
"history": [],
@@ -125,7 +136,7 @@ def create_sft_raw_data(self):
DATA_PATH, data_info["data_source"], data_info["tables_file"]
),
db_id_name=data_info["db_id_name"],
is_multiple_turn=data_info['is_multiple_turn']
is_multiple_turn=data_info["is_multiple_turn"],
)
)

@@ -140,7 +151,7 @@ def create_sft_raw_data(self):
DATA_PATH, data_info["data_source"], data_info["tables_file"]
),
db_id_name=data_info["db_id_name"],
is_multiple_turn=data_info['is_multiple_turn']
is_multiple_turn=data_info["is_multiple_turn"],
)
)
with open(self.train_file, "w", encoding="utf-8") as s:
@@ -152,5 +163,7 @@ def create_sft_raw_data(self):
if __name__ == "__main__":
all_in_one_train_file = os.path.join(DATA_PATH, "example_text2sql_train.json")
all_in_one_dev_file = os.path.join(DATA_PATH, "example_text2sql_dev.json")
precess = ProcessSqlData(train_file=all_in_one_train_file, dev_file=all_in_one_dev_file)
precess = ProcessSqlData(
train_file=all_in_one_train_file, dev_file=all_in_one_dev_file
)
precess.create_sft_raw_data()
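The multi-turn branch of `decode_json_file` can be sketched in isolation as below. This is a simplified illustration, not the repository's code: the prompt templates are placeholder strings, and `build_multi_turn_records` is a hypothetical helper. The sketch also stores a copy of `history` in each record (`list(history)`), on the assumption that each record should only see the turns that precede it. The code in the diff places the same list object in every record, so later `history.append(...)` calls are visible through earlier records as well.

```python
# Placeholder templates (assumption); the real prompts live in the repository.
INSTRUCTION_PROMPT = "Schema: {}"
INPUT_PROMPT = "Question: {}"


def build_multi_turn_records(db_id, schema, interactions):
    """Build one SFT record per turn, each carrying only the preceding turns."""
    records = []
    history = []
    for turn in interactions:
        question = INPUT_PROMPT.format(turn["utterance"])
        records.append(
            {
                "db_id": db_id,
                "instruction": INSTRUCTION_PROMPT.format(schema),
                "input": question,
                "output": turn["query"],
                # Copy the list so later appends do not change this record.
                "history": list(history),
            }
        )
        history.append((question, turn["query"]))
    return records


if __name__ == "__main__":
    demo = build_multi_turn_records(
        "concert_singer",
        "singer(id, name)",
        [
            {"utterance": "How many singers?", "query": "SELECT count(*) FROM singer"},
            {"utterance": "List their names.", "query": "SELECT name FROM singer"},
        ],
    )
    for record in demo:
        print(len(record["history"]), record["input"])
```

Whether the shared-list behaviour in the repository is intended is not stated in the commit; the copy here is only one possible reading.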
3 changes: 1 addition & 2 deletions dbgpt_hub/llm_base/config_parser.py
@@ -93,8 +93,7 @@ def parse_infer_args(


def get_train_args(
args: Optional[Dict[str, Any]] = None,
data_args_init: bool = True
args: Optional[Dict[str, Any]] = None, data_args_init: bool = True
) -> Tuple[
ModelArguments,
DataArguments,
4 changes: 3 additions & 1 deletion dbgpt_hub/llm_base/model_trainer.py
@@ -401,7 +401,9 @@ def plot_loss(
def export_model(
args: Optional[Dict[str, Any]] = None, max_shard_size: Optional[str] = "10GB"
):
model_args, _, training_args, finetuning_args, _ = get_train_args(args, data_args_init=False)
model_args, _, training_args, finetuning_args, _ = get_train_args(
args, data_args_init=False
)
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args)
model.save_pretrained(training_args.output_dir, max_shard_size=max_shard_size)
try: