Refactor data generation and related documentation #81

Closed
wants to merge 8 commits into from
67 changes: 25 additions & 42 deletions README.md
@@ -31,14 +31,14 @@ Large Language Models (LLMs) have achieved impressive results in existing benchm

The following publicly available text-to-sql datasets are used for this project:

- [SPIDER](https://yale-lily.github.io/spider): A complex cross-domain text2sql dataset containing 10,181 natural language queries and 5,693 SQL queries distributed across 200 separate databases, covering 138 different domains. [download link](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ)
- [WikiSQL:](https://github.com/salesforce/WikiSQL) A large semantic parsing dataset consisting of 80,654 natural language questions and SQL annotations over 24,241 tables. Each query in WikiSQL is limited to a single table and does not include complex operations such as sorting, grouping, or subqueries.
- [SPIDER](https://yale-lily.github.io/spider): A complex cross-domain text2sql dataset containing 10,181 natural language queries and 5,693 SQL queries distributed across 200 separate databases, covering 138 different domains.
- [CHASE](https://xjtu-intsoft.github.io/chase/): A cross-domain, multi-turn interactive Chinese text2sql dataset containing 5,459 multi-turn question sequences made up of 17,940 <query, SQL> pairs across 280 databases from different domains.
- [BIRD-SQL:](https://bird-bench.github.io/) A large-scale cross-domain text-to-SQL benchmark in English, with a particular focus on large database content. The dataset contains 12,751 text-to-SQL data pairs and 95 databases with a total size of 33.4 GB across 37 occupational domains. The BIRD-SQL dataset bridges the gap between text-to-SQL research and real-world applications by exploring three additional challenges, namely dealing with large and messy database values, external knowledge inference and optimising SQL execution efficiency.
- [CoSQL:](https://yale-lily.github.io/cosql) A corpus for building cross-domain conversational text-to-SQL systems. It is a conversational version of the Spider and SParC tasks. CoSQL consists of 30k+ turns and 10k+ annotated SQL queries, obtained from a Wizard-of-Oz collection of 3k conversations querying 200 complex databases across 138 domains. Each conversation simulates a realistic DB query scenario in which a staff member explores the database as a user and a SQL expert uses SQL to retrieve answers, clarify ambiguous questions, or otherwise inform the user.


By default, downloaded data is placed under the top-level directory data, for example data/spider.


### 2.2. Model

@@ -134,19 +134,16 @@ This data is then expressed in natural language, e.g:
"output": "select count(*) from head where age > 56"}
```

You can download the Spider data from this [link](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ). By default, unzip the data and place it under the directory dbgpt_hub/data, so that the path is dbgpt_hub/data/spider.

The code implementation of the above data pre-processing section is as follows:

```bash
## Generate train data
python dbgpt_hub/utils/sql_data_process.py

## Generate dev data
python dbgpt_hub/utils/sql_data_process.py \
--data_filepaths data/spider/dev.json \
--output_file dev_sql.json \
## Generate train data and dev data
sh dbgpt_hub/scripts/train_eval_data_gen.sh

```
If you want to skip this step, you can [download](https://drive.google.com/drive/folders/1MkNSJgJn9mTH5TTjdn06gf5N3ghG1wnn?usp=drive_link) the dataset we have already processed and place it under the project folder.
After the script runs, the dbgpt_hub/data directory contains the newly generated training file example_text2sql_train.json and the test file example_text2sql_dev.json, with 8,659 and 1,034 records respectively.
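
As a quick sanity check on the generated files, the record counts quoted above can be verified with a few lines of Python. This is only a sketch: it assumes each generated file is a single JSON array of records, and the helper names `count_records` and `check_generated_data` are hypothetical, not part of the repository.

```python
import json
from pathlib import Path

# Counts quoted in the README text for the Spider-derived files.
EXPECTED_COUNTS = {
    "example_text2sql_train.json": 8659,
    "example_text2sql_dev.json": 1034,
}


def count_records(path):
    """Return the number of top-level records in a JSON array file."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return len(data)


def check_generated_data(data_dir, expected):
    """Return {file_name: (actual, expected)} for every file whose count differs."""
    mismatches = {}
    for name, want in expected.items():
        got = count_records(Path(data_dir) / name)
        if got != want:
            mismatches[name] = (got, want)
    return mismatches
```

Calling `check_generated_data("dbgpt_hub/data", EXPECTED_COUNTS)` after running the generation script should return an empty dict if the counts match the ones stated above.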

When fine-tuning the model, we also customize the prompt dict to optimize the input:

@@ -170,44 +167,24 @@ SQL_PROMPT_DICT = {

### 3.3. Model fine-tuning

Model fine-tuning uses the QLoRA/LoRA method, where we can run the following command to fine-tune the model:
Model fine-tuning supports both the QLoRA and LoRA methods. The script passes the --quantization_bit parameter by default, which selects QLoRA fine-tuning; to switch to LoRA, simply remove that parameter from the script.
Run the command:

```bash
python train_qlora.py --model_name_or_path <path_or_name>
sh dbgpt_hub/scripts/train_sft.sh
```
The fine-tuned model weights are saved under the adapter folder by default. The full training script is in scripts/qlora/qlora.sh. For multi-GPU runs, scripts/spider_qlora_finetune.sh is based on QLoRA by default, so it is recommended to specify the GPU IDs up front, e.g. change `python src/train/train_qlora.py` to `CUDA_VISIBLE_DEVICES=0,1,2,3 python src/train/train_qlora.py`.

```bash
python train_lora.py --model_name_or_path <path_or_name>
```
The full training script is in scripts/lora/.

If you want to merge the finetuning weights into the base model, you can execute the following command:

```bash
python dbgpt_hub/utils/merge_peft_adapters.py --peft_model_path Your_adapter_model
```
After fine-tuning, the model weights will be saved by default in the adapter folder, specifically in the dbgpt_hub/output/adapter directory.

### 3.4. Model Predict

Create the ./data/out_pred/ folder under the project directory. This is the default output location.
The folder ./dbgpt_hub/output/pred/ under the project directory is the default output location for model predictions.

- Prediction just on base model
```bash
sh scripts/no_peft/get_predict_no_peft_llama2_13b_hf.sh
sh ./dbgpt_hub/scripts/predict_sft.sh
```

- Prediction based on LoRA
Run the following script:
```bash
sh scripts/lora/get_predict_lora.sh
```

- Prediction based on QLoRA
Run the following script:
```bash
sh scripts/qlora/get_predict_qlora.sh
```
The script passes --quantization_bit by default, so prediction uses QLoRA; removing the parameter switches to LoRA prediction.

### 3.5 Model Weights
You can find weights from huggingface. [hg-eosphoros-ai
@@ -228,23 +205,29 @@ The whole process we will divide into three phases:
* Stage 1:
* Set up the basic framework, enabling end-to-end workflow from data processing, model SFT training, prediction output to evaluation based on multiple large models. As of 20230804, the entire pipeline has been established.
We now support the following models:
- [x] CodeLlama
- [x] Baichuan2
- [x] LLaMa/LLaMa2
- [x] Falcon
- [x] CodeLlama
- [x] Qwen
- [x] XVERSE
- [x] ChatGLM2
- [x] internlm

* Stage 2:
* Optimize model performance, support fine-tuning more different models in various ways.
* Optimize `prompts`
* Release evaluation results, optimize `DB-GPT-SFT` models
* Stage 3:
* Based on more papers, conduct optimizations, such as `RESDSQL`, etc.
* Optimize based on more papers, such as RESDSQL, and make further enhancements together with our community's sibling project [Awesome-Text2SQL](https://github.com/eosphoros-ai/Awesome-Text2SQL).

## 5. Contributions

We welcome more folks to participate and provide feedback in areas like datasets, model fine-tuning, performance evaluation, paper recommendations, code reproduction, etc. Feel free to open issues or PRs and we'll actively respond.
We welcome more folks to participate and provide feedback in areas like datasets, model fine-tuning, performance evaluation, paper recommendations, code reproduction, etc. Feel free to open issues or PRs and we'll actively respond. Before submitting code, please format it with black.

## 6. Acknowledgements

Thanks to the following open source projects
Our work builds primarily on numerous open-source contributions. Thanks to the following open-source projects:

* [Spider](https://github.com/ElementAI/spider)
* [CoSQL](https://yale-lily.github.io/cosql)
@@ -257,4 +240,4 @@ Thanks to the following open source projects
* [WizardLM](https://github.com/nlpxucan/WizardLM)
* [text-to-sql-wizardcoder](https://github.com/cuplv/text-to-sql-wizardcoder)
* [test-suite-sql-eval](https://github.com/taoyds/test-suite-sql-eval)

* [LLaMa-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)
82 changes: 31 additions & 51 deletions README.zh.md
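@@ -30,14 +30,13 @@ DB-GPT-Hub is an experimental project that uses LLMs for Text-to-SQL parsing, mainly
### 2.1. Datasets

This project mainly uses the following public text2sql datasets:

- [Spider](https://yale-lily.github.io/spider): A complex cross-domain text2sql dataset containing 10,181 natural language questions and 5,693 SQL queries distributed across 200 separate databases, covering 138 different domains. [download link](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ)
- [WikiSQL:](https://github.com/salesforce/WikiSQL) A large semantic parsing dataset consisting of 80,654 natural language statements and SQL annotations over 24,241 tables. Each question in WikiSQL is limited to a single table and does not include complex operations such as sorting, grouping, or subqueries.
- [Spider](https://yale-lily.github.io/spider): A complex cross-domain text2sql dataset containing 10,181 natural language questions and 5,693 SQL queries distributed across 200 separate databases, covering 138 different domains.
- [CHASE](https://xjtu-intsoft.github.io/chase/): A cross-domain, multi-turn interactive Chinese text2sql dataset containing 5,459 multi-turn question sequences, 17,940 <query, SQL> pairs in total, involving 280 databases from different domains.
- [BIRD-SQL:](https://bird-bench.github.io/) A large-scale cross-domain text-to-SQL benchmark in English, with a particular focus on large database content. The dataset contains 12,751 text-to-SQL data pairs and 95 databases with a total size of 33.4 GB across 37 occupational domains. The BIRD-SQL dataset narrows the gap between text-to-SQL research and real-world applications by exploring three additional challenges: handling large and messy database values, external knowledge inference, and optimizing SQL execution efficiency.
- [CoSQL:](https://yale-lily.github.io/cosql) A corpus for building cross-domain conversational text-to-SQL systems. It is a conversational version of the Spider and SParC tasks. CoSQL consists of 30k+ turns and 10k+ annotated SQL queries, obtained from a Wizard-of-Oz collection of 3k conversations querying 200 complex databases across 138 domains. Each conversation simulates a realistic DB query scenario in which a staff member explores the database as a user and a SQL expert uses SQL to retrieve answers, clarify ambiguous questions, or otherwise inform the user.

By default, downloaded data is placed under the top-level directory data, for example data/spider.


### 2.2. Base model

@@ -134,18 +133,14 @@ DB-GPT-HUB prepares data with an information-matching generation method, i.e. combining table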

```

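The code implementation of the data pre-processing described above is as follows:
Download the Spider dataset from the [download link](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ). By default, after downloading and unzipping, place the data under the directory dbgpt_hub/data, so that the path is dbgpt_hub/data/spider.

The data pre-processing is implemented as follows:
```bash
## Generate train data
python dbgpt_hub/utils/sql_data_process.py

## Generate dev data
python dbgpt_hub/utils/sql_data_process.py \
--data_filepaths data/spider/dev.json \
--output_file dev_sql.json \
## Generate train data and dev data
sh dbgpt_hub/scripts/train_eval_data_gen.sh
```
If you want to skip this step, you can [download](https://drive.google.com/drive/folders/1MkNSJgJn9mTH5TTjdn06gf5N3ghG1wnn?usp=drive_link) the dataset we have already processed and place it under the project folder.
In the dbgpt_hub/data directory you will get the newly generated training file example_text2sql_train.json and the test file example_text2sql_dev.json, with 8,659 and 1,034 records respectively.

When fine-tuning the model, we also customize the prompt dict to optimize the input:

@@ -168,47 +163,26 @@ SQL_PROMPT_DICT = {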

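### 3.3. Model fine-tuning

Model fine-tuning uses the qlora and lora methods. We can run the following command to fine-tune the model:
Model fine-tuning supports the qlora and lora methods. We can run the following command to fine-tune the model. The script passes the parameter `--quantization_bit` by default, which selects qlora fine-tuning; to switch to lora, simply remove that parameter from the script.
Run the command:

```bash
python train_qlora.py --model_name_or_path <path_or_name>
sh dbgpt_hub/scripts/train_sft.sh
```

The fine-tuned model weights are saved under the adapter folder by default. The full training script is in scripts/qlora/qlora.sh.
For multi-GPU runs, since scripts/spider_qlora_finetune.sh is based on QLoRA by default, it is recommended to specify the GPU IDs up front, e.g. change `python src/train/train_qlora.py` to `CUDA_VISIBLE_DEVICES=0,1,2,3 python src/train/train_qlora.py`.
The fine-tuned model weights are saved under the adapter folder by default, i.e. in the dbgpt_hub/output/adapter directory.

When fine-tuning with lora, we can use the following command:

```bash
python train_lora.py --model_name_or_path <path_or_name>
```
The full training script is in scripts/lora/.
### 3.4. Model prediction
The folder `./dbgpt_hub/output/pred/` under the project directory is the default output location for model predictions.

If you need to merge the fine-tuned weights into the base model, you can run the following command:

```bash
python dbgpt_hub/utils/merge_peft_adapters.py --peft_model_path Your_adapter_model
sh ./dbgpt_hub/scripts/predict_sft.sh
```

### 3.4. Model prediction
Create the `./data/out_pred/` folder under the project directory; this folder is the default output location.

- Prediction based directly on the base model
```bash
sh scripts/no_peft/get_predict_no_peft_llama2_13b_hf.sh
```
The script passes `--quantization_bit` by default, which means QLoRA prediction; removing it switches to LoRA prediction.

- Prediction based on LoRA
Run the following script:
```bash
sh scripts/lora/get_predict_lora.sh
```
The related default inputs and outputs can be found in ./data/out_pred/
- Prediction based on QLoRA
Run the following script:
```bash
sh scripts/qlora/get_predict_qlora.sh
```

# 3.5. Model weights
You can view the corresponding model weights on Huggingface. [huggingface link](https://huggingface.co/eosphoros)
@@ -218,7 +192,7 @@ sh scripts/qlora/get_predict_qlora.sh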
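Run the following command:

```bash
python eval/evaluation.py --plug_value --input Your_model_pred_file
python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file
```
You can find our latest evaluation results [here](docs/eval_llm_result.md).
## 4. Roadmap
@@ -227,31 +201,36 @@ python eval/evaluation.py --plug_value --input Your_model_pred_file

* Stage 1:
* Build the basic framework and, based on several large models, connect the whole pipeline from data processing and model SFT training to prediction output and evaluation. As of `20230804` the entire pipeline works end to end.
We currently support
We currently support
- [x] CodeLlama
- [x] Baichuan2
- [x] LLaMa/LLaMa2
- [x] Falcon
- [x] CodeLlama
- [x] Qwen
- [x] XVERSE
- [x] ChatGLM2
- [x] internlm




We preliminarily plan to support the following models going forward. If new and better models appear, we will keep an eye out and follow up too. Feel free to open an issue to suggest any; we'll be glad to see your issues.
- [ ] ChatGLM
- [ ] BLOOM
- [ ] CodeGeeX
- [ ] WizardLM


* Stage 2:
* Optimize model performance and support fine-tuning more models in different ways.
* Optimize `prompt`
* Release evaluation results and the optimized `DB-GPT-SFT` models
* Stage 3:
* Optimize based on more papers, such as `RESDSQL`;
* Optimize based on more papers, such as `RESDSQL`, and make further improvements together with our community's sibling project [Awesome-Text2SQL](https://github.com/eosphoros-ai/Awesome-Text2SQL)

## 5. Contributions

We welcome more people to participate and give feedback on datasets, model fine-tuning, performance evaluation, paper recommendation and reproduction, and so on, for example by opening issues or PRs; we will respond actively.
We welcome more people to participate and give feedback on datasets, model fine-tuning, performance evaluation, paper recommendation and reproduction, and so on, for example by opening issues or PRs; we will respond actively. Before submitting code, please format it with black.

## 6. Acknowledgements

Thanks to the following open-source projects
Our work builds mainly on numerous open-source efforts; many thanks to the following open-source projects.

* [Spider](https://github.com/ElementAI/spider)
* [CoSQL](https://yale-lily.github.io/cosql)
@@ -264,4 +243,5 @@ python eval/evaluation.py --plug_value --input Your_model_pred_file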
* [WizardLM](https://github.com/nlpxucan/WizardLM)
* [text-to-sql-wizardcoder](https://github.com/cuplv/text-to-sql-wizardcoder)
* [test-suite-sql-eval](https://github.com/taoyds/test-suite-sql-eval)
* [LLaMa-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)

18 changes: 18 additions & 0 deletions dbgpt_hub/configs/config.py
@@ -43,6 +43,24 @@
LAYERNORM_NAMES = ["norm", "ln_f", "ln_attn", "ln_mlp"]
EXT2TYPE = {"csv": "csv", "json": "json", "jsonl": "json", "txt": "text"}

# text2sql dataset information for processing sql data
SQL_DATA_INFO = [
    {
        "data_source": "spider",
        "train_file": ["train_spider.json", "train_others.json"],
        "dev_file": ["dev.json"],
        "tables_file": "tables.json",
        "db_id_name": "db_id",
        "is_multiple_turn": False
    }
]
INSTRUCTION_PROMPT = """\
I want you to act as a SQL terminal in front of an example database, \
you need only to return the sql command to me. Below is an instruction that describes a task, \
Write a response that appropriately completes the request.\n
##Instruction:\n{}\n"""
INPUT_PROMPT = "###Input:\n{}\n\n###Response:"

# METHODS = ["full", "freeze", "lora"]

# STAGES = ["SFT", "Reward Modeling", "PPO", "DPO", "Pre-Training"]
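
To illustrate how two single-placeholder templates like `INSTRUCTION_PROMPT` and `INPUT_PROMPT` could be turned into one training record, here is a minimal sketch. It does not show the PR's actual processing code: `build_record`, the shortened stand-in templates, and the schema description are all hypothetical, and only the record keys (`instruction`, `input`, `output`, `history`) and the sample SQL come from the files in this diff.

```python
# Stand-in templates (hypothetical, shortened): each has one `{}` placeholder,
# the same shape as INSTRUCTION_PROMPT and INPUT_PROMPT in config.py.
INSTRUCTION_TEMPLATE = "Return only the SQL command.\n##Instruction:\n{}\n"
INPUT_TEMPLATE = "###Input:\n{}\n\n###Response:"


def build_record(schema_text, question, sql):
    """Assemble one record from a schema description, a question, and its SQL."""
    return {
        "instruction": INSTRUCTION_TEMPLATE.format(schema_text),
        "input": INPUT_TEMPLATE.format(question),
        "output": sql,
        "history": [],
    }


# The schema description below is invented for illustration only.
record = build_record(
    "department_management contains tables such as head(head_id, name, age).",
    "How many heads of the departments are older than 56?",
    "select count(*) from head where age > 56",
)
print(record["instruction"] + record["input"])
```

Keeping the schema text and the question in separate fields means the same question template can be reused across databases, with only the instruction part changing per `db_id`.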
5 changes: 2 additions & 3 deletions dbgpt_hub/data/dataset_info.json
@@ -1,12 +1,11 @@
{
"example_text2sql": {
"file_name": "example_text2sql.json",
"file_name": "example_text2sql_train.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
},
"stage": "sft"
}
}
}
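
The `columns` block maps training roles (`prompt`, `query`, `response`, `history`) to the keys used in the generated JSON records. How the training code consumes this mapping is not shown in this diff, so the following is only a guess at the idea; `to_training_fields` is a hypothetical helper, and the entry is inlined here instead of being read from `dataset_info.json`.

```python
import json

# The updated dataset_info.json entry from this diff, inlined for illustration.
DATASET_INFO = json.loads("""
{
  "example_text2sql": {
    "file_name": "example_text2sql_train.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "history": "history"
    },
    "stage": "sft"
  }
}
""")


def to_training_fields(record, columns):
    """Rename a raw record's keys to the roles named in `columns`."""
    return {role: record.get(source_key) for role, source_key in columns.items()}


raw = {
    "instruction": "schema and task description",
    "input": "How many heads of the departments are older than 56?",
    "output": "select count(*) from head where age > 56",
    "history": [],
}
fields = to_training_fields(raw, DATASET_INFO["example_text2sql"]["columns"])
print(fields["response"])
```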