diff --git a/README.md b/README.md index f90796d..b58ab25 100644 --- a/README.md +++ b/README.md @@ -21,22 +21,30 @@ ## 1. What is DB-GPT-Hub -DB-GPT-Hub is an experimental project to implement Text-to-SQL parsing using LLMs, which mainly includes dataset collection, data pre-processing, model selection and construction, and fine-tuning weights, etc. Through this series of processing, we can reduce the model training cost while improving Text-to-SQL capability, and allow more developers to participate in the work of improving the accuracy of Text-to-SQL, and finally realizing the automatic database based question and answer capability, allowing users to complete complex database query operations through natural language descriptions. +DB-GPT-Hub is an experimental project utilizing LLMs (Large Language Models) to achieve Text-to-SQL parsing. The project primarily encompasses data collection, data preprocessing, model selection and building, and fine-tuning of weights. Through this series of processes, we aim to enhance Text-to-SQL capabilities while reducing the model training costs, allowing more developers to contribute to the improvement of Text-to-SQL accuracy. Our ultimate goal is to realize automated question-answering capabilities based on databases, enabling users to execute complex database queries through natural language descriptions. + +So far, we have successfully integrated multiple large models and established a complete workflow, including data processing, model SFT (Supervised Fine-Tuning) training, prediction output, and evaluation. The code is readily reusable within this project. + +As of October 10, 2023, by fine-tuning an open-source model of 13 billion parameters using this project, **the execution accuracy on the Spider evaluation dataset has surpassed that of GPT-4!** ## 2. Fine-tuning Text-to-SQL -Large Language Models (LLMs) have achieved impressive results in existing benchmark tests of Text-to-SQL. However, these models remain challenging in the face of large databases and noisy content, and the mysteries behind the huge database values need external knowledge and reasoning to be revealed. We enhance Text-to-SQL based on a large language models sustained SFT +We enhance the Text-to-SQL performance by applying Supervised Fine-Tuning (SFT) on large language models. ### 2.1. Dataset -The following publicly available text-to-sql datasets are used for this project: +The primary dataset for this project's examples is the **Spider** dataset: + +- [SPIDER](https://yale-lily.github.io/spider): A complex text2sql dataset across domains, containing 10,181 natural language queries, 5,693 SQL distributed across 200 separate databases, covering 138 different domains.[download link](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ) + +Other text2sql datasets available: -- [SPIDER](https://yale-lily.github.io/spider): A complex text2sql dataset across domains, containing 10,181 natural language queries, 5,693 SQL distributed across 200 separate databases, covering 138 different domains.[download link](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ) - [WikiSQL:](https://github.com/salesforce/WikiSQL) A large semantic parsing dataset consisting of 80,654 natural statement expressions and sql annotations of 24,241 tables. 
The queries in WikiSQL are limited to the same table and do not include complex operations such as sorting, grouping, subqueries, etc.
- [CHASE](https://xjtu-intsoft.github.io/chase/): A cross-domain multi-round interactive text2sql Chinese dataset containing a list of 5,459 multi-round questions consisting of 17,940 binary groups across 280 different domain databases.
- [BIRD-SQL:](https://bird-bench.github.io/) A large-scale cross-domain text-to-SQL benchmark in English, with a particular focus on large database content. The dataset contains 12,751 text-to-SQL data pairs and 95 databases with a total size of 33.4 GB across 37 occupational domains. The BIRD-SQL dataset bridges the gap between text-to-SQL research and real-world applications by exploring three additional challenges, namely dealing with large and messy database values, external knowledge inference and optimising SQL execution efficiency.
- [CoSQL:](https://yale-lily.github.io/cosql) A corpus for building cross-domain conversational text-to-SQL systems. It is a conversational version of the Spider and SParC tasks. CoSQL consists of 30k+ rounds and 10k+ annotated SQL queries from Wizard-of-Oz's collection of 3k conversations querying 200 complex databases across 138 domains. Each conversation simulates a realistic DB query scenario in which a staff member explores the database as a user and a SQL expert uses SQL to retrieve answers, clarify ambiguous questions, or otherwise inform.
+- Following the processing template of [NSQL](https://github.com/NumbersStationAI/NSQL), the dataset underwent basic processing, yielding approximately [200K training entries](https://huggingface.co/datasets/Healthy13/Text2SQL/tree/main)

@@ -52,26 +60,17 @@ DB-GPT-Hub currently supports the following base models:
- [x] XVERSE
- [x] ChatGLM2
- [x] internlm
-
-The approximate hardware resources required to quantize and fine-tune the model are as follows:
-
-| Model Parameters | GPU RAM | CPU RAM | DISK |
| ---------------- | -------------- | ------- | ------ |
| 7b | 4.8GB (14.7GB) | 3.6GB | 36.4GB |
| 13b | 8.4GB (28.7GB) | 5.9GB | 60.2GB |
| 33b | 18.3GB (OOM) | 8.4GB | 122GB |
| 65b | 38.7GB (OOM) | 13.1GB | 434GB |
-### 2.3. Fine-tuning methods
-#### Spider+QLoRA/LoRA+LLM(Falcon/Vicuna/Guanaco/LLaMa2/CodeLlama)
+The model is fine-tuned with a quantization bit of 4 using QLoRA (Quantized Low-Rank Adaptation). The approximate minimum hardware requirements are as follows:

-This experimental project builds a dataset by adding table structure information, adjusting the parameters of the language model and then fine-tuning the LLM with QLoRA/LoRA, aiming to reduce the cost of fine-tuning while increasing the accuracy and speed of SQL generation. This can be executed with the following command:
+| Model Parameters | GPU RAM | CPU RAM | DISK |
+| -------- | --------------- | ------- | ------ |
+| 7b | 6GB | 3.6GB | 36.4GB |
+| 13b | 13.4GB | 5.9GB | 60.2GB |
+
+All related parameters are set to their minimums, with a batch size of 1 and a max length of 512. Based on experience, for better performance it is recommended to set the length-related values to 1024 or 2048.

-```shell
-sh scripts/qlora/qlora.sh
-sh scripts/lora/lora.sh
-```
## 3. Usage

@@ -85,121 +84,122 @@ conda activate dbgpt_hub
pip install -r requirements.txt
mkdir model
```
-Put the model files under the new Model folder here
### 3.2. Data preparation
-DB-GPT-HUB uses the information matching generation method for data preparation, i.e. the SQL + Repository generation method that combines table information. This method combines data table information to better understand the structure and relationships of the data table, and is suitable for generating SQL statements that meet the requirements.
-
-Before running, you need to download the SQL data set and put it in this directory. Here, take the spider data set as an example. The spider data set consists of three main parts:
-
-* train_spide.json: each text-to-SQL QA data and database related data is stored as a json file
- * db_id: the name of the database
- * question: the command issued to the database in natural language
- * query: sql code that accepts the natural language command and executes it exactly
-* train_gold.sql: the real sql code for the question
-* database: the database source file
- * schema.sql: the table build statement.
- * sqlite: the specifics of the database.
-
-First we need to extract all the information from the above data such as QA, table structure and database content in the following format:
-
-```
-{
- "query": sample["query"].
- "question": sample["question"].
- "db_id": db_id.
- "db_path": db_path.
- "db_table_names": schema["table_names_original"].
- "db_column_names": [
- {"table_id": table_id, "column_name": column_name}
- for table_id, column_name in schema["column_names_original"]
- ].
- "db_column_types": schema["column_types"].
- "db_primary_keys": [{"column_id": column_id} for column_id in schema["primary_keys"]].
- "db_foreign_keys": [
- {"column_id": column_id, "other_column_id": other_column_id}
- for column_id, other_column_id in schema["foreign_keys"]
- ].
- }
-```
+DB-GPT-Hub uses the information matching generation method for data preparation, i.e. the SQL + Repository generation method that combines table information. This method combines data table information to better understand the structure and relationships of the data table, and is suitable for generating SQL statements that meet the requirements.

-This data is then expressed in natural language, e.g:
+Download the [Spider dataset](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ). By default, after downloading and extracting the data, place it in the dbgpt_hub/data directory, i.e., the path should be `dbgpt_hub/data/spider`.

-```
-{"instruction": "department_management contains tables such as department, head, management. Table department has columns such as department_id, name, creation, ranking, budget_in_billions, num_employees. department_id is the primary key. Table head has columns such as head_id, name, born_state, age. head_id is the primary key. Table management has columns such as department_id, head_id, temporary_acting. department_id is the primary key. The head_id of management is the foreign key of head_id of head.
The department_id of management is the foreign key of department_id of department.",
-"input": "How many heads of the departments are older than 56 ?",
-"output": "select count(*) from head where age > 56"}
+For the data preprocessing part, simply **run the following script**:
+```bash
+## generate train and dev(eval) data
+sh dbgpt_hub/scripts/train_eval_data_gen.sh
```
- You can from the [link](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ) download the spider data,By default, after Unzip the data and place it under the directory dbgpt_hub/data, which means the path is dbgpt_hub/data/spider.
+In the directory `dbgpt_hub/data/`, you will find the newly generated training file example_text2sql_train.json and testing file example_text2sql_dev.json, containing 8659 and 1034 entries respectively.
-The code implementation of the above data pre-processing section is as follows:
-
-```bash
-## Generate train data and dev data
-sh dbgpt_hub/scripts/train_eval_data_gen.sh
+The data in the generated JSON looks something like this:
```
-In the dbgpt_hub/data directory, you will obtain the newly generated training file example_text2sql_train.json and the testing file example_text2sql_dev.json, with data counts of 8659 and 1034 respectively.
-
-When fine-tuning the model, we also customize the prompt dict to optimize the input:
-
-``` python
-SQL_PROMPT_DICT = {
- "prompt_input": (
- "I want you to act as a SQL terminal in front of an example database, \
- you need only to return the sql command to me.Below is an instruction that describes a task, \
- Write a response that appropriately completes the request.\n" \
- "##Instruction:\n{instruction}\n###Input:\n{input}\n\n###Response:"
- ),
- "prompt_no_input": (
- "I want you to act as a SQL terminal in front of an example database, \
- you need only to return the sql command to me.Below is an instruction that describes a task, \
- Write a response that appropriately completes the request.\n" \
- "####Instruction:\n{instruction}\n\###Response: "
- ),
-}
+ {
+ "db_id": "department_management",
+ "instruction": "I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\"\n##Instruction:\ndepartment_management contains tables such as department, head, management. Table department has columns such as Department_ID, Name, Creation, Ranking, Budget_in_Billions, Num_Employees. Department_ID is the primary key.\nTable head has columns such as head_ID, name, born_state, age. head_ID is the primary key.\nTable management has columns such as department_ID, head_ID, temporary_acting. department_ID is the primary key.\nThe head_ID of management is the foreign key of head_ID of head.\nThe department_ID of management is the foreign key of Department_ID of department.\n\n",
+ "input": "###Input:\nHow many heads of the departments are older than 56 ?\n\n###Response:",
+ "output": "SELECT count(*) FROM head WHERE age > 56",
+ "history": []
+ },
```
-```
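As a quick sanity check of the preprocessing output, a small shell sketch like the one below can be used. It is an illustration only, not part of the project's scripts: it assumes the default paths named above, standard Unix tools, and (for the optional lines) that `jq` happens to be installed.

```bash
# Sketch: confirm the generated files exist and peek at the data.
# Paths are the defaults described above (dbgpt_hub/data/).
ls -lh dbgpt_hub/data/example_text2sql_train.json dbgpt_hub/data/example_text2sql_dev.json

# The files are JSON arrays of {db_id, instruction, input, output, history} records,
# so the first record should be visible near the top of the pretty-printed file.
head -n 20 dbgpt_hub/data/example_text2sql_dev.json

# Optional, if jq is available:
# jq 'length' dbgpt_hub/data/example_text2sql_dev.json    # expected: 1034
# jq '.[0].output' dbgpt_hub/data/example_text2sql_dev.json
```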
### 3.3. Model fine-tuning
-The model fine-tuning supports both qlora and lora methods. We can run the following command to fine-tune the model. By default, with the parameter --quantization_bit, it uses the qlora fine-tuning method. To switch to lora, simply remove the related parameter from the script.
+The model fine-tuning supports both LoRA and QLoRA methods. We can run the following command to fine-tune the model. By default, with the parameter --quantization_bit, it uses the QLoRA fine-tuning method. To switch to LoRA, simply remove the related parameter from the script. Run the command:

```bash
sh dbgpt_hub/scripts/train_sft.sh
```

-After fine-tuning, the model weights will be saved by default in the adapter folder, specifically in the dbgpt_hub/output/adapter directory.
+After fine-tuning, the model weights will be saved by default in the adapter folder, specifically in the dbgpt_hub/output/adapter directory.
+
+If you are using **multi-GPU training and want to use DeepSpeed**, you should modify the default content in train_sft.sh. Change:
+
+```
+CUDA_VISIBLE_DEVICES=0 python dbgpt_hub/train/sft_train.py \
+    --quantization_bit 4 \
+    ...
+```
+to:
+```
+deepspeed --num_gpus 2 dbgpt_hub/train/sft_train.py \
+    --deepspeed dbgpt_hub/configs/ds_config.json \
+    --quantization_bit 4 \
+    ...
+```
+
+The omitted parts (…) can be kept unchanged. If you want to change the default DeepSpeed configuration, go into the `dbgpt_hub/configs` directory and modify ds_config.json as needed.
+
+In the script, different models require different values of the key parameters lora_target and template during fine-tuning, as shown in the following table:
+
+| model name | lora_target | template |
| -------------------------------------------------------- | ----------------- |----------|
| [LLaMA-2](https://huggingface.co/meta-llama) | q_proj,v_proj | llama2 |
| [CodeLlama-2](https://huggingface.co/codellama/) | q_proj,v_proj | llama2 |
| [Baichuan2](https://github.com/baichuan-inc/Baichuan2) | W_pack | baichuan2 |
| [InternLM](https://github.com/InternLM/InternLM) | q_proj,v_proj | intern |
| [Qwen](https://github.com/QwenLM/Qwen-7B) | c_attn | chatml |
| [XVERSE](https://github.com/xverse-ai/XVERSE-13B) | q_proj,v_proj | xverse |
| [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B) | query_key_value | chatglm2 |
| [LLaMA](https://github.com/facebookresearch/llama) | q_proj,v_proj | - |
| [BLOOM](https://huggingface.co/bigscience/bloom) | query_key_value | - |
| [BLOOMZ](https://huggingface.co/bigscience/bloomz) | query_key_value | - |
| [Baichuan](https://github.com/baichuan-inc/baichuan-13B) | W_pack | baichuan |
| [Falcon](https://huggingface.co/tiiuae/falcon-7b) | query_key_value | - |

+In `train_sft.sh`, other key parameters are as follows:
+
+> quantization_bit: Indicates whether quantization is applied, with valid values being [4 or 8].
+> model_name_or_path: The path of the LLM (Large Language Model).
+> dataset: Specifies the name of the training dataset configuration, corresponding to the outer key value in dbgpt_hub/data/dataset_info.json, such as example_text2sql.
+> max_source_length: The length of the text input into the model. If computing resources allow, it can be set as large as possible, like 1024 or 2048.
+> max_target_length: The length of the SQL content output by the model; 512 is generally sufficient.
+> output_dir: The output path of the Peft module during SFT (Supervised Fine-Tuning), set by default to `dbgpt_hub/output/adapter/`.
+> per_device_train_batch_size: The size of the batch. If computing resources allow, it can be set larger; the default is 1.
+> gradient_accumulation_steps: The number of steps for accumulating gradients before an update.
+> save_steps: The number of steps at which model checkpoints are saved; it can be set to 100 by default.
+> num_train_epochs: The number of epochs for training the dataset.
+
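For orientation, the sketch below shows how the parameters above might be combined into a single-GPU invocation. This is a hypothetical illustration rather than the shipped script: the model path, template, lora_target, and numeric values are placeholders to adapt to your base model (see the table above) and hardware; the authoritative defaults live in `dbgpt_hub/scripts/train_sft.sh`.

```bash
# Hypothetical single-GPU QLoRA fine-tuning invocation (placeholder values only).
# Pick lora_target/template for your base model from the table above and
# compare against the shipped dbgpt_hub/scripts/train_sft.sh before running.
CUDA_VISIBLE_DEVICES=0 python dbgpt_hub/train/sft_train.py \
    --quantization_bit 4 \
    --model_name_or_path /path/to/your/base/model \
    --dataset example_text2sql \
    --template llama2 \
    --lora_target q_proj,v_proj \
    --max_source_length 1024 \
    --max_target_length 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --save_steps 100 \
    --num_train_epochs 8 \
    --output_dir dbgpt_hub/output/adapter
```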
### 3.4. Model Predict
-Under the project directory ./dbgpt_hub/output/pred/, this folder is the default output location for model predictions.
+Under the project directory ./dbgpt_hub/output/pred/, this folder is the default output location for model predictions (create it if it does not exist).

```bash
sh ./dbgpt_hub/scripts/predict_sft.sh
```

-In the script, by default with the parameter --quantization_bit, it predicts using QLoRA. Removing it switches to the LoRA prediction method.
+In the script, by default with the parameter `--quantization_bit`, it predicts using QLoRA. Removing it switches to the LoRA prediction method.
+The value of the parameter `--predicted_out_filename` is the file name of the model's predicted results, which can be found in the `dbgpt_hub/output/pred` directory.

### 3.5 Model Weights
You can find the corresponding model weights we uploaded in August from Huggingface: [hg-eosphoros-ai](https://huggingface.co/eosphoros)
-We will release a better version of the new weights as soon as possible.
+We will soon release new, improved weights that outperform GPT-4 in execution accuracy on the Spider evaluation set.

-## 3.5.2 Model and fine-tuned weight merging
+#### 3.5.1 Model and fine-tuned weight merging

-Run the following script, and be sure to replace the relevant parameter path values in the script with the path corresponding to your project.
+If you need to merge the weights of the trained base model and the fine-tuned Peft module to export a complete model, run the following model export script:

```bash
sh ./dbgpt_hub/scripts/export_merge.sh
```

+Be sure to replace the parameter path values in the script with the paths corresponding to your project.
+
### 3.6 Model Evaluation

-To evaluate model performance on the dataset, default is spider dataset.
+To evaluate model performance on the dataset, the Spider dev dataset is used by default.
Run the following command:
```bash
python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file
```

@@ -223,9 +223,9 @@ The whole process we will divide into three phases:
- [x] internlm
* Stage 2:
- * Optimize model performance, support fine-tuning more different models in various ways.
+ * Optimize model performance and support fine-tuning more models in various ways (completed before `20231010`)
* Optimize `prompts`
- * Release evaluation results, optimize `DB-GPT-SFT` models
+ * Release evaluation results and open the optimized models to the community.
* Stage 3:
* Optimize based on more papers, such as RESDSQL and others. Combined with our community's sibling project [Awesome-Text2SQL](https://github.com/eosphoros-ai/Awesome-Text2SQL) for further enhancements.
diff --git a/README.zh.md b/README.zh.md index 121feae..afd5918 100644 --- a/README.zh.md +++ b/README.zh.md @@ -21,26 +21,32 @@ ## 一、简介 -DB-GPT-Hub是一个利用LLMs实现Text-to-SQL解析的实验项目,主要包含数据集收集、数据预处理、模型选择与构建和微调权重等步骤,通过这一系列的处理可以在提高Text-to-SQL能力的同时降低模型训练成本,让更多的开发者参与到Text-to-SQL的准确度提升工作当中,最终实现基于数据库的自动问答能力,让用户可以通过自然语言描述完成复杂数据库的查询操作等工作。 +DB-GPT-Hub是一个利用LLMs实现Text-to-SQL解析的实验项目,主要包含数据集收集、数据预处理、模型选择与构建和微调权重等步骤,通过这一系列的处理可以在提高Text-to-SQL能力的同时降低模型训练成本,让更多的开发者参与到Text-to-SQL的准确度提升工作当中,最终实现基于数据库的自动问答能力,让用户可以通过自然语言描述完成复杂数据库的查询操作等工作。 +目前我们已经基于多个大模型打通从数据处理、模型SFT训练、预测输出和评估的整个流程,**代码在本项目中均可以直接复用**。 +截止20231010,我们利用本项目基于开源的13B大小的模型微调后,在Spider的评估集上的执行准确率,**已经超越GPT-4!** ## 二、Text-to-SQL微调 -大型语言模型(LLMs)在现有Text-to-SQL的基准测试中取得了令人印象深刻的成果。然而,这些模型在面对大型数据库和嘈杂内容时仍然存在挑战,而且巨大的数据库价值背后的奥秘需要外部知识和推理来揭示。 我们基于大语言模型持续的SFT来提升Text-to-SQL的效果 + 我们基于大语言模型的SFT来提升Text-to-SQL的效果。 ### 2.1、数据集 -本项目主要使用了以下公开的text2sql数据集: +本项目案例数据主要以**Spider**数据集为示例 : - [Spider](https://yale-lily.github.io/spider): 一个跨域的复杂text2sql数据集,包含了10,181条自然语言问句、分布在200个独立数据库中的5,693条SQL,内容覆盖了138个不同的领域。[下载链接](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ) + +其他数据集: + - [WikiSQL:](https://github.com/salesforce/WikiSQL) 一个大型的语义解析数据集,由80,654个自然语句表述和24,241张表格的sql标注构成。WikiSQL中每一个问句的查询范围仅限于同一张表,不包含排序、分组、子查询等复杂操作。 - [CHASE](https://xjtu-intsoft.github.io/chase/): 一个跨领域多轮交互text2sql中文数据集,包含5459个多轮问题组成的列表,一共17940个二元组,涉及280个不同领域的数据库。 - [BIRD-SQL:](https://bird-bench.github.io/)数据集是一个英文的大规模跨领域文本到SQL基准测试,特别关注大型数据库内容。该数据集包含12,751对文本到SQL数据对和95个数据库,总大小为33.4GB,跨越37个职业领域。BIRD-SQL数据集通过探索三个额外的挑战,即处理大规模和混乱的数据库值、外部知识推理和优化SQL执行效率,缩小了文本到SQL研究与实际应用之间的差距。 - [CoSQL:](https://yale-lily.github.io/cosql)是一个用于构建跨域对话文本到sql系统的语料库。它是Spider和SParC任务的对话版本。CoSQL由30k+回合和10k+带注释的SQL查询组成,这些查询来自Wizard-of-Oz的3k个对话集合,查询了跨越138个领域的200个复杂数据库。每个对话都模拟了一个真实的DB查询场景,其中一个工作人员作为用户探索数据库,一个SQL专家使用SQL检索答案,澄清模棱两可的问题,或者以其他方式通知。 +- 按照[NSQL](https://github.com/NumbersStationAI/NSQL)的处理模板,对数据集做简单处理,共得到约[20w条训练数据](https://huggingface.co/datasets/Healthy13/Text2SQL/tree/main) ### 2.2、基座模型 -DB-GPT-HUB目前支持的base模型有: +DB-GPT-HUB目前已经支持的base模型有: - [x] CodeLlama - [x] Baichuan2 @@ -50,27 +56,18 @@ DB-GPT-HUB目前支持的base模型有: - [x] XVERSE - [x] ChatGLM2 - [x] internlm + - [x] Falcon + -模型量化微调所需的硬件资源大概如下: +模型可以基于quantization_bit为4的量化微调(QLoRA)所需的最低硬件资源,可以参考如下: | 模型参数 | GPU RAM | CPU RAM | DISK | | -------- | --------------- | ------- | ------ | -| 7b | 4.8GB(14.7GB) | 3.6GB | 36.4GB | -| 13b | 8.4GB(28.7GB) | 5.9GB | 60.2GB | -| 33b | 18.3GB(OOM) | 8.4GB | 122GB | -| 65b | 38.7GB(OOM) | 13.1GB | 434GB | +| 7b | 6GB | 3.6GB | 36.4GB | +| 13b | 13.4GB | 5.9GB | 60.2GB | -### 2.3、微调方法 +其中相关参数均设置的为最小,batch_size为1,max_length为512。根据经验,如果计算资源足够,为了效果更好,建议相关长度值设置为1024或者2048。 -#### Spider+QLoRA/LoRA+LLM(Falcon/Vicuna/Guanaco/LLaMa/LLaMa2/CodeLlama) - -该实验项目通过加入表结构信息、调整语言模型的参数等方式构建数据集,然后用QLoRA/LoRA对LLM模型进行微调,旨在降低微调成本的同时提高SQL生成的准确性和速度。可以通过以下命令来执行: - -```shell -sh scripts/qlora/qlora.sh -sh scripts/lora/lora.sh 或者 sh scripts/lora/lora_ds.sh -``` -其中lora.sh和lora_ds.sh的区别主要是用deepspeed(ds)版本。 ## 三、使用方法 ### 3.1、环境准备 @@ -82,113 +79,108 @@ conda create -n dbgpt_hub python=3.10 conda activate dbgpt_hub pip install -r requirements.txt ``` -你可以将下载的大模型文件放在新建model文件夹下面 ### 3.2、数据准备 -DB-GPT-HUB使用的是信息匹配生成法进行数据准备,即结合表信息的 SQL + Repository 生成方式,这种方式结合了数据表信息,能够更好地理解数据表的结构和关系,适用于生成符合需求的 SQL 语句。 - -运行前需要将SQL数据集下载后放在该目录下。这里以spider数据集为例,spider数据集主要包含三部分: +DB-GPT-Hub使用的是信息匹配生成法进行数据准备,即结合表信息的 SQL + Repository 生成方式,这种方式结合了数据表信息,能够更好地理解数据表的结构和关系,适用于生成符合需求的 SQL 语句。 
+从[spider数据集链接](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ) 下载spider数据集,默认将数据下载解压后,放在目录dbgpt_hub/data下面,即路径为`dbgpt_hub/data/spider`。 -* train_spide.json:每条text-to-SQL的QA数据与数据库相关数据存储为json文件 - * db_id:数据库名称 - * question: 以自然语言的方式向数据库发出的指令 - * query:接受自然语言指令后,能够准确执行指令的sql代码 -* train_gold.sql:question的真实sql代码 -* database:数据库源文件 - * schema.sql: 建表语句。 - * sqlite: 数据库的具体内容。 - -首先我们需要将以上数据中的QA、表结构和数据库内容等都信息提取出来,格式如下: - -``` -{ - "query": sample["query"], - "question": sample["question"], - "db_id": db_id, - "db_path": db_path, - "db_table_names": schema["table_names_original"], - "db_column_names": [ - {"table_id": table_id, "column_name": column_name} - for table_id, column_name in schema["column_names_original"] - ], - "db_column_types": schema["column_types"], - "db_primary_keys": [{"column_id": column_id} for column_id in schema["primary_keys"]], - "db_foreign_keys": [ - {"column_id": column_id, "other_column_id": other_column_id} - for column_id, other_column_id in schema["foreign_keys"] - ], - } -``` - -然后将该数据以自然语言的形式表述,例如: - -``` -{"instruction": "department_management contains tables such as department, head, management. Table department has columns such as department_id, name, creation, ranking, budget_in_billions, num_employees. department_id is the primary key. Table head has columns such as head_id, name, born_state, age. head_id is the primary key. Table management has columns such as department_id, head_id, temporary_acting. department_id is the primary key. The head_id of management is the foreign key of head_id of head. The department_id of management is the foreign key of department_id of department.", -"input": "How many heads of the departments are older than 56 ?", -"output": "select count(*) from head where age > 56"} - -``` - -从[下载链接](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ) 下载spider数据集,默认将数据下载解压后,放在目录dbgpt_hub/data下面,即路径为dbgpt_hub/data/spider - -数据预处理部分的代码实现如下: +数据预处理部分,**只需运行如下脚本**即可: ```bash -## 生成train数据 和dev数据, +## 生成train数据 和dev(eval)数据, sh dbgpt_hub/scripts/train_eval_data_gen.sh ``` -在dbgpt_hub/data目录你会得到新生成的训练文件example_text2sql_train.json 和测试文件example_text2sql_dev.json ,数据量分别为8659和1034条。 - -在模型微调时,我们还定制了prompt dict以优化输入: - -```python -SQL_PROMPT_DICT = { - "prompt_input": ( - "I want you to act as a SQL terminal in front of an example database, \ - you need only to return the sql command to me.Below is an instruction that describes a task, \ - Write a response that appropriately completes the request.\n" \ - "##Instruction:\n{instruction}\n###Input:\n{input}\n\n###Response:" - ), - "prompt_no_input": ( - "I want you to act as a SQL terminal in front of an example database, \ - you need only to return the sql command to me.Below is an instruction that describes a task, \ - Write a response that appropriately completes the request.\n" \ - "####Instruction:\n{instruction}\n\###Response: " - ), -} +在`dbgpt_hub/data/`目录你会得到新生成的训练文件example_text2sql_train.json 和测试文件example_text2sql_dev.json ,数据量分别为8659和1034条。 + +生成的json中的数据形如: ``` + { + "db_id": "department_management", + "instruction": "I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\"\n##Instruction:\ndepartment_management contains tables such as department, head, management. 
Table department has columns such as Department_ID, Name, Creation, Ranking, Budget_in_Billions, Num_Employees. Department_ID is the primary key.\nTable head has columns such as head_ID, name, born_state, age. head_ID is the primary key.\nTable management has columns such as department_ID, head_ID, temporary_acting. department_ID is the primary key.\nThe head_ID of management is the foreign key of head_ID of head.\nThe department_ID of management is the foreign key of Department_ID of department.\n\n", + "input": "###Input:\nHow many heads of the departments are older than 56 ?\n\n###Response:", + "output": "SELECT count(*) FROM head WHERE age > 56", + "history": [] + }, +``` + ### 3.3、模型微调 -模型微调支持qlora和lora方法,我们可以运行以下命令来微调模型,默认带着参数`--quantization_bit `为qlora的微调方式,转换为lora只需在脚本中去掉相关参数即可。 -运行命令: +本项目微调不仅能支持QLoRA和LoRA法,还支持deepseed。 可以运行以下命令来微调模型,默认带着参数`--quantization_bit `为QLoRA的微调方式,如果想要转换为lora的微调,只需在脚本中去掉quantization_bit参数即可。 +默认QLoRA微调,运行命令: ```bash sh dbgpt_hub/scripts/train_sft.sh ``` +微调后的模型权重会默认保存到adapter文件夹下面,即dbgpt_hub/output/adapter目录中。 +**如果使用多卡训练,想要用deepseed** ,则将train_sft.sh中默认的内容进行更改, +调整为: -微调后的模型权重会默认保存到adapter文件夹下面,即dbgpt_hub/output/adapter目录中。 +``` +CUDA_VISIBLE_DEVICES=0 python dbgpt_hub/train/sft_train.py \ + --quantization_bit 4 \ + ... +``` +更改为: +``` +deepspeed --num_gpus 2 dbgpt_hub/train/sft_train.py \ + --deepspeed dbgpt_hub/configs/ds_config.json \ + --quantization_bit 4 \ + ... +``` +其他省略(...)的部分均保持一致即可。 如果想要更改默认的deepseed配置,进入 `dbgpt_hub/configs` 目录,在ds_config.json 更改即可。 + +脚本中微调时不同模型对应的关键参数lora_target 和 template,如下表: + +| 模型名 | lora_target | template | +| -------------------------------------------------------- | ----------------- |----------| +| [LLaMA-2](https://huggingface.co/meta-llama) | q_proj,v_proj | llama2 | +| [CodeLlama-2](https://huggingface.co/codellama/) | q_proj,v_proj | llama2 | +| [Baichuan2](https://github.com/baichuan-inc/Baichuan2) | W_pack | baichuan2 | +| [InternLM](https://github.com/InternLM/InternLM) | q_proj,v_proj | intern | +| [Qwen](https://github.com/QwenLM/Qwen-7B) | c_attn | chatml | +| [XVERSE](https://github.com/xverse-ai/XVERSE-13B) | q_proj,v_proj | xverse | +| [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B) | query_key_value | chatglm2 | +| [LLaMA](https://github.com/facebookresearch/llama) | q_proj,v_proj | - | +| [BLOOM](https://huggingface.co/bigscience/bloom) | query_key_value | - | +| [BLOOMZ](https://huggingface.co/bigscience/bloomz) | query_key_value | - | +| [Baichuan](https://github.com/baichuan-inc/baichuan-13B) | W_pack | baichuan | +| [Falcon](https://huggingface.co/tiiuae/falcon-7b) | query_key_value | - | + +`train_sft.sh`中其他关键参数含义: +> quantization_bit:是否量化,取值为[4或者8] +> model_name_or_path: LLM模型的路径 +> dataset: 取值为训练数据集的配置名字,对应在dbgpt_hub/data/dataset_info.json 中外层key值,如example_text2sql。 +> max_source_length: 输入模型的文本长度,如果计算资源支持,可以尽能设大,如1024或者2048。 +> max_target_length: 输出模型的sql内容长度,设置为512一般足够。 +> output_dir : SFT微调时Peft模块输出的路径,默认设置在dbgpt_hub/output/adapter/路径下 。 +> per_device_train_batch_size : batch的大小,如果计算资源支持,可以设置为更大,默认为1。 +> gradient_accumulation_steps : 梯度更新的累计steps值 +> save_steps : 模型保存的ckpt的steps大小值,默认可以设置为100。 +> num_train_epochs : 训练数据的epoch数 -### 3.4、模型预测 -项目目录下`./dbgpt_hub/output/pred/`,此文件夹为关于模型预测默认输出的位置。 +### 3.4、模型预测 +项目目录下`./dbgpt_hub/`下的`output/pred/`,此文件路径为关于模型预测结果默认输出的位置(如果没有则建上)。 +预测运行命令: ```bash sh ./dbgpt_hub/scripts/predict_sft.sh -``` +``` +脚本中默认带着参数`--quantization_bit `为QLoRA的预测,去掉即为LoRA的预测方式。 +其中参数 `--predicted_out_filename` 的值为模型预测的结果文件名,结果在`dbgpt_hub/output/pred`目录下可以找到。 
-脚本中默认带着参数`--quantization_bit `为QLoRA的预测,去掉即为LoRA的预测方式。 +### 3.5、模型权重 +可以从Huggingface查看我们之前8月份上传的对应的Peft模块的权重[huggingface地址](https://huggingface.co/eosphoros) 。 新的更好的在spider的评估集上执行准确率超越GPT-4的权重我们将尽快释放出。 -# 3.5、模型权重 -可以从Huggingface查看我们之前8月份上传的对应的模型权重。 [huggingface地址](https://huggingface.co/eosphoros) -新的权重我们将尽快释放出一版效果更好的。 -## 3.5.2 模型和微调权重合并 -运行如下脚本,注意将脚本中的相关参数路径值替换为你项目所对应的路径。 +#### 3.5.1 模型和微调权重合并 +如果你需要将训练的基础模型和微调的Peft模块的权重合并,导出一个完整的模型。则运行如下模型导出脚本: ```bash sh ./dbgpt_hub/scripts/export_merge.sh ``` +注意将脚本中的相关参数路径值替换为你项目所对应的路径。 ### 3.6、模型评估 @@ -198,7 +190,7 @@ sh ./dbgpt_hub/scripts/export_merge.sh ```bash python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file ``` -你可以在[这里](docs/eval_llm_result.md)找到我们最新的审查结果。 +你可以在[这里](docs/eval_llm_result.md)找到我们最新的评估结果。 ## 四、发展路线 整个过程我们会分为三个阶段: @@ -216,15 +208,11 @@ python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file - [x] internlm - - - We preliminarily plan to support the following models going forward. If there are new and better models, we'll keep an eye out and follow up too. Feel free to open an issue to suggest any, we'll glad to see your issues. - * 阶段二: - * 优化模型效果,支持更多不同模型进行不同方式的微调。 - * 对`prompt`优化 - * 放出评估效果,优化后`DB-GPT-SFT`模型 + - [x] 优化模型效果,支持更多不同模型进行不同方式的微调。截止`20231010`,我们已经完成对项目代码的重构,支持更多的模型。 + - [x] 对`prompt`优化 + * 放出评估效果,和优化后的还不错的模型 * 阶段三: * 基于更多论文进行优化,如`RESDSQL`等,结合我们社区的兄弟项目[Awesome-Text2SQL](https://github.com/eosphoros-ai/Awesome-Text2SQL)进行更多的优化; diff --git a/README_zh_0925.md b/README_zh_0925.md deleted file mode 100644 index 0fcff20..0000000 --- a/README_zh_0925.md +++ /dev/null @@ -1,268 +0,0 @@ -# LLM Efficient Tuning (Text2SQL示例) - - -## 支持的模型 - -| 模型名 | 模型大小 | 默认模块 | Template | -| -------------------------------------------------------- | --------------------------- | ----------------- |----------| -| [LLaMA](https://github.com/facebookresearch/llama) | 7B/13B/33B/65B | q_proj,v_proj | - | -| [LLaMA-2](https://huggingface.co/meta-llama) | 7B/13B/70B | q_proj,v_proj | llama2 | -| [BLOOM](https://huggingface.co/bigscience/bloom) | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | - | -| [BLOOMZ](https://huggingface.co/bigscience/bloomz) | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | - | -| [Falcon](https://huggingface.co/tiiuae/falcon-7b) | 7B/40B | query_key_value | - | -| [Baichuan](https://github.com/baichuan-inc/baichuan-13B) | 7B/13B | W_pack | baichuan | -| [Baichuan2](https://github.com/baichuan-inc/Baichuan2) | 7B/13B | W_pack | baichuan2 | -| [InternLM](https://github.com/InternLM/InternLM) | 7B | q_proj,v_proj | intern | -| [Qwen](https://github.com/QwenLM/Qwen-7B) | 7B | c_attn | chatml | -| [XVERSE](https://github.com/xverse-ai/XVERSE-13B) | 13B | q_proj,v_proj | xverse | -| [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B) | 6B | query_key_value | chatglm2 | - -- **默认模块**是 `--lora_target` 参数的部分可选项。请使用 `python src/train_bash.py -h` 查看全部可选项。 -- 对于所有“基座”(Base)模型,`--template` 参数可以是 `default`, `alpaca`, `vicuna` 等任意值。但“对话”(Chat)模型请务必使用对应的模板。 - -## 训练方法 - -- 支持全参数训练(full)、部分参数训练(freeze)、lora、qlora - - 修改 `--finetuning_type` 参数选择训练方法["lora", "freeze", "full"]; - - 使用 `--quantization_bit 4/8` 参数来启用 qlora 训练。 - -## 数据集 - -- 本示例中仅使用指令监督微调(SFT),数据集来自[NSQL](https://github.com/NumbersStationAI/NSQL), 以及NSQL中不包含的数据集(BIRD, CHASE, cosql) -- 按照NSQL的处理模板,对数据集做预处理,共得到约[20w条训练数据](https://huggingface.co/datasets/Healthy13/Text2SQL/tree/main) -- 模板格式如下: - - ```json - { -       "db_id": "database", -       "instruction": "CREATE TABLE mountain 
(\nmountain_name,\nmountain_altitude,\nstate_name,\ncountry_name\n)\n\nCREATE TABLE city (\ncity_name,\nstate_name,\npopulation,\ncountry_name\n)\n\nCREATE TABLE road (\nroad_name,\nstate_name\n)\n\nCREATE TABLE border_info (\nstate_name,\nborder\n)\n\nCREATE TABLE river (\nriver_name,\nlength,\ntraverse,\ncountry_name\n)\n\nCREATE TABLE state (\nstate_name,\ncapital,\npopulation,\narea,\ncountry_name,\ndensity\n)\n\nCREATE TABLE highlow (\nstate_name,\nhighest_point,\nhighest_elevation,\nlowest_point,\nlowest_elevation\n)\n\nCREATE TABLE lake (\nlake_name,\narea,\nstate_name,\ncountry_name\n)\n\n\n-- Using valid SQLite, answer the following questions for the tables provided above.\n\n-- which states border arizona\nSELECT", -       "input": "", -       "output": "SELECT border FROM border_info WHERE state_name = 'arizona'", -       "history": [] - } - ``` - -| Datasets | License | Link | -| ---------------------- | ------------ | -------------------------------------------------------------------------------------------------------------------- | -| academic | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) | -| advising | CC-BY-4.0 | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) | -| atis | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) | -| restaurants | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) | -| scholar | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) | -| imdb | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) | -| yelp | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) | -| criteria2sql | Apache-2.0 | [https://github.com/xiaojingyu92/Criteria2SQL](https://github.com/xiaojingyu92/Criteria2SQL) | -| css | CC-BY-4.0 | [https://huggingface.co/datasets/zhanghanchong/css](https://huggingface.co/datasets/zhanghanchong/css) | -| eICU | CC-BY-4.0 | [https://github.com/glee4810/EHRSQL](https://github.com/glee4810/EHRSQL) | -| mimic_iii | CC-BY-4.0 | [https://github.com/glee4810/EHRSQL](https://github.com/glee4810/EHRSQL) | -| geonucleardata | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) | -| greatermanchestercrime | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) | -| studentmathscore | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) | -| thehistoryofbaseball | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) | -| uswildfires | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) | -| whatcdhiphop | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) | -| worldsoccerdatabase | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) | -| pesticide | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) | -| mimicsql_data | MIT | [https://github.com/wangpinggl/TREQS](https://github.com/wangpinggl/TREQS) | -| nvbench | MIT | 
[https://github.com/TsinghuaDatabaseGroup/nvBench](https://github.com/TsinghuaDatabaseGroup/nvBench) | -| sede | Apache-2.0 | [https://github.com/hirupert/sede](https://github.com/hirupert/sede) | -| spider | CC-BY-SA-4.0 | [https://huggingface.co/datasets/spider](https://huggingface.co/datasets/spider) | -| sql_create_context | CC-BY-4.0 | [https://huggingface.co/datasets/b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) | -| squall | CC-BY-SA-4.0 | [https://github.com/tzshi/squall](https://github.com/tzshi/squall) | -| wikisql | BSD 3-Clause | [https://github.com/salesforce/WikiSQL](https://github.com/salesforce/WikiSQL) | -| BIRD | Not Found | https://bird-bench.github.io/ | -| CHASE | MIT LICENSE | https://xjtu-intsoft.github.io/chase/ | -| cosql | Not Found | https://yale-lily.github.io/cosql/ | - - -## 软件依赖 - -- Python 3.8+ 和 PyTorch 1.13.1+ -- 🤗Transformers, Datasets, Accelerate, PEFT 和 TRL -- sentencepiece 和 tiktoken -- jieba, rouge-chinese 和 nltk (用于评估) -- gradio 和 matplotlib (用于网页端交互) -- uvicorn, fastapi 和 sse-starlette (用于 API) - -以及 **强而有力的 GPU**! - -## 如何使用 - -### 数据准备(可跳过) - -关于数据集文件的格式,请参考 `train/data/example_text2sql.json` 文件的内容。构建自定义数据集时,既可以使用单个 `.json` 文件,也可以使用一个[数据加载脚本](https://huggingface.co/docs/datasets/dataset_script)和多个文件。 - -注意:使用自定义数据集时,请更新 `train/data/dataset_info.json` 文件,格式请参考 : - -```json -{ - "text2sql": { - "file_name": "text2sql.json", - "columns": { - "prompt": "instruction", - "query": "input", - "response": "output", - "history": "history" - }, - "stage": "sft" - } -} -``` - -### 环境搭建(可跳过) - -```bash -git clone https://github.com/eosphoros-ai/DB-GPT-Hub -conda create -n text2sql_tuning python=3.10 -conda activate text2sql_tuning -cd DB-GPT-Hub/train -pip install -r requirements.txt -``` - -如果要在 Windows 平台上开启量化 LoRA(QLoRA),需要安装预编译的 `bitsandbytes` 库, 支持 CUDA 11.1 到 12.1. - -```bash -pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.39.1-py3-none-win_amd64.whl -``` - -### 单 GPU 训练 - -#### 指令监督微调Text2sql - -```bash -CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \ - --model_name_or_path Baichuan2-13B-Base \ - --do_train \ - --dataset example_text2sql \ - --max_source_length 1024 \ - --max_target_length 512 \ - --template default \ - --finetuning_type lora \ - --lora_rank 32 \ - --lora_alpha 64 \ - --lora_target W_pack \ - --output_dir path_to_sft_checkpoint \ - --overwrite_cache \ - --per_device_train_batch_size 4 \ - --gradient_accumulation_steps 4 \ - --lr_scheduler_type cosine \ - --logging_steps 10 \ - --save_steps 1000 \ - --learning_rate 5e-5 \ - --num_train_epochs 6.0 \ - --plot_loss \ - --fp16 -``` - -### 多 GPU 分布式训练 - -#### 使用 Huggingface Accelerate - -```bash -accelerate config # 首先配置分布式环境 -accelerate launch src/train_bash.py # 参数同上 -``` - -
使用 DeepSpeed ZeRO-2 进行全参数微调的 Accelerate 配置示例 - -```yaml -compute_environment: LOCAL_MACHINE -deepspeed_config: - gradient_accumulation_steps: 4 - gradient_clipping: 0.5 - offload_optimizer_device: none - offload_param_device: none - zero3_init_flag: false - zero_stage: 2 -distributed_type: DEEPSPEED -downcast_bf16: 'no' -machine_rank: 0 -main_training_function: main -mixed_precision: fp16 -num_machines: 1 -num_processes: 4 -rdzv_backend: static -same_network: true -tpu_env: [] -tpu_use_cluster: false -tpu_use_sudo: false -use_cpu: false -``` - -
- -#### 使用 DeepSpeed -```bash -deepspeed --num_gpus 8 --master_port=9901 src/train_bash.py \ - --deepspeed ds_config.json \ - ... # 参数同上 -``` -或者直接使用如下命令: -```bash -bash train_sft.sh -``` - -
使用 DeepSpeed ZeRO-2 进行全参数微调的 DeepSpeed 配置示例 - -```json -{ - "train_micro_batch_size_per_gpu": "auto", - "gradient_accumulation_steps": "auto", - "gradient_clipping": "auto", - "zero_allow_untested_optimizer": true, - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "initial_scale_power": 16, - "loss_scale_window": 1000, - "hysteresis": 2, - "min_loss_scale": 1 - }, - "zero_optimization": { - "stage": 2, - "allgather_partitions": true, - "allgather_bucket_size": 5e8, - "reduce_scatter": true, - "reduce_bucket_size": 5e8, - "overlap_comm": false, - "contiguous_gradients": true - } -} -``` - -
- -### 导出微调后的模型 - -```bash -python src/export_model.py \ - --model_name_or_path path_to_llama_model \ - --template default \ - --finetuning_type lora \ - --checkpoint_dir path_to_checkpoint \ - --output_dir path_to_export -``` - - -## 协议 - -本仓库的代码依照 [Apache-2.0](LICENSE) 协议开源。 - -使用模型权重时,请遵循对应的模型协议: - -- [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) -- [LLaMA-2](https://ai.meta.com/llama/license/) -- [BLOOM](https://huggingface.co/spaces/bigscience/license) -- [Falcon](LICENSE) -- [Baichuan](https://huggingface.co/baichuan-inc/baichuan-7B/resolve/main/baichuan-7B%20%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE.pdf) -- [InternLM](https://github.com/InternLM/InternLM#open-source-license) -- [Qwen](https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/LICENSE) -- [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) -- [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B/blob/main/MODEL_LICENSE) - - -## 致谢 - -本项目训练代码来自于 [LLaMa-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning) 项目; - -本项目数据Prompt模板来自于 [NSQL](https://github.com/NumbersStationAI/NSQL) 项目。 \ No newline at end of file diff --git a/docs/eval_llm_result.md b/docs/eval_llm_result.md index 361460b..623d5ba 100644 --- a/docs/eval_llm_result.md +++ b/docs/eval_llm_result.md @@ -1,19 +1,25 @@ # Evaluation LLM For Text-to-SQL -This doc aims to summarize the performance of publicly available big language models when evaluated on the spider dataset. We hope it will provide a point of reference for folks using these big models for Text-to-SQL tasks. We'll keep sharing eval results from models we've tested and seen others use, and very welcome any contributions to make this more comprehensive. +This doc aims to summarize the performance of publicly available big language models when evaluated on the spider dev dataset. We hope it will provide a point of reference for folks using these big models for Text-to-SQL tasks. We'll keep sharing eval results from models we've tested and seen others use, and very welcome any contributions to make this more comprehensive. ## 1.LLMs Text-to-SQL capability evaluation | name | Execution Accuracy | reference | | ------------------------------ | ------------------ | ---------------------------------------------------------------------------------- | -| ChatGPT | 0.728 | [quote](https://www.numbersstation.ai/post/nsql-llama-2-7b) | -| GPT 4 | 0.762 | [quote](https://www.numbersstation.ai/post/nsql-llama-2-7b) | -| wizardcoder | 0.610 | [quote](https://github.com/cuplv/text-to-sql-wizardcoder/tree/main) | -| llama2_13b_hf | 0.252 | run in this project,default param set | -| llama2_13b_hf_lora | 0.697 | run in this project,default param set | -| CodeLlama-7b-Instruct-hf_qlora | 0.774 | run in this project,in refactor branch, with qlora and nf4,bit4 SFT, epoch8 | - -We will support CodeLlama with lora soon ,and give more exp . -Starring our project is the best way to encourage us and motivate us to release more code and experiment results. 
+| **GPT-4** | **0.762** | [numbersstation-eval-res](https://www.numbersstation.ai/post/nsql-llama-2-7b) |
+| ChatGPT | 0.728 | [numbersstation-eval-res](https://www.numbersstation.ai/post/nsql-llama-2-7b) |
+| **CodeLlama-13b-Instruct-hf_lora** | **0.789** | LoRA SFT trained in this project on the Spider train dataset only, evaluated with this project's evaluation method |
+| CodeLlama-13b-Instruct-hf_qlora | 0.774 | QLoRA (NF4, 4-bit) SFT trained in this project on the Spider train dataset only, evaluated with this project's evaluation method |
+| wizardcoder | 0.610 | [text-to-sql-wizardcoder](https://github.com/cuplv/text-to-sql-wizardcoder/tree/main) |
+| CodeLlama-13b-Instruct-hf | 0.556 | evaluated in this project with default parameters |
+| Baichuan2-13B-Chat | 0.392 | evaluated in this project with default parameters |
+| llama2_13b_hf | xxx | run in this project with the default parameter set |
+| llama2_13b_hf_lora_best | 0.744 | LoRA SFT trained in this project on the Spider train dataset only, evaluated with this project's evaluation method |
+
+Note that these evaluation results are obtained with the current project's parameter settings. We strive to provide objective assessments, but because of variations in parameters such as the temperature value, different people using different methods may obtain different results. These results should be treated as **reference only**. We welcome peers to contribute their own evaluation results (along with the corresponding parameter values).
+
+If you have improved methods for objective evaluation, we warmly welcome contributions to the project's codebase.
## 2. Acknowledgements