From 7e9a83586aab3fb217ded81ff3a5bcda2433fa9f Mon Sep 17 00:00:00 2001
From: wangzaistone
Date: Fri, 8 Dec 2023 17:35:05 +0800
Subject: [PATCH] docs: eval add exp and summary res, make base eval as same,
 del filter

---
 docs/eval_llm_result.md | 276 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 274 insertions(+), 2 deletions(-)

diff --git a/docs/eval_llm_result.md b/docs/eval_llm_result.md
index 1431b90..5be4af1 100644
--- a/docs/eval_llm_result.md
+++ b/docs/eval_llm_result.md
@@ -17,8 +17,7 @@ This doc aims to summarize the performance of publicly available big language mo
 | Baichuan2-13B-Chat | 0.392 | eval in this project default param |
 | llama2_13b_hf | 0.449 | [numbersstation-eval-res](https://www.numbersstation.ai/post/nsql-llama-2-7b) |
 | llama2_13b_hf_lora_best | 0.744 | sft train by our this project,only used spider train dataset, the same eval way in this project. |
-| chatglm3_lora_default | 0.590 | sft train by our this project,only used spider train dataset, the same eval way in this project. |
-| chatglm3_qlora_default | 0.581 | sft train by our this project,only used spider train dataset, the same eval way in this project. |
+
@@ -28,6 +27,279 @@ It's important to note that our evaluation results are obtained based on the cur
 If you have improved methods for objective evaluation, we warmly welcome contributions to the project's codebase.
 
+## LLMs Text-to-SQL capability evaluation before 20231208
+
+The following are our experimental execution accuracy (EX) results on Spider. This round of evaluation is based on the database downloaded from [the Spider-based test-suite](https://github.com/taoyds/test-suite-sql-eval) (about 1.27 GB), which is different from the one on the Spider official [website](https://yale-lily.github.io/spider) (only about 95 MB). Per-model results are shown in the table below.
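+
+EX counts a prediction as correct when the predicted SQL and the gold SQL return the same result set on the corresponding database. Below is a minimal sketch of that check, assuming the SQLite database files from the test-suite download; the function and path names are illustrative rather than the project's actual evaluation code (the real evaluation relies on the test-suite-sql-eval scripts):
+
+```python
+import sqlite3
+
+def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
+    """Execute predicted and gold SQL on one database and compare results."""
+    conn = sqlite3.connect(db_path)
+    try:
+        try:
+            pred_rows = conn.execute(pred_sql).fetchall()
+        except sqlite3.Error:
+            return False  # a prediction that fails to execute counts as wrong
+        gold_rows = conn.execute(gold_sql).fetchall()
+        # Compare as unordered multisets: most Spider queries have no ORDER BY,
+        # so row order is not guaranteed and should not affect the verdict.
+        return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
+    finally:
+        conn.close()
+
+# Illustrative call against one of the downloaded test-suite databases:
+# execution_match("database/concert_singer/concert_singer.sqlite",
+#                 "SELECT count(*) FROM singer", "SELECT count(*) FROM singer")
+```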
+
+| Model                  | Method | Easy  | Medium | Hard  | Extra | All   |
+| ---------------------- | ------ | ----- | ------ | ----- | ----- | ----- |
+| Llama2-7B-Chat         | base   | 0     | 0      | 0     | 0     | 0     |
+|                        | lora   | 0.887 | 0.641  | 0.489 | 0.331 | 0.626 |
+|                        | qlora  | 0.847 | 0.623  | 0.466 | 0.361 | 0.608 |
+| Llama2-13B-Chat        | base   | 0     | 0      | 0     | 0     | 0     |
+|                        | lora   | 0.907 | 0.729  | 0.552 | 0.343 | 0.680 |
+|                        | qlora  | 0.911 | 0.700  | 0.552 | 0.319 | 0.664 |
+| CodeLlama-7B-Instruct  | base   | 0.214 | 0.177  | 0.092 | 0.036 | 0.149 |
+|                        | lora   | 0.923 | 0.756  | 0.586 | 0.349 | 0.702 |
+|                        | qlora  | 0.911 | 0.751  | 0.598 | 0.331 | 0.696 |
+| CodeLlama-13B-Instruct | base   | 0.698 | 0.601  | 0.408 | 0.271 | 0.539 |
+|                        | lora   | 0.940 | 0.789  | 0.684 | 0.404 | 0.746 |
+|                        | qlora  | 0.940 | 0.774  | 0.626 | 0.392 | 0.727 |
+| Baichuan2-7B-Chat      | base   | 0.577 | 0.352  | 0.201 | 0.066 | 0.335 |
+|                        | lora   | 0.871 | 0.630  | 0.448 | 0.295 | 0.603 |
+|                        | qlora  | 0.891 | 0.637  | 0.489 | 0.331 | 0.624 |
+| Baichuan2-13B-Chat     | base   | 0.581 | 0.413  | 0.264 | 0.187 | 0.392 |
+|                        | lora   | 0.903 | 0.702  | 0.569 | 0.392 | 0.678 |
+|                        | qlora  | 0.895 | 0.675  | 0.580 | 0.343 | 0.659 |
+| Qwen-7B-Chat           | base   | 0.395 | 0.256  | 0.138 | 0.042 | 0.235 |
+|                        | lora   | 0.855 | 0.688  | 0.575 | 0.331 | 0.652 |
+|                        | qlora  | 0.911 | 0.675  | 0.575 | 0.343 | 0.662 |
+| Qwen-14B-Chat          | base   | 0.871 | 0.632  | 0.368 | 0.181 | 0.573 |
+|                        | lora   | 0.895 | 0.702  | 0.552 | 0.331 | 0.663 |
+|                        | qlora  | 0.919 | 0.744  | 0.598 | 0.367 | 0.701 |
+| ChatGLM3-6b            | base   | 0     | 0      | 0     | 0     | 0     |
+|                        | lora   | 0.855 | 0.605  | 0.477 | 0.271 | 0.590 |
+|                        | qlora  | 0.843 | 0.603  | 0.506 | 0.211 | 0.581 |
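+
+As note 2 below explains, every candidate model was evaluated with the same prompt, which explicitly requires the model to output only SQL. A minimal sketch of such a prompt builder follows; the template wording is an illustrative assumption, not the project's exact template:
+
+```python
+# Illustrative template: the project's real prompt lives in its training/eval
+# configs; this only demonstrates the shared "output only SQL" constraint.
+TEMPLATE = (
+    "Given the following database schema, answer the question with a single "
+    "SQL query. Output only the SQL, with no explanation.\n\n"
+    "Schema:\n{schema}\n\nQuestion: {question}\nSQL:"
+)
+
+def build_eval_prompt(schema: str, question: str) -> str:
+    """Fill the shared template so every model sees an identical prompt."""
+    return TEMPLATE.format(schema=schema, question=question)
+
+print(build_eval_prompt(
+    "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);",
+    "How many singers do we have?",
+))
+```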
+
+1. All the LoRA and QLoRA weights above were trained with this project's default settings, using only the Spider training set.
+2. All candidate models were evaluated with the same method and the same prompt, and the prompt explicitly requires the model to output only SQL. The base evaluation results of Llama2-7B-Chat, Llama2-13B-Chat, and ChatGLM3-6b are 0; after analysis, we found that many of the errors occur because these models generate content other than SQL.
+
 
 ## 2. Acknowledgements
 Thanks to the following open source projects.