docs:eval add exp and summary res,make base eval as same ,del filter #177

Merged · 1 commit · Dec 8, 2023
276 changes: 274 additions & 2 deletions docs/eval_llm_result.md
This doc aims to summarize the performance of publicly available large language models.
| Baichuan2-13B-Chat | 0.392 | evaluated in this project with default parameters |
| llama2_13b_hf | 0.449 | [numbersstation-eval-res](https://www.numbersstation.ai/post/nsql-llama-2-7b) |
| llama2_13b_hf_lora_best | 0.744 | SFT-trained by this project on the Spider training set only, evaluated the same way as this project |
| chatglm3_lora_default | 0.590 | SFT-trained by this project on the Spider training set only, evaluated the same way as this project |
| chatglm3_qlora_default | 0.581 | SFT-trained by this project on the Spider training set only, evaluated the same way as this project |




It's important to note that our evaluation results are obtained based on the current evaluation setup and code.
If you have improved methods for objective evaluation, we warmly welcome contributions to the project's codebase.


## LLMs Text-to-SQL capability evaluation before 2023-12-08
The following are our experiments' execution-accuracy results on Spider. This round is based on the database downloaded from [the Spider-based test-suite](https://github.com/taoyds/test-suite-sql-eval) (1.27 GB), which differs from the one on the Spider official [website](https://yale-lily.github.io/spider) (only 95 MB).
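EX in the table below is execution accuracy: a predicted query counts as correct when it executes to the same result as the gold query on the test database. A minimal sketch of that comparison (the helper name `execution_match` and the toy `singer` table are our illustrative assumptions, not part of the test-suite code):

```python
import sqlite3

def execution_match(pred_sql: str, gold_sql: str, conn: sqlite3.Connection) -> bool:
    """True when both queries execute to the same multiset of rows."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    if "order by" in gold_sql.lower():
        # Row order matters only when the gold query enforces one
        return pred_rows == gold_rows
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

# Toy schema standing in for a real Spider database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("A", 30), ("B", 25)])

print(execution_match("SELECT name FROM singer WHERE age > 26",
                      "SELECT name FROM singer WHERE age >= 30", conn))  # True
```

The actual test-suite additionally runs each query against multiple database instances; the sketch above only shows the per-database comparison.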

<table>
<tr>
<th>Model</th>
<th>Method</th>
<th>EX (easy)</th>
<th>EX (medium)</th>
<th>EX (hard)</th>
<th>EX (extra)</th>
<th>EX (all)</th>
</tr>
<tr>
<td>Llama2-7B-Chat</td>
<td>base</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.887</td>
<td>0.641</td>
<td>0.489</td>
<td>0.331</td>
<td>0.626</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.847</td>
<td>0.623</td>
<td>0.466</td>
<td>0.361</td>
<td>0.608</td>
</tr>
<tr>
<td>Llama2-13B-Chat</td>
<td>base</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.907</td>
<td>0.729</td>
<td>0.552</td>
<td>0.343</td>
<td>0.68</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.911</td>
<td>0.7</td>
<td>0.552</td>
<td>0.319</td>
<td>0.664</td>
</tr>
<tr>
<td>CodeLlama-7B-Instruct</td>
<td>base</td>
<td>0.214</td>
<td>0.177</td>
<td>0.092</td>
<td>0.036</td>
<td>0.149</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.923</td>
<td>0.756</td>
<td>0.586</td>
<td>0.349</td>
<td>0.702</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.911</td>
<td>0.751</td>
<td>0.598</td>
<td>0.331</td>
<td>0.696</td>
</tr>
<tr>
<td>CodeLlama-13B-Instruct</td>
<td>base</td>
<td>0.698</td>
<td>0.601</td>
<td>0.408</td>
<td>0.271</td>
<td>0.539</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.94</td>
<td>0.789</td>
<td>0.684</td>
<td>0.404</td>
<td>0.746</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.94</td>
<td>0.774</td>
<td>0.626</td>
<td>0.392</td>
<td>0.727</td>
</tr>
<tr>
<td>Baichuan2-7B-Chat</td>
<td>base</td>
<td>0.577</td>
<td>0.352</td>
<td>0.201</td>
<td>0.066</td>
<td>0.335</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.871</td>
<td>0.63</td>
<td>0.448</td>
<td>0.295</td>
<td>0.603</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.891</td>
<td>0.637</td>
<td>0.489</td>
<td>0.331</td>
<td>0.624</td>
</tr>
<tr>
<td>Baichuan2-13B-Chat</td>
<td>base</td>
<td>0.581</td>
<td>0.413</td>
<td>0.264</td>
<td>0.187</td>
<td>0.392</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.903</td>
<td>0.702</td>
<td>0.569</td>
<td>0.392</td>
<td>0.678</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.895</td>
<td>0.675</td>
<td>0.58</td>
<td>0.343</td>
<td>0.659</td>
</tr>
<tr>
<td>Qwen-7B-Chat</td>
<td>base</td>
<td>0.395</td>
<td>0.256</td>
<td>0.138</td>
<td>0.042</td>
<td>0.235</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.855</td>
<td>0.688</td>
<td>0.575</td>
<td>0.331</td>
<td>0.652</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.911</td>
<td>0.675</td>
<td>0.575</td>
<td>0.343</td>
<td>0.662</td>
</tr>
<tr>
<td>Qwen-14B-Chat</td>
<td>base</td>
<td>0.871</td>
<td>0.632</td>
<td>0.368</td>
<td>0.181</td>
<td>0.573</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.895</td>
<td>0.702</td>
<td>0.552</td>
<td>0.331</td>
<td>0.663</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.919</td>
<td>0.744</td>
<td>0.598</td>
<td>0.367</td>
<td>0.701</td>
</tr>
<tr>
<td>ChatGLM3-6b</td>
<td>base</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>lora</td>
<td>0.855</td>
<td>0.605</td>
<td>0.477</td>
<td>0.271</td>
<td>0.59</td>
</tr>
<tr>
<td></td>
<td>qlora</td>
<td>0.843</td>
<td>0.603</td>
<td>0.506</td>
<td>0.211</td>
<td>0.581</td>
</tr>
</table>


1. All LoRA and QLoRA variants are trained with this project's default parameters on the Spider training set.
2. All candidate models use the same evaluation method and prompt, and the prompt explicitly requires the model to output only SQL. The base results of Llama2-7B-Chat, Llama2-13B-Chat, and ChatGLM3-6b are 0; analysis shows most errors occur because these models generate content other than SQL.
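The zero base scores come from responses that wrap the query in extra prose, which makes it unexecutable under execution matching. A hedged sketch of stripping that wrapping before execution (`extract_sql` is our illustrative helper, not part of this project's pipeline):

```python
import re

def extract_sql(response: str) -> str:
    """Pull the first SQL statement out of a chatty model response."""
    # Prefer a fenced code block if the model emitted one
    fence = re.search(r"`{3}(?:sql)?\s*(.*?)`{3}", response, re.S | re.I)
    if fence:
        response = fence.group(1)
    # Otherwise keep from the first SELECT up to a semicolon or the end
    match = re.search(r"(SELECT\b.*?)(?:;|$)", response, re.S | re.I)
    return match.group(1).strip() if match else response.strip()

print(extract_sql("The answer is: SELECT count(*) FROM concert"))
```

Whether such post-processing should be applied to base models is a design choice; the results above deliberately score the raw output so that all methods are compared under the same SQL-only prompt.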


## 2. Acknowledgements
Thanks to the following open source projects.
