We found that running Vicuna and Llama 2 on an A100 versus a V100 produces different results, while other models such as Falcon do not show this issue. The results are below.
The experiments were run on Google Colab Pro+.
We use four LLM benchmarks to evaluate the models:
- HellaSwag: acc
- TruthfulQA_mc: mc1, mc2
- ARC_challenge: acc
- MMLU (HendrycksTest): average of the acc scores across all test subjects
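The MMLU score above is the unweighted mean of the per-subject test accuracies. A minimal sketch of that aggregation, assuming per-subject accuracies are available as a dict (the subject names and values here are illustrative, not real results):

```python
# Hypothetical per-subject MMLU accuracies (illustrative values only).
mmlu_results = {
    "hendrycksTest-abstract_algebra": 0.30,
    "hendrycksTest-anatomy": 0.45,
    "hendrycksTest-astronomy": 0.50,
}

def mmlu_average(results: dict) -> float:
    """Unweighted mean of per-subject accuracies."""
    return sum(results.values()) / len(results)

print(round(mmlu_average(mmlu_results), 4))  # 0.4167
```

Note this is a simple average over subjects, not weighted by the number of questions per subject.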