Metrics from testing llama-7b do not match the metrics posted by OpenCompass #256
-
Using OpenCompass to evaluate llama-7b on C-Eval, the average score across all subjects comes out to 24.66, but the result officially posted by OpenCompass is 27.3. What could cause this difference? My local evaluation config is as follows:

from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM
models = [
    # LLaMA 7B
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-7b-hf',
        path="huggyllama/llama-7b",
        tokenizer_path='huggyllama/llama-7b',
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(device_map='auto'),
        # if False, inference runs in a for-loop without batch padding
        batch_padding=False,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]

with read_base():
    from .datasets.ceval.ceval_ppl import ceval_datasets

datasets = [*ceval_datasets]
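For reference, I launch this config the same way as the demo command shown further down in this thread; the config file name and output directory below are just my local choices:

# python run.py configs/eval_llama_7b.py -w outputs/llama_7b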
-
Averaging the scores of all 52 subjects gives 24.66.
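This is how I compute that average; a minimal sketch that assumes the per-subject accuracies have already been collected into a dict (the two entries below are placeholders, not real scores):

# Hypothetical example: average the per-subject C-Eval accuracies.
# In practice the dict is filled from the summary file that OpenCompass
# writes under the output directory; the values here are placeholders.
subject_scores = {
    'ceval-computer_network': 25.0,
    'ceval-operating_system': 24.0,
    # ... remaining 50 subjects ...
}

average = sum(subject_scores.values()) / len(subject_scores)
print(f'average over {len(subject_scores)} subjects: {average:.2f}')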
-
I also evaluated chinese-llama-2-7b, and the resulting scores are far from the results posted by OpenCompass.
The evaluation script is as follows:

from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

batch_size = 20

# Models to evaluate
model_name_or_paths = [
    'ziqingyang/chinese-llama-2-7b'
]

models = []
for model_name_or_path in model_name_or_paths:
    model = dict(
        type=HuggingFaceCausalLM,
        abbr=model_name_or_path,
        path=model_name_or_path,
        tokenizer_path=model_name_or_path,
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False,
            trust_remote_code=True,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=batch_size,
        model_kwargs=dict(device_map='auto', trust_remote_code=True),
        # if False, inference runs in a for-loop without batch padding
        batch_padding=False,
        run_cfg=dict(num_gpus=2, num_procs=2),
    )
    models.append(model)

# Datasets to evaluate on
with read_base():
    from .datasets.ceval.ceval_ppl import ceval_datasets
    # from .datasets.collections.base_medium import datasets
    # from .models.llama2_7b import models

datasets = [*ceval_datasets]

# python run.py configs/eval_demo.py -w outputs/demo
-
I reran my test after encountering this issue. Here are the details of my reproduction:
The intermediate output files: 20230825_032645.zip
-
HF model revisions:
Dependency versions:
You are right, my torch version is 2.0, so the cause was the dependency versions.
After changing the versions to match yours, I get the correct result.
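In case it helps anyone else hitting the same gap, a quick sketch of how I compared the two environments; it only prints the versions of the packages that mattered here (torch was the culprit for me; transformers is included merely as another common suspect):

# Print the library versions so the two environments can be compared.
import torch
import transformers

print('torch:', torch.__version__)
print('transformers:', transformers.__version__)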