[Bug] Evaluation scores for Qwen/Qwen2.5-72B drop significantly in v0.3.5 #1675

guoshengCS · 2024-11-11T11:10:32Z

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
The bug has not been fixed in the latest version.

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

torch==2.2.0+vllm==0.4.0+OpenCompass==0.3.5

Reproduces the problem - code/configuration sample

Evaluating Qwen/Qwen2.5-72B with the following evaluation config:

from mmengine.config import read_base

with read_base():
    from .datasets.collections.leaderboard.qwen import datasets
    from .summarizers.leaderboard import summarizer


from opencompass.models import VLLM, HuggingFaceBaseModel


models = [
    dict(
        type=VLLM,
        abbr='qwen2.5-72b-vllm',
        path='Qwen/Qwen2.5-72B',
        model_kwargs=dict(
            tensor_parallel_size=4,
            gpu_memory_utilization=0.8,  # set this to avoid OOM temporarily
            enforce_eager=True,
        ),
        stop_words=['<|endoftext|>', '<|im_end|>'],
        max_out_len=128,
        max_seq_len=8192,
        batch_size=16,
        generation_kwargs=dict(  # args for vllm.SamplingParams
            temperature=0,  # greedy decoding
        ),
        run_cfg=dict(num_gpus=4),
    )
]

Reproduces the problem - command or script

Running the config above directly with run.py, some tasks score low on the latest v0.3.5, a large drop compared with the earlier v0.2.5 (commit e0d7808).

Reproduces the problem - error message

Left: scores from v0.3.5 vs. right: scores from the earlier code version

Other information

No response


guoshengCS · 2024-11-11T11:14:53Z

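The main change is in the scores of some tasks in leaderboard/qwen.py after updating to the latest code. From a quick look:

math: the new code moved to math_postprocess_v2, and the score went from 49.88 to 4.24. There is already a comment about this at Upgrade default math pred_postprocessor #1340 (comment).
humaneval: the new code moved to humaneval_postprocess_v2, and the score went from 41.46 to 7.93.
ARC-c, ARC-e, openbookqa_fact, AX_b, AX_g, COPA, hellaswag, piqa: these use first_option_postprocess, which changed in the new code. The ARC-c score went from 88 to 24. A newly added pattern seems to be responsible, and the use of match.group(0) there also does not look right.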
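To make the match.group(0) concern concrete, here is a minimal sketch. It is not the OpenCompass implementation: the pattern and both helper functions are hypothetical and only illustrate how scanning the whole match for an option letter can pick up a letter from the surrounding words, while reading the capture group returns the intended option.

```python
import re

OPTIONS = "ABCD"

# Hypothetical pattern for illustration only; not copied from OpenCompass.
PATTERN = re.compile(r"Answer\s*(?:is|:)\s*\(?([A-D])\)?")


def pick_from_whole_match(text: str) -> str:
    """Scan the full matched span (group 0) for the first option letter."""
    match = PATTERN.search(text)
    if match is None:
        return ""
    whole = match.group(0)  # e.g. "Answer: C", which also contains "A"
    for option in OPTIONS:
        if option in whole:
            return option
    return ""


def pick_from_capture(text: str) -> str:
    """Return only the captured option letter (group 1)."""
    match = PATTERN.search(text)
    return match.group(1) if match else ""


if __name__ == "__main__":
    sample = "Answer: C"
    print(pick_from_whole_match(sample))  # the "A" in "Answer" wins
    print(pick_from_capture(sample))      # the captured option
```

Under this assumed pattern, any prediction whose matched span contains an earlier option letter is misread, which would push multiple-choice accuracy toward chance level.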

tonysy · 2024-11-13T12:39:55Z

Thanks for the report, we will follow this issue and check the problem.

MaiziXiao · 2024-11-14T08:41:51Z

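The third point has been fixed in https://github.com/open-compass/opencompass/pull/1688/files. Please pull the latest code and rerun the evaluation.
For base models, we will later release evaluation configs dedicated to base models.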

guoshengCS · 2024-11-14T11:46:01Z

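> The third point has been fixed in https://github.com/open-compass/opencompass/pull/1688/files. Please pull the latest code and rerun the evaluation. For base models, we will later release evaluation configs dedicated to base models.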

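Thanks for the fix. One more thing: the evaluation config used here is leaderboard/qwen.py, and there is also leaderboard/qwen_chat.py. So is it not the case that qwen.py is for base models and qwen_chat.py is for chat models? With only the match.group(0) change applied, the score is still low (ARC-c scores 31.86), so there is still a problem. Does that mean there is currently no evaluation config that works well for base models?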

guoshengCS · 2024-11-14T11:52:25Z

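Another question: in the new version, instruct models use HuggingFacewithChatTemplate/VLLMwithChatTemplate by default, but these classes do not implement the get_ppl method, so evaluating with a dataset config that uses PPLInferencer raises an error. Is it intended that instruct models do not support PPL-based evaluation?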

File "/checkpoint/binary/train_package/opencompass/models/base.py", line 84, in get_ppl
raise NotImplementedError(f'{self.__class__.__name__} does not support'
NotImplementedError: VLLMwithChatTemplate does not support ppl-based evaluation yet, try gen-based instead.
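The traceback is consistent with a base class whose get_ppl raises by default and a subclass that never overrides it. The sketch below reproduces that pattern with hypothetical class names (BaseModelSketch, GenOnlyModelSketch, PplCapableModelSketch) and a hypothetical supports_ppl helper; it is an illustration of the mechanism, not the OpenCompass source.

```python
class BaseModelSketch:
    """Hypothetical stand-in for a model base class."""

    def get_ppl(self, inputs):
        # Default behaviour: subclasses must override to support PPL.
        raise NotImplementedError(
            f'{self.__class__.__name__} does not support '
            'ppl-based evaluation yet, try gen-based instead.')


class GenOnlyModelSketch(BaseModelSketch):
    """Implements generation only, so get_ppl falls back to the base."""

    def generate(self, inputs):
        return ['' for _ in inputs]


class PplCapableModelSketch(BaseModelSketch):
    """Overrides get_ppl, so PPL-based evaluation can proceed."""

    def get_ppl(self, inputs):
        return [0.0 for _ in inputs]  # placeholder values


def supports_ppl(model_cls) -> bool:
    """True if the class overrides get_ppl instead of inheriting the stub."""
    return model_cls.get_ppl is not BaseModelSketch.get_ppl


if __name__ == "__main__":
    for cls in (GenOnlyModelSketch, PplCapableModelSketch):
        print(cls.__name__, supports_ppl(cls))
```

A check like supports_ppl could be run before launching a long evaluation to catch a PPL dataset config paired with a generation-only model class.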

BIGWangYuDong · 2024-11-15T03:14:56Z

> Another question: in the new version, instruct models use HuggingFacewithChatTemplate/VLLMwithChatTemplate by default, but these classes do not implement the get_ppl method, so evaluating with a dataset config that uses PPLInferencer raises an error. Is it intended that instruct models do not support PPL-based evaluation?
> File "/checkpoint/binary/train_package/opencompass/models/base.py", line 84, in get_ppl
> raise NotImplementedError(f'{self.__class__.__name__} does not support'
> NotImplementedError: VLLMwithChatTemplate does not support ppl-based evaluation yet, try gen-based instead.

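Jumping in on this thread: I also have a question about HuggingFacewithChatTemplate/VLLMwithChatTemplate. The template_parser was changed from LMTemplateParser to APITemplateParser, but some earlier custom (DIY) configs no longer seem to work with it, and the information saved in the predictions does not show the full text actually sent to the model.

From a quick read of the source, LMTemplateParser and APITemplateParser seem to use different concatenation strategies, and api_role is a mandatory key, which breaks backward compatibility (BC), for example around begin and end. Could the meta template documentation be expanded and improved?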
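As a rough illustration of the difference being described, the sketch below contrasts an LM-style template, where begin and end strings are concatenated into one prompt string, with an API-style template, where each turn becomes a role-tagged message. The two render functions are hypothetical simplifications written for this example; they are not the LMTemplateParser or APITemplateParser implementations, and the template contents are assumptions.

```python
# Assumed LM-style meta template: begin/end markers wrap each turn.
lm_meta_template = dict(round=[
    dict(role='HUMAN', begin='<|im_start|>user\n', end='<|im_end|>\n'),
    dict(role='BOT', begin='<|im_start|>assistant\n', end='<|im_end|>\n',
         generate=True),
])

# Assumed API-style meta template: every role must carry api_role.
api_meta_template = dict(round=[
    dict(role='HUMAN', api_role='HUMAN'),
    dict(role='BOT', api_role='BOT', generate=True),
])


def render_lm_style(template, turns):
    """Concatenate all turns into one string that is sent to the model."""
    spec = {item['role']: item for item in template['round']}
    parts = []
    for role, text in turns:
        cfg = spec[role]
        parts.append(cfg.get('begin', '') + text + cfg.get('end', ''))
    return ''.join(parts)


def render_api_style(template, turns):
    """Produce role-tagged messages; begin/end play no part here."""
    spec = {item['role']: item for item in template['round']}
    # Raises KeyError when a role lacks api_role (the mandatory key).
    return [dict(role=spec[role]['api_role'], prompt=text)
            for role, text in turns]


if __name__ == "__main__":
    turns = [('HUMAN', 'Hi')]
    print(repr(render_lm_style(lm_meta_template, turns)))
    print(render_api_style(api_meta_template, turns))
```

In this simplified picture, an LM-style config passed to the API-style renderer fails on the missing api_role, and the message list on its own does not show the final text after a chat template is applied downstream.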

MaiziXiao · 2024-11-18T12:21:28Z

VLLMwithChatTemplate

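Please try again with #1699.
qwen.py is for base models. It is just that base models follow instructions poorly and the postprocessing is not fully robust, so some models end up with lower scores. For datasets that support PPL, base models are generally evaluated with PPL.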
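For readers unfamiliar with PPL-based evaluation, the idea is to score each candidate answer by the perplexity the model assigns to it and pick the lowest, which avoids parsing free-form generations. The sketch below assumes per-token log-probabilities are already available; the function names and the example numbers are made up for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_option_by_ppl(option_logprobs):
    """Return the option label whose continuation has the lowest perplexity."""
    return min(option_logprobs,
               key=lambda label: perplexity(option_logprobs[label]))


if __name__ == "__main__":
    # Made-up log-probabilities for two candidate continuations.
    scores = {
        'A': [-2.0, -1.5],
        'B': [-0.2, -0.1],
    }
    print(pick_option_by_ppl(scores))
```

Because the prediction comes from a comparison of scores, this approach does not depend on the model following an answer format, which is why it suits base models.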

MaiziXiao · 2024-11-18T12:22:42Z

> Another question: in the new version, instruct models use HuggingFacewithChatTemplate/VLLMwithChatTemplate by default, but these classes do not implement the get_ppl method, so evaluating with a dataset config that uses PPLInferencer raises an error. Is it intended that instruct models do not support PPL-based evaluation?
> File "/checkpoint/binary/train_package/opencompass/models/base.py", line 84, in get_ppl
> raise NotImplementedError(f'{self.__class__.__name__} does not support'
> NotImplementedError: VLLMwithChatTemplate does not support ppl-based evaluation yet, try gen-based instead.

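This is because the vLLM framework itself does not return logits, so PPL cannot be computed. You can try LMDeploywithChatTemplate for PPL-based evaluation. For instruct models we generally recommend gen-based evaluation.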

MaiziXiao · 2024-11-18T12:23:13Z

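> Another question: in the new version, instruct models use HuggingFacewithChatTemplate/VLLMwithChatTemplate by default, but these classes do not implement the get_ppl method, so evaluating with a dataset config that uses PPLInferencer raises an error. Is it intended that instruct models do not support PPL-based evaluation?
> File "/checkpoint/binary/train_package/opencompass/models/base.py", line 84, in get_ppl
> raise NotImplementedError(f'{self.__class__.__name__} does not support'
> NotImplementedError: VLLMwithChatTemplate does not support ppl-based evaluation yet, try gen-based instead.
> Jumping in on this thread: I also have a question about HuggingFacewithChatTemplate/VLLMwithChatTemplate. The template_parser was changed from LMTemplateParser to APITemplateParser, but some earlier custom (DIY) configs no longer seem to work with it, and the information saved in the predictions does not show the full text actually sent to the model.
>
> From a quick read of the source, LMTemplateParser and APITemplateParser seem to use different concatenation strategies, and api_role is a mandatory key, which breaks backward compatibility (BC), for example around begin and end. Could the meta template documentation be expanded and improved?

Noted. We will expand and improve the meta template documentation later.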

guoshengCS · 2024-11-22T08:16:46Z

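> Another question: in the new version, instruct models use HuggingFacewithChatTemplate/VLLMwithChatTemplate by default, but these classes do not implement the get_ppl method, so evaluating with a dataset config that uses PPLInferencer raises an error. Is it intended that instruct models do not support PPL-based evaluation?
> File "/checkpoint/binary/train_package/opencompass/models/base.py", line 84, in get_ppl
> raise NotImplementedError(f'{self.__class__.__name__} does not support'
> NotImplementedError: VLLMwithChatTemplate does not support ppl-based evaluation yet, try gen-based instead.
> This is because the vLLM framework itself does not return logits, so PPL cannot be computed. You can try LMDeploywithChatTemplate for PPL-based evaluation. For instruct models we generally recommend gen-based evaluation.

vLLM can return logits, and our VLLM class does support get_ppl: https://github.com/open-compass/opencompass/blob/0.3.5/opencompass/models/vllm.py#L110 . It is only VLLMwithChatTemplate that does not support get_ppl. Also, LMDeploywithChatTemplate/TurboMindModelwithChatTemplate do not seem to implement get_ppl either.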

mm-assistant bot assigned bittersweet1999 Nov 11, 2024

guoshengCS changed the title ~~[Bug]~~ [Bug] Evaluation scores for Qwen/Qwen2.5-72B drop significantly in v0.3.5 Nov 11, 2024

tonysy assigned MaiziXiao Nov 13, 2024
