[Bug] Subjective evaluation score is parsed incorrectly #3

WencWu · 2024-11-28T07:12:33Z

For example, in the sample below, **Score: 9** is incorrectly parsed as 5.0:

{
    "question": "To cook perfectly golden pancakes,",
    "obj": {
        "generation_a": "Mix the ingredients together in a bowl and pour it onto a hot pan.",
        "generation_b": "To cook golden pancakes, mix flour, eggs, milk, and baking powder. Heat a non-stick pan, scoop batter, and cook until bubbles form. Flip and cook until golden. Serve with maple syrup.🥞",
        "preference": "ansB",
    },
    "sub": {
        "generation_a": "To cook golden pancakes, mix flour, eggs, milk, and baking powder. Heat a non-stick pan, scoop batter, and cook until bubbles form. Flip and cook until golden. Serve with maple syrup.🥞",
        "generation_b": "Mix the ingredients together in a bowl and pour it onto a hot pan.",
        "feedback": "Answer A provides a more detailed, step-by-step process of how to cook pancakes. It specifies the ingredients needed, the type of pan to use, and how to know when the pancake is ready to flip. It also suggests a way to serve the pancakes. Answer B, on the other hand, is more general and lacks these details. \n\nDecision: A.",
        "preference": "ansA",
        "llm_a": "openbuddy-70b",
        "llm_b": "internlm-7b",
    },
    "data_source": "piqa",
    "metadata": {"quality_a_sub": "high", "quality_b_sub": "low"},
    "evaluation": {
        "cot": "### Evaluation of Generated Feedback\n\n**Generated Feedback Analysis:**\n\n1. **Clarity and Detail:**\n   - The generated feedback accurately highlights that Answer A provides a clear and detailed step-by-step guide, mentioning specific ingredients and cooking steps. This aligns well with the reference feedback, which also emphasizes the detailed nature of Answer A.\n\n2. **Additional Information:**\n   - The generated feedback notes the suggestion of serving with maple syrup, which is a practical addition. This aspect is not explicitly mentioned in the reference feedback, making the generated feedback slightly more comprehensive in this regard.\n\n3. **Engagement:**\n   - The generated feedback mentions the use of an emoji in Answer A, adding an element of engagement. This is an additional observation not covered in the reference feedback, which could be seen as enhancing the evaluation by considering the engagement factor.\n\n4. **Critique of Answer B:**\n   - Both the generated and reference feedbacks agree that Answer B lacks detail and completeness. The generated feedback expands on this by pointing out the absence of engaging elements, which is a valid critique but not present in the reference feedback.\n\n5. **Overall Preference:**\n   - The generated feedback clearly states a preference for Answer A, providing reasons that are consistent with the reference feedback. It also adds value by discussing engagement and additional serving suggestions.\n\n**Comparison with Reference Feedback:**\n\n- The generated feedback provides a more comprehensive analysis by including aspects of engagement and additional serving suggestions, which are not present in the reference feedback. 
This makes the generated feedback slightly more detailed and engaging.\n\n**Score: 9**\n\nThe generated feedback is slightly better than the reference feedback due to its additional insights into engagement and serving suggestions, while still maintaining a clear and accurate analysis of the answers.",
        "score": 5.0,
    },
}

WencWu · 2024-11-29T09:41:47Z

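The "Decision: .+" pattern has the same problem: if the text contains "**Decision: 6**", it cannot be parsed correctly either. Would it be better to change the regular expression to "Decision: [\.\d]+"?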
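To illustrate the suspected cause, here is a minimal sketch. It is not the project's actual parser: the helper name `parse_labeled_number`, the fallback value of 5.0, and the assumption that a failed `float()` conversion silently falls back to that default are all guesses based on the behavior reported above.

```python
import re

# Hypothetical helper; the project's real parsing code is not shown in this thread.
def parse_labeled_number(text, label, value_pattern, default=5.0):
    """Find "<label>: <value>" in text and return the value as a float.

    Returns `default` when nothing matches or the captured text is not a
    valid number (assumed fallback behavior).
    """
    match = re.search(rf"{re.escape(label)}: ({value_pattern})", text)
    if match is None:
        return default
    try:
        return float(match.group(1))
    except ValueError:
        return default


cot_tail = "slightly more detailed and engaging.\n\n**Score: 9**\n\nThe generated feedback ..."

# ".+" also captures the trailing markdown "**", so float("9**") fails.
greedy = parse_labeled_number(cot_tail, "Score", r".+")

# "[.\d]+" stops at the first character that is neither a digit nor a dot.
numeric = parse_labeled_number(cot_tail, "Score", r"[.\d]+")

# The same pattern handles the "Decision" case mentioned above.
decision = parse_labeled_number("**Decision: 6**", "Decision", r"[.\d]+")

print(greedy, numeric, decision)
```

Under these assumptions, the greedy pattern reproduces the reported 5.0, while the numeric pattern recovers 9.0. Note that `[.\d]+` would also accept a lone "." or a value such as "1.2.3"; a stricter alternative like `\d+(?:\.\d+)?` might be worth considering.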

Brovmos · 2024-12-02T20:37:00Z

Thank you for reaching to Our Github. Customer service may take some time to respond due to a large number of requests. We believe your case will be resolved as soon as possible. Kindly visit the HERE TO CONTACT OUR LIVE SUPPORT DEV for instant support response
Note: Click on the live chat icon at the bottom corner of the page
