HumanEval Evaluation Details #168

Open
phoongkhangzhie opened this issue Oct 4, 2024 · 0 comments

Could you share details on how code completion tasks such as HumanEval and HumanEval+ were evaluated, particularly for the pre-trained models, and which prompts were used?

I was able to infer the prompt used for the post-trained HumanEval evaluation here, but there are no corresponding results in the evals for the pre-trained models here.

I have used both vLLM and HF Transformers to generate outputs greedily and have not been able to reproduce the results stated in the technical report for the pre-trained models. I have also varied the batch size to remove padding, and run inference with padding as well (a rough sketch of my generation setup is included after the results below).

  • Llama-3.1-8B [Reported: 37.2 +/- 7.4, Replicated Results: 23.78]
  • Llama-3.1-70B [Reported: 58.5 +/- 7.5, Replicated Results: 15.18]
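
For reference, this is roughly how I generate the completions. It is a minimal sketch rather than the official harness: the `max_tokens` budget and pass@1 scoring via evalplus are my own choices, and the raw HumanEval prompt (function signature plus docstring) is fed to the base model with no chat template.

```python
# Minimal sketch of my greedy-generation setup (not the official eval harness).
# Assumes the `vllm` and `evalplus` packages and the base (pre-trained) checkpoint.
from vllm import LLM, SamplingParams
from evalplus.data import get_human_eval_plus, write_jsonl

# Greedy decoding: temperature 0, one sample per problem.
sampling = SamplingParams(temperature=0.0, max_tokens=512)
llm = LLM(model="meta-llama/Llama-3.1-8B")  # base model, no chat template applied

problems = get_human_eval_plus()
task_ids = list(problems.keys())
prompts = [problems[tid]["prompt"] for tid in task_ids]  # signature + docstring only

outputs = llm.generate(prompts, sampling)

samples = [
    {"task_id": tid, "completion": out.outputs[0].text}
    for tid, out in zip(task_ids, outputs)
]
write_jsonl("samples.jsonl", samples)
# Scored afterwards with: evalplus.evaluate --dataset humaneval --samples samples.jsonl
```

With this setup I still see the gap reported above, so any pointers on the exact prompt format, stop sequences, or post-processing used for the base models would help.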

More details on this evaluation would be much appreciated. Thank you!
