HumanEval Evaluation Details #168

Open
phoongkhangzhie opened this issue Oct 4, 2024 · 0 comments

Could you share details on how code completion tasks such as HumanEval and HumanEval+ were evaluated, particularly for the pre-trained models, and which prompts were used?

I was able to infer the prompt used for the post-trained HumanEval evaluation here, but there are no corresponding results in the evals for the pre-trained models here.

I have used both vLLM and HF Transformers to generate outputs greedily and have not been able to reproduce the results stated in the technical report for the pre-trained models. I have also varied the batch size to remove padding, and run inference with padding as well (a rough sketch of my generation setup is included after the results below).

  • Llama-3.1-8B [Reported: 37.2 +/- 7.4, Replicated Results: 23.78]
  • Llama-3.1-70B [Reported: 58.5 +/- 7.5, Replicated Results: 15.18]
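
For reference, this is roughly how I generate the completions. It is a minimal sketch rather than the official harness: the `max_tokens` budget and pass@1 scoring via evalplus are my own choices, and the raw HumanEval prompt (function signature plus docstring) is fed to the base model with no chat template.

```python
# Minimal sketch of my greedy-generation setup (not the official eval harness).
# Assumes the `vllm` and `evalplus` packages and the base (pre-trained) checkpoint.
from vllm import LLM, SamplingParams
from evalplus.data import get_human_eval_plus, write_jsonl

# Greedy decoding: temperature 0, one sample per problem.
sampling = SamplingParams(temperature=0.0, max_tokens=512)
llm = LLM(model="meta-llama/Llama-3.1-8B")  # base model, no chat template applied

problems = get_human_eval_plus()
task_ids = list(problems.keys())
prompts = [problems[tid]["prompt"] for tid in task_ids]  # signature + docstring only

outputs = llm.generate(prompts, sampling)

samples = [
    {"task_id": tid, "completion": out.outputs[0].text}
    for tid, out in zip(task_ids, outputs)
]
write_jsonl("samples.jsonl", samples)
# Scored afterwards with: evalplus.evaluate --dataset humaneval --samples samples.jsonl
```

With this setup I still see the gap reported above, so any pointers on the exact prompt format, stop sequences, or post-processing used for the base models would help.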

More details on this evaluation would be much appreciated. Thank you!
