Could you share details about the evaluation of code-completion tasks such as HumanEval and HumanEval+, particularly for the pre-trained models, and which prompts were used?
I was able to infer the prompt used for the post-trained HumanEval evaluation here, but there are no corresponding results in the evals for the pre-trained models here.
I have used both vLLM and HF Transformers to generate outputs greedily and have never been able to reproduce the results stated in the technical report for the pre-trained models. I have also experimented with batch sizes to remove padding, and run inference with padding as well.
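For reference, here is a minimal sketch of the vLLM side of my setup (the model path is a placeholder; I pass the raw HumanEval prompt with no chat template, since this is the pre-trained model):

```python
from vllm import LLM, SamplingParams
from datasets import load_dataset

# HumanEval test split: each record has "task_id", "prompt", "test", "entry_point"
problems = load_dataset("openai_humaneval", split="test")

llm = LLM(model="<pretrained-model-path>")                 # placeholder model path
params = SamplingParams(temperature=0.0, max_tokens=512)   # greedy decoding

# Completion-style prompts: the raw function signature + docstring, no chat template
prompts = [p["prompt"] for p in problems]
outputs = llm.generate(prompts, params)

completions = [o.outputs[0].text for o in outputs]
```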
More details on this evaluation would be much appreciated. Thank you!