diff --git a/3rdparty/lm-evaluation-harness b/3rdparty/lm-evaluation-harness
deleted file mode 160000
index a3520619..00000000
--- a/3rdparty/lm-evaluation-harness
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit a35206191acac1776761e737b66e0d04975d21b9
diff --git a/README.md b/README.md
index 643be899..47a26edc 100644
--- a/README.md
+++ b/README.md
@@ -101,23 +101,20 @@ Performance is expected to be comparable or better than other architectures trai
 ## Evaluations
 
 To run zero-shot evaluations of models (corresponding to Table 3 of the paper), we use the
-[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor)
+[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
 library.
 
-1. Pull the `lm-evaluation-harness` repo by `git submodule update --init
-   --recursive`. We use the `big-refactor` branch.
-2. Install `lm-evaluation-harness`: `pip install -e 3rdparty/lm-evaluation-harness`.
-On Python 3.10 you might need to manually install the latest version of `promptsource`: `pip install git+https://github.com/bigscience-workshop/promptsource.git`.
-3. Run evaluation with (more documentation at the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) repo):
+1. Install `lm-evaluation-harness` by `pip install lm-eval==0.4.2`.
+2. Run evaluation with (more documentation at the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) repo):
 ```
-python evals/lm_harness_eval.py --model mamba --model_args pretrained=state-spaces/mamba-130m --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande --device cuda --batch_size 64
+lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba-130m --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --device cuda --batch_size 256
 python evals/lm_harness_eval.py --model hf --model_args pretrained=EleutherAI/pythia-160m --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande --device cuda --batch_size 64
 ```
 
 To reproduce the results on the `mamba-2.8b-slimpj` model reported in the blogposts:
 ```
-python evals/lm_harness_eval.py --model mamba --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,openbookqa,race,truthfulqa_mc2 --device cuda --batch_size 64
-python evals/lm_harness_eval.py --model mamba --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks mmlu --num_fewshot 5 --device cuda --batch_size 64
+lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,openbookqa,race,truthfulqa_mc2 --device cuda --batch_size 256
+lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks mmlu --num_fewshot 5 --device cuda --batch_size 256
 ```
 
 Note that the result of each task might differ from reported values by 0.1-0.3 due to noise in the evaluation process.
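
For readers who prefer the harness's Python API over the `lm_eval` CLI shown in the new README text, here is a minimal sketch of the 130m zero-shot run. It assumes `lm-eval==0.4.2` (which exposes `lm_eval.simple_evaluate` and registers the `mamba_ssm` model type) and that the `mamba_ssm` and `causal-conv1d` packages are installed; the model name, tasks, device, and batch size simply mirror the CLI flags above.

```python
# Minimal programmatic sketch of the zero-shot evaluation above.
# Assumes lm-eval==0.4.2 (registers the "mamba_ssm" model type) and that the
# mamba_ssm and causal-conv1d packages are installed on a CUDA machine.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="mamba_ssm",                                # --model mamba_ssm
    model_args="pretrained=state-spaces/mamba-130m",  # --model_args pretrained=...
    tasks=[
        "lambada_openai", "hellaswag", "piqa",
        "arc_easy", "arc_challenge", "winogrande", "openbookqa",
    ],
    device="cuda",                                    # --device cuda
    batch_size=256,                                   # --batch_size 256
)

# results["results"] maps each task name to its metrics (acc, acc_norm, ...).
print(json.dumps(results["results"], indent=2, default=str))
```

The five-shot MMLU run for `mamba-2.8b-slimpj` corresponds to passing `tasks=["mmlu"]` and `num_fewshot=5` in the same call.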