
Extend accuracy tests for models that we support #824

Open · wants to merge 9 commits into base: habana_main
11 changes: 11 additions & 0 deletions .jenkins/lm-eval-harness/configs/Mistral-7B-Instruct-v0.3.yaml
@@ -0,0 +1,11 @@
model_name: "/mnt/weka/data/pytorch/mistral/Mistral-7B-Instruct-v0.3"

@AnetaKaczynska I was more interested in the time taken by the test itself; those measurements cover the whole suite, including environment preparation (which can vary for many reasons, and will also vary for the existing suites). The exact time for a suite can be measured from the "============================= test session starts ==============================" line to the test finish.


| suite | model | time1 | time2 |
|---|---|---|---|
| gsm8k_small_g3_tp1_part2 | granite-8b.yaml | 00:01:21 | 00:01:24 |
| gsm8k_small_g3_tp1_part2 | granite-20b.yaml | 00:01:48 | 00:01:51 |
| gsm8k_small_g3_tp1_part3 | Qwen2-7b-Instruct.yaml | 00:01:13 | 00:01:16 |
| gsm8k_small_g3_tp1_part3 | Mistral-7B-Instruct-v0.3.yaml | 00:01:13 | 00:01:16 |
| gsm8k_large_g3_tp2_part2 | Mixtral-8x7B-Instruct-v0.1.yaml | 00:01:39 | 00:06:29 |
| gsm8k_small_g3_tp1_fp8 | granite-8b-fp8.yaml | 00:01:23 | 00:05:42 |
| gsm8k_small_g3_tp1_fp8 | granite-20b-fp8.yaml | 00:01:51 | 00:05:38 |

Ok, for clarity I gathered two values:

  1. test time as reported directly in logs (e.g. ================== 1 passed, 3 warnings in 111.16s (0:01:51) ===================)
  2. time taken between 'test session starts' and 'PASSED MODEL', which for large and fp8 models is usually several minutes longer.
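The gap between the two timings (time2 minus time1) is the per-suite overhead outside the test itself. A minimal sketch computing it, with the durations taken from the table above (the `parse_hms` helper is mine, not part of the suite):

```python
from datetime import timedelta

def parse_hms(s: str) -> timedelta:
    """Parse an 'HH:MM:SS' duration string into a timedelta."""
    h, m, sec = (int(x) for x in s.split(":"))
    return timedelta(hours=h, minutes=m, seconds=sec)

# (suite, config, time1 = pytest-reported test time, time2 = session start -> PASSED)
measurements = [
    ("gsm8k_small_g3_tp1_part2", "granite-8b.yaml", "00:01:21", "00:01:24"),
    ("gsm8k_large_g3_tp2_part2", "Mixtral-8x7B-Instruct-v0.1.yaml", "00:01:39", "00:06:29"),
    ("gsm8k_small_g3_tp1_fp8", "granite-8b-fp8.yaml", "00:01:23", "00:05:42"),
]

for suite, config, t1, t2 in measurements:
    overhead = parse_hms(t2) - parse_hms(t1)
    print(f"{suite:26s} {config:34s} overhead: {overhead}")
```

For the small bf16 suites the overhead is a few seconds, while the large and fp8 suites spend several extra minutes outside the test body, matching the observation above.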

tasks:
- name: "gsm8k_cot"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.4905
  - name: "exact_match,flexible-extract"
    value: 0.5284
limit: 500
num_fewshot: 8
dtype: "bfloat16"
11 changes: 11 additions & 0 deletions .jenkins/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
model_name: "/mnt/weka/data/mlperf_models/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k_cot"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.6967
  - name: "exact_match,flexible-extract"
    value: 0.6952
limit: 250
num_fewshot: 8
dtype: "bfloat16"
11 changes: 11 additions & 0 deletions .jenkins/lm-eval-harness/configs/Qwen2-7b-Instruct.yaml
@@ -0,0 +1,11 @@
model_name: "/mnt/weka/data/pytorch/Qwen/Qwen2-7b-Instruct"
tasks:
- name: "gsm8k_cot"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.6565
  - name: "exact_match,flexible-extract"
    value: 0.7778
limit: 500
num_fewshot: 8
dtype: "bfloat16"
12 changes: 12 additions & 0 deletions .jenkins/lm-eval-harness/configs/granite-20b-fp8.yaml
@@ -0,0 +1,12 @@
model_name: "/mnt/weka/data/pytorch/granite/granite-20b"
tasks:
- name: "gsm8k_cot"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.5291
  - name: "exact_match,flexible-extract"
    value: 0.5564
limit: 500
num_fewshot: 8
dtype: "bfloat16"
fp8: true
11 changes: 11 additions & 0 deletions .jenkins/lm-eval-harness/configs/granite-20b.yaml
@@ -0,0 +1,11 @@
model_name: "/mnt/weka/data/pytorch/granite/granite-20b"
tasks:
- name: "gsm8k_cot"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.5443
  - name: "exact_match,flexible-extract"
    value: 0.5670
limit: 500
num_fewshot: 8
dtype: "bfloat16"
12 changes: 12 additions & 0 deletions .jenkins/lm-eval-harness/configs/granite-8b-fp8.yaml
@@ -0,0 +1,12 @@
model_name: "/mnt/weka/data/pytorch/granite/granite-8b"
tasks:
- name: "gsm8k_cot"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.6376
  - name: "exact_match,flexible-extract"
    value: 0.6497
limit: 500
num_fewshot: 8
dtype: "bfloat16"
fp8: true
11 changes: 11 additions & 0 deletions .jenkins/lm-eval-harness/configs/granite-8b.yaml
@@ -0,0 +1,11 @@
model_name: "/mnt/weka/data/pytorch/granite/granite-8b"
tasks:
- name: "gsm8k_cot"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.6542
  - name: "exact_match,flexible-extract"
    value: 0.6686
limit: 500
num_fewshot: 8
dtype: "bfloat16"
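Each config above pairs expected lm-eval scores with a model path, and the suite presumably checks measured scores against those reference values. A hedged sketch of that comparison (the `check_accuracy` function, the 5% tolerance, and the `measured` dict are illustrative assumptions, not the repo's actual test code; the config values are copied from Qwen2-7b-Instruct.yaml above):

```python
import math

# Parsed form of a config file like the YAML above.
config = {
    "model_name": "/mnt/weka/data/pytorch/Qwen/Qwen2-7b-Instruct",
    "tasks": [
        {
            "name": "gsm8k_cot",
            "metrics": [
                {"name": "exact_match,strict-match", "value": 0.6565},
                {"name": "exact_match,flexible-extract", "value": 0.7778},
            ],
        }
    ],
}

def check_accuracy(config: dict, measured: dict, rel_tol: float = 0.05) -> bool:
    """Return True if every expected metric matches its measured score
    within a relative tolerance (tolerance value is an assumption)."""
    for task in config["tasks"]:
        for metric in task["metrics"]:
            got = measured[(task["name"], metric["name"])]
            if not math.isclose(got, metric["value"], rel_tol=rel_tol):
                return False
    return True

# Hypothetical measured scores from an lm-eval run.
measured = {
    ("gsm8k_cot", "exact_match,strict-match"): 0.6531,
    ("gsm8k_cot", "exact_match,flexible-extract"): 0.7742,
}
print(check_accuracy(config, measured))  # True: both scores within 5%
```

With this shape, adding a model to CI is just dropping a new YAML file with its reference scores and listing it in one of the models-*.txt files.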
3 changes: 3 additions & 0 deletions .jenkins/lm-eval-harness/configs/models-fp8-g3-tp1.txt
@@ -0,0 +1,3 @@
Meta-Llama-3.1-8B-Instruct-fp8.yaml
granite-8b-fp8.yaml
granite-20b-fp8.yaml
1 change: 1 addition & 0 deletions .jenkins/lm-eval-harness/configs/models-large-2.txt
@@ -0,0 +1 @@
Mixtral-8x7B-Instruct-v0.1.yaml
2 changes: 2 additions & 0 deletions .jenkins/lm-eval-harness/configs/models-small-2.txt
@@ -0,0 +1,2 @@
granite-8b.yaml
granite-20b.yaml
2 changes: 2 additions & 0 deletions .jenkins/lm-eval-harness/configs/models-small-3.txt
@@ -0,0 +1,2 @@
Qwen2-7b-Instruct.yaml
Mistral-7B-Instruct-v0.3.yaml
15 changes: 12 additions & 3 deletions .jenkins/test_config.yaml
@@ -2,9 +2,15 @@
 stages:
   - name: test_gsm8k_small_models
     steps:
-      - name: gsm8k_small_g3_tp1
+      - name: gsm8k_small_g3_tp1_part1
         flavor: g3
         command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-small.txt -t 1
+      - name: gsm8k_small_g3_tp1_part2
+        flavor: g3
+        command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-small-2.txt -t 1
+      - name: gsm8k_small_g3_tp1_part3
+        flavor: g3
+        command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-small-3.txt -t 1
       - name: gsm8k_small_g3_tp2
         flavor: g3.s
         command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-small.txt -t 2
@@ -16,17 +22,20 @@ stages:
         command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-small.txt -t 2
   - name: test_gsm8k_large_models
     steps:
-      - name: gsm8k_large_g3_tp2
+      - name: gsm8k_large_g3_tp2_part1
         flavor: g3.s
         command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-large.txt -t 2
+      - name: gsm8k_large_g3_tp2_part2
+        flavor: g3.s
+        command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-large-2.txt -t 2
       - name: gsm8k_large_g2_tp4
         flavor: g2.m
         command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-large.txt -t 4
   - name: test_gsm8k_fp8
     steps:
       - name: gsm8k_small_g3_tp1_fp8
         flavor: g3
-        command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-fp8.txt -t 1
+        command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-fp8-g3-tp1.txt -t 1
       - name: gsm8k_small_g3_tp2_fp8
         flavor: g3.s
         command: cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-fp8.txt -t 2
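Each step in the Jenkins config above boils down to a model-list file and a tensor-parallel size passed to run-tests.sh. A small sketch extracting that mapping (the step data is copied from the diff; the regex-based parsing is illustrative, not what the real runner does):

```python
import re

# One stage from .jenkins/test_config.yaml, as plain Python data.
stages = [
    {
        "name": "test_gsm8k_fp8",
        "steps": [
            {
                "name": "gsm8k_small_g3_tp1_fp8",
                "flavor": "g3",
                "command": "cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-fp8-g3-tp1.txt -t 1",
            },
            {
                "name": "gsm8k_small_g3_tp2_fp8",
                "flavor": "g3.s",
                "command": "cd .jenkins/lm-eval-harness && bash run-tests.sh -c configs/models-fp8.txt -t 2",
            },
        ],
    }
]

for stage in stages:
    for step in stage["steps"]:
        # Pull the -c <model list> and -t <tensor parallel size> arguments.
        m = re.search(r"-c (\S+) -t (\d+)", step["command"])
        config_list, tp = m.group(1), int(m.group(2))
        print(f"{step['name']}: list={config_list}, tensor_parallel={tp}")
```

This is why the PR can redirect only the g3/tp1 fp8 step to the new models-fp8-g3-tp1.txt list while the tp2 step keeps using models-fp8.txt.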