- Using multinomial, temp = 1
{
"exact": 79.58392992504001,
"f1": 84.2160752961748,
"total": 11873,
"HasAns_exact": 75.37112010796221,
"HasAns_f1": 84.64869466792929,
"HasAns_total": 5928,
"NoAns_exact": 83.78469301934399,
"NoAns_f1": 83.78469301934399,
"NoAns_total": 5945
}
- Using argmax, temp = 1
{
"exact": 83.2729722900699,
"f1": 86.67488213657204,
"total": 11873,
"HasAns_exact": 81.08974358974359,
"HasAns_f1": 87.90331909708533,
"HasAns_total": 5928,
"NoAns_exact": 85.44995794785534,
"NoAns_f1": 85.44995794785534,
"NoAns_total": 5945
}
We tested a few model weights, and argmax decoding is consistently superior, which is expected since this is an extractive task (the answer is a span of the context). All generation for extractive QA will therefore be done via argmax.
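For reference, a minimal sketch of the two decoding modes compared above using the Hugging Face `generate` API; the checkpoint path and prompt format are placeholders, not the actual ones used for these runs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/llama-7b-squad"  # placeholder checkpoint path

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Context: ...\nQuestion: ...\nAnswer:"  # placeholder prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Multinomial sampling at temperature 1: next token drawn from the softmax distribution.
sampled = model.generate(**inputs, do_sample=True, temperature=1.0, max_new_tokens=32)

# Argmax (greedy) decoding: highest-probability token taken at each step.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=32)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```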
- int8
{
"exact": 82.70024425166343,
"f1": 86.23343966074528,
"total": 11873,
"HasAns_exact": 80.8029689608637,
"HasAns_f1": 87.8794920870496,
"HasAns_total": 5928,
"NoAns_exact": 84.59209419680404,
"NoAns_f1": 84.59209419680404,
"NoAns_total": 5945
}
- int4 (GPTQ)
{
"exact": 81.31895898256549,
"f1": 85.0672091132973,
"total": 11873,
"HasAns_exact": 79.47031039136303,
"HasAns_f1": 86.97755968322865,
"HasAns_total": 5928,
"NoAns_exact": 83.16232127838519,
"NoAns_f1": 83.16232127838519,
"NoAns_total": 5945
}
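The int8 and int4 results above run the same evaluation with quantized weights. Below is a minimal sketch of how such weights are commonly loaded (bitsandbytes for int8, AutoGPTQ for int4); the paths are placeholders and the exact loading code used for these runs may differ:

```python
from transformers import AutoModelForCausalLM

# int8: weights quantized on the fly at load time (requires bitsandbytes).
model_int8 = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b-squad",  # placeholder path
    load_in_8bit=True,
    device_map="auto",
)

# int4 (GPTQ): weights quantized offline with GPTQ, then loaded via AutoGPTQ.
from auto_gptq import AutoGPTQForCausalLM

model_int4 = AutoGPTQForCausalLM.from_quantized(
    "path/to/llama-7b-squad-gptq-4bit",  # placeholder path
    device="cuda:0",
)
```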
{
"exact": 84.85639686684073,
"f1": 88.12948928646375,
"total": 11873,
"HasAns_exact": 80.41497975708502,
"HasAns_f1": 86.9705509949707,
"HasAns_total": 5928,
"NoAns_exact": 89.28511354079058,
"NoAns_f1": 89.28511354079058,
"NoAns_total": 5945
}
Note: There is still room for improvement in training. The validation interval for saving checkpoints was set too large out of storage concerns, since each saved checkpoint is ~14GB for the 7B model. As a result, only the very first checkpoint was kept, because the validation loss at every later interval was higher.
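The intended behaviour was to keep a checkpoint only when validation loss improves; a small illustrative sketch of that logic (not the actual training code):

```python
import torch

def maybe_save_checkpoint(model, val_loss, best_val_loss, path="checkpoint_best.pt"):
    """Save only when validation loss improves; each saved 7B checkpoint is ~14GB."""
    if val_loss < best_val_loss:
        torch.save(model.state_dict(), path)
        return val_loss
    return best_val_loss
```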
{
"exact": 87.56843257811842,
"f1": 90.14054761949711,
"total": 11873,
"HasAns_exact": 82.86099865047234,
"HasAns_f1": 88.01260490659415,
"HasAns_total": 5928,
"NoAns_exact": 92.26240538267452,
"NoAns_f1": 92.26240538267452,
"NoAns_total": 5945
}
- Default
{
"exact": 84.14048681883264,
"f1": 86.66208107441904,
"total": 11873,
"HasAns_exact": 75.72537112010797,
"HasAns_f1": 80.77579092384913,
"HasAns_total": 5928,
"NoAns_exact": 92.53153910849453,
"NoAns_f1": 92.53153910849453,
"NoAns_total": 5945
}
- int8
{
"exact": 84.02257222269014,
"f1": 86.69351157220103,
"total": 11873,
"HasAns_exact": 76.1808367071525,
"HasAns_f1": 81.53037498258149,
"HasAns_total": 5928,
"NoAns_exact": 91.84188393608073,
"NoAns_f1": 91.84188393608073,
"NoAns_total": 5945
}
Tested with a micro-batch size of 1 and an effective batch size of 128 via gradient accumulation (see the sketch after this list).
- LLaMA 7B with context length of 512
- LoRA: ~20GB
- Full fine-tuning (FSDP): ~80GB (single GPU, 80GB A100); does not work on ~88GB total (4x GPU, 22GB A5000)
- LLaMA 13B with context length of 512
- LoRA: ~40GB
- LLaMA 30B with context length of 512
- LoRA: ~75GB
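A minimal sketch of the gradient-accumulation pattern behind the batch sizes above (micro-batch size 1, 128 accumulation steps); the model, optimizer, and dataloader are placeholders, and the actual LoRA/FSDP training loop has more moving parts:

```python
accumulation_steps = 128  # effective batch size = micro_batch_size (1) * accumulation_steps

def train_epoch(model, optimizer, dataloader):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):  # each batch is a single example (micro-batch size 1)
        loss = model(**batch).loss / accumulation_steps  # scale so gradients match a full 128-example batch
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()  # one optimizer update per 128 micro-batches
            optimizer.zero_grad()
```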
Context length is fixed at 2048 for the measurements below.
Tested with bfloat16:
- 7B: ~13GB
- 13B: ~26GB
- 30B: ~60GB
- 65B: does not fit on a single GPU
Tested with int8:
- 7B: ~6GB
- 13B: ~15GB
- 30B: ~35GB
- 65B: ~69GB
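A minimal sketch of how these peak-memory numbers can be reproduced (a single forward pass over a 2048-token context); the checkpoint path is a placeholder and the original measurement script may differ:

```python
import torch
from transformers import AutoModelForCausalLM

def peak_inference_memory_gb(model_path, load_in_8bit=False, context_length=2048):
    """Load the model, run one forward pass over a full context, and report peak GPU memory in GB."""
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=None if load_in_8bit else torch.bfloat16,
        load_in_8bit=load_in_8bit,
        device_map="auto",
    )
    input_ids = torch.randint(0, model.config.vocab_size, (1, context_length), device=model.device)
    with torch.no_grad():
        model(input_ids)
    return torch.cuda.max_memory_allocated() / 1024**3

# Example usage (placeholder path):
# print(peak_inference_memory_gb("path/to/llama-7b", load_in_8bit=True))
```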