Unable to load Llama-2-13b with max_input_length > context_length (4096 tokens) #348
Closed
Labels: bug
System Info
lorax:latest with docker
Information
Tasks
Reproduction
sudo docker run --gpus all -e DISABLE_SGMV=1 -e ROPE_SCALING=linear -e ROPE_FACTOR=4 --shm-size 1g -p 80:80 ghcr.io/predibase/lorax:latest --model-id NousResearch/Llama-2-13b-hf --dtype float16 --port 80 --num-shard 4 --max-input-length 8000 --max-total-tokens 8002 --max-batch-prefill-tokens 8000
Expected behavior
Here I am trying to use linear RoPE scaling to run Llama-2-13b on input sizes longer than 4096 tokens, but the model fails to come up during warmup. The failure seems related to max_position_embeddings in the Llama config. When I set that value higher, things work as expected as long as max_total_tokens <= max_position_embeddings. If max_total_tokens > max_position_embeddings and max_input_length < max_position_embeddings, the model comes up during warmup but then hangs when a call is made to it.
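For reference, a minimal sketch of the workaround described above, assuming it is acceptable to serve from a locally modified copy of the model config (the path and the 8192 value are illustrative, not from the original report):

```python
from transformers import AutoConfig

# Load the stock Llama-2 config; max_position_embeddings is 4096 here.
config = AutoConfig.from_pretrained("NousResearch/Llama-2-13b-hf")
print(config.max_position_embeddings)

# Raise it so that max_total_tokens <= max_position_embeddings holds,
# then save a local copy and point --model-id at that directory.
config.max_position_embeddings = 8192
config.save_pretrained("./llama-2-13b-hf-8k")
```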