Unable to load Llama-2-13b with max_input_length > context_length (4096 tokens) #348
Closed
Labels: bug
System Info
lorax:latest with docker
Information
Tasks
Reproduction
sudo docker run --gpus all -e DISABLE_SGMV=1 -e ROPE_SCALING=linear -e ROPE_FACTOR=4 --shm-size 1g -p 80:80 ghcr.io/predibase/lorax:latest --model-id NousResearch/Llama-2-13b-hf --dtype float16 --port 80 --num-shard 4 --max-input-length 8000 --max-total-tokens 8002 --max-batch-prefill-tokens 8000
Expected behavior
Here I am trying to use linear RoPE scaling to run Llama-2-13b on input sizes longer than 4096 tokens, but the model fails to come up during warmup. The failure seems related to max_position_embeddings in the Llama config. When I set that value higher, things work as expected as long as max_total_tokens <= max_position_embeddings. If max_total_tokens > max_position_embeddings and max_input_length < max_position_embeddings, the model comes up during warmup but then hangs when a call is made to it.
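For reference, a minimal sketch of the workaround described above, assuming it is acceptable to serve from a locally modified copy of the model config (the path and the 8192 value are illustrative, not from the original report):

```python
from transformers import AutoConfig

# Load the stock Llama-2 config; max_position_embeddings is 4096 here.
config = AutoConfig.from_pretrained("NousResearch/Llama-2-13b-hf")
print(config.max_position_embeddings)

# Raise it so that max_total_tokens <= max_position_embeddings holds,
# then save a local copy and point --model-id at that directory.
config.max_position_embeddings = 8192
config.save_pretrained("./llama-2-13b-hf-8k")
```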