Server Errors with Lorax Option: Cache Block Limitations and Incorrect Argument Count #245
Comments
Hey @seekeramento, can you please provide some more repro steps? Which args did you use to initialize the deployment? Specifically, can you share:
Thanks!
I used Kubernetes to deploy the model, specifying the necessary arguments directly in the args section of my container configuration. This included details like the model ID, quantization method, and token limits. I also set an environment variable for the port (PORT=8080).
image: ghcr.io/predibase/lorax:latest
MODEL_ID: TheBloke/Mistral-7B-v0.1-AWQ
The deployment is running on an AWS g5.4xlarge instance.
I queried the model by sending a JSON payload to the model's serving endpoint. This payload included the instruction and parameters like max_new_tokens to control the length of the output.
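A minimal sketch of that kind of request, assuming the standard lorax /generate endpoint and a placeholder host and port; the prompt text and max_new_tokens value are illustrative only:

```bash
# Hypothetical example request; host, port, prompt, and token limit are placeholders.
curl -s -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": "Summarize the following text: ...",
        "parameters": {"max_new_tokens": 256}
      }'
```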
I am seeing this happen as well, especially when making concurrent requests to the same model. My command to start the docker container:
and then simply making 5 identical concurrent requests leads to "Request failed during generation: Server error: FlashCausalLMBatch.concatenate() takes 2 positional arguments but 4 were given" on 4 of them (and 1 of them succeeds).
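The exact docker command is not preserved above; a hypothetical reconstruction of such a setup and of the five concurrent requests, assuming the lorax image and model mentioned earlier in the thread (all flag values are placeholders, not the commenter's actual settings):

```bash
# Hypothetical reconstruction, not the commenter's actual command.
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/predibase/lorax:latest \
  --model-id TheBloke/Mistral-7B-v0.1-AWQ \
  --quantize awq

# Five identical concurrent requests, as described above.
for i in $(seq 1 5); do
  curl -s -X POST http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 64}}' &
done
wait
```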
Okay, found a fix - this is happening because --max-batch-total-tokens was getting set to a huge value, so instead of going onto the queue, requests were getting processed as part of the same batch (leading to OOM issues). However, I find it weird that this gets surfaced as the error above. Explicitly setting max-batch-total-tokens to a smaller value fixed it. Would recommend reducing that, @seekeramento.
EDIT: Setting max-batch-total-tokens to anything beyond 4096 is leading to the same error. I'm monitoring the GPU memory usage as well, and I don't actually see it going anywhere close to the limit. This might be a bug.
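For reference, the workaround amounts to passing a smaller value explicitly when starting the server; a sketch assuming the lorax-launcher entrypoint, with an illustrative model ID and a cap at 4096 (per the edit above, larger values reintroduced the error):

```bash
# Sketch: explicitly cap the batch token budget (illustrative values).
lorax-launcher \
  --model-id TheBloke/Mistral-7B-v0.1-AWQ \
  --quantize awq \
  --max-batch-total-tokens 4096
```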
Continuing to look into this - with a longer prompt (~3k tokens), the concurrent requests get processed successfully, but with a smaller prompt (~30 tokens), the concurrent requests start erroring out again with the same message.
I think #254 is supposed to fix this ticket. I tried the new docker image but am getting a different error with the longer prompts and max-batch-total-tokens set to a number that leads to multiple elements in a batch:
Hey @seekeramento @DhruvaBansal00, apologies for the bug introduced with the Outlines integration. This issue only arises from stress tests, which we don't run on every commit automatically, so sorry it slipped through the cracks! I've put up #263, which addressed the issues for me when running a stress test locally. I was able to repro the issues observed above, so my hope is that we can close this out once this lands, but please let me know if you see any additional issues after this change.
@tgaddair I just got this error when stress testing with long prompt input (around 1100 tokens, GPU A10G, Open-Orca/Mistral-7B-OpenOrca, ...)
System Info
latest
Reproduction
I've encountered two distinct server errors while using the Lorax option under specific settings. These errors seem to stem from cache block limitations and a mismatch in the expected number of arguments for a function. Below, I provide detailed descriptions of the errors, my settings, and steps to reproduce them. I hope to gain insights into whether these issues arise from my configurations or if they are known problems with potential workarounds or fixes.
Errors Encountered:
Cache Block Limitation Error: "Request failed during generation: Server error: Out of available cache blocks: asked 1024, only 172 free blocks"
Argument Count Mismatch Error: "Request failed during generation: Server error: FlashCausalLMBatch.concatenate() takes 2 positional arguments but 4 were given"
Steps to Reproduce:
Configure the system with the following settings:
--quantize awq
--max-input-length 8096
--max-total-tokens 18000
--max-batch-total-tokens 18000
--max-batch-prefill-tokens 18000
--sharded false
Here are the settings as passed in the container args section (an equivalent single-command form is sketched after the list):
"--quantize", "awq",
"--max-input-length", "8096",
"--max-total-tokens", "18000",
"--max-batch-total-tokens", "18000",
"--max-batch-prefill-tokens", "18000",
"--sharded", "false"
Expected behavior
The system should handle the provided settings without resulting in server errors, allowing content generation to proceed without cache block limitations or argument count mismatches.
Instead, the generation requests fail, resulting in two types of server errors related to cache block limitations and incorrect argument count for the FlashCausalLMBatch.concatenate() function.