Server Errors with Lorax Option: Cache Block Limitations and Incorrect Argument Count #245
Comments
Hey @seekeramento, can you please provide some more repro steps? Which args did you use to initialize the deployment? Specifically, can you share:
Thanks!
I used Kubernetes to deploy the model, specifying the necessary arguments directly in the args section of my container configuration. This included details like the model ID, quantization method, and token limits. I also set an environment variable for the port (PORT=8080).
image: ghcr.io/predibase/lorax:latest
MODEL_ID: TheBloke/Mistral-7B-v0.1-AWQ
The deployment is running on an AWS g5.4xlarge instance.
I queried the model by sending a JSON payload to the model's serving endpoint. This payload included the instruction and parameters like max_new_tokens to control the length of the output.
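A minimal sketch of that kind of request, assuming the standard lorax /generate endpoint and a placeholder host and port; the prompt text and max_new_tokens value are illustrative only:

```bash
# Hypothetical example request; host, port, prompt, and token limit are placeholders.
curl -s -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": "Summarize the following text: ...",
        "parameters": {"max_new_tokens": 256}
      }'
```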
I am seeing this happen as well, especially when making concurrent requests to the same model. My command to start the docker container:
and then simply making 5 identical concurrent requests leads to "Request failed during generation: Server error: FlashCausalLMBatch.concatenate() takes 2 positional arguments but 4 were given" on 4 of them (and 1 of them succeeds).
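The exact docker command is not preserved above; a hypothetical reconstruction of such a setup and of the five concurrent requests, assuming the lorax image and model mentioned earlier in the thread (all flag values are placeholders, not the commenter's actual settings):

```bash
# Hypothetical reconstruction, not the commenter's actual command.
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/predibase/lorax:latest \
  --model-id TheBloke/Mistral-7B-v0.1-AWQ \
  --quantize awq

# Five identical concurrent requests, as described above.
for i in $(seq 1 5); do
  curl -s -X POST http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 64}}' &
done
wait
```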
Okay, found a fix - this is happening because --max-batch-total-tokens was getting set to a huge value, so instead of going onto the queue, requests were getting processed as part of the same batch (leading to OOM issues). However, I find it weird that this gets surfaced as the error above. Explicitly setting max-batch-total-tokens to a smaller value fixed it. Would recommend reducing that, @seekeramento.
EDIT: Setting max-batch-total-tokens to anything beyond 4096 is leading to the same error. I'm monitoring the GPU memory usage as well, and I don't actually see it going anywhere close to the limit. This might be a bug.
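For reference, the workaround amounts to passing a smaller value explicitly when starting the server; a sketch assuming the lorax-launcher entrypoint, with an illustrative model ID and a cap at 4096 (per the edit above, larger values reintroduced the error):

```bash
# Sketch: explicitly cap the batch token budget (illustrative values).
lorax-launcher \
  --model-id TheBloke/Mistral-7B-v0.1-AWQ \
  --quantize awq \
  --max-batch-total-tokens 4096
```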
Continuing to look into this - with a longer prompt (~3k tokens), the concurrent requests get processed successfully, but with a smaller prompt (~30 tokens), the concurrent requests start erroring out again with the same message.
I think #254 is supposed to fix this ticket. I tried the new docker image but am getting a different error with the longer prompts and max-batch-total-tokens set to a number that leads to multiple elements in a batch:
Hey @seekeramento @DhruvaBansal00, apologies for the bug introduced with the Outlines integration. This issue only arises from stress tests, which we don't run on every commit automatically, so sorry it slipped through the cracks! I've put up #263, which addressed the issues for me when running a stress test locally. I was able to repro the issues observed above, so my hope is that we can close this out once this lands, but please let me know if you see any additional issues after this change.
@tgaddair I just got this error when stress testing with long prompt input (around 1100 tokens, GPU A10G, Open-Orca/Mistral-7B-OpenOrca, ...)
System Info
latest
Reproduction
I've encountered two distinct server errors while using the Lorax option under specific settings. These errors seem to stem from cache block limitations and a mismatch in the expected number of arguments for a function. Below, I provide detailed descriptions of the errors, my settings, and steps to reproduce them. I hope to gain insights into whether these issues arise from my configurations or if they are known problems with potential workarounds or fixes.
Errors Encountered:
Cache Block Limitation Error: "Request failed during generation: Server error: Out of available cache blocks: asked 1024, only 172 free blocks"
Argument Count Mismatch Error: "Request failed during generation: Server error: FlashCausalLMBatch.concatenate() takes 2 positional arguments but 4 were given"
Steps to Reproduce:
Configure the system with the following settings:
--quantize awq
--max-input-length 8096
--max-total-tokens 18000
--max-batch-total-tokens 18000
--max-batch-prefill-tokens 18000
--sharded false
Here are the settings as passed in the container args section (an equivalent single-command form is sketched after the list):
"--quantize", "awq",
"--max-input-length", "8096",
"--max-total-tokens", "18000",
"--max-batch-total-tokens", "18000",
"--max-batch-prefill-tokens", "18000",
"--sharded", "false"
Expected behavior
The system should handle the provided settings without resulting in server errors, allowing content generation to proceed without cache block limitations or argument count mismatches.
Instead, the generation requests fail, resulting in two types of server errors related to cache block limitations and incorrect argument count for the FlashCausalLMBatch.concatenate() function.