
Server Errors with Lorax Option: Cache Block Limitations and Incorrect Argument Count #245

Closed
seekeramento opened this issue Feb 14, 2024 · 8 comments · Fixed by #263

Comments

@seekeramento

System Info

latest (ghcr.io/predibase/lorax:latest)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I've encountered two distinct server errors while using LoRAX under specific settings. They seem to stem from cache block limitations and a mismatch in the expected number of arguments for a function. Below I provide detailed descriptions of the errors, my settings, and the steps to reproduce them. I'd like to know whether these issues arise from my configuration or whether they are known problems with workarounds or fixes.

Errors Encountered:
Cache Block Limitation Error: "Request failed during generation: Server error: Out of available cache blocks: asked 1024, only 172 free blocks"
Argument Count Mismatch Error: "Request failed during generation: Server error: FlashCausalLMBatch.concatenate() takes 2 positional arguments but 4 were given"
Steps to Reproduce:
Configure the system with the following settings:
--quantize awq
--max-input-length 8096
--max-total-tokens 18000
--max-batch-total-tokens 18000
--max-batch-prefill-tokens 18000
--sharded false

Here are the same settings as passed in the container args:
"--quantize", "awq",
"--max-input-length", "8096",
"--max-total-tokens", "18000",
"--max-batch-total-tokens", "18000",
"--max-batch-prefill-tokens", "18000",
"--sharded", "false"

Expected behavior

The system should handle the provided settings without resulting in server errors, allowing content generation to proceed without cache block limitations or argument count mismatches.

Instead, the generation requests fail, resulting in two types of server errors related to cache block limitations and incorrect argument count for the FlashCausalLMBatch.concatenate() function.

@magdyksaleh
Collaborator

Hey @seekeramento, can you please provide some more repro steps? Which args did you use to initialize the deployment? Specifically, can you share:

  1. The command you used to start the deployment?
  2. The specs of the computer you are running this on?
  3. How you tried to query the model?

Thanks!

@seekeramento
Author

  1. Command to Start the Deployment:

I used Kubernetes to deploy the model, specifying the necessary arguments directly in the args section of my container configuration. This included details like the model ID, quantization method, and token limits. I also set an environment variable for the port (PORT=8080).

image: ghcr.io/predibase/lorax:latest
container disk: 20G
volume disk: 200G
volume mount path: /data

MODEL_ID: TheBloke/Mistral-7B-v0.1-AWQ
--quantize awq
--max-input-length 8096
--max-total-tokens 18000
--max-batch-total-tokens 18000
--max-batch-prefill-tokens 18000
--sharded false

  2. Specs of the Computer Running This:

The deployment is running on an AWS g5.4xlarge instance.

  3. How I Tried to Query the Model:

I queried the model by sending a JSON payload to the model's serving endpoint. This payload included the instruction and parameters like max_new_tokens to control the output's length.
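
A minimal sketch of that kind of request, assuming the server from the deployment above is reachable on port 8080 and exposes the standard LoRAX /generate REST route (the hostname, prompt, and token limit below are placeholders):

import requests

# Assumed endpoint: the LoRAX server above listening on PORT=8080 and
# exposing the /generate route; "localhost" is a placeholder hostname.
url = "http://localhost:8080/generate"

payload = {
    "inputs": "### Instruction: summarize the text below ...",  # placeholder instruction
    "parameters": {
        "max_new_tokens": 512,  # placeholder limit on output length
    },
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
# The response body is expected to carry the output under "generated_text".
print(response.json()["generated_text"])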

@DhruvaBansal00
Contributor

I am seeing this happen as well, especially when making concurrent requests to the same model. My command to start the docker container:

docker run --gpus all --shm-size 1g -p 80:80 -v /data/trained-checkpoint:/trained-checkpoint ghcr.io/predibase/lorax:latest --model-id /trained-checkpoint --dtype float16 --port 80 --num-shard 4 --max-input-length 3500 --max-total-tokens 7000

and then simply making 5 identical concurrent requests leads to "Request failed during generation: Server error: FlashCausalLMBatch.concatenate() takes 2 positional arguments but 4 were given" on 4 of them (and 1 of them succeeds).
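
For reference, a minimal sketch of that concurrency pattern, assuming the container started above is serving on port 80 and exposes the standard /generate route (the prompt and parameters are placeholders):

import concurrent.futures

import requests

URL = "http://localhost:80/generate"
PAYLOAD = {
    "inputs": "Hello, world",  # placeholder prompt
    "parameters": {"max_new_tokens": 64},
}

def query(_):
    # Issue one request; return the status and a snippet of the body so failures are visible.
    try:
        r = requests.post(URL, json=PAYLOAD, timeout=300)
        return r.status_code, r.text[:200]
    except Exception as exc:
        return "error", str(exc)

# Fire 5 identical requests concurrently, mirroring the repro described above.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    for status, body in pool.map(query, range(5)):
        print(status, body)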

@DhruvaBansal00
Contributor

DhruvaBansal00 commented Feb 19, 2024

Okay, found a fix. This is happening because --max-batch-total-tokens was getting set to a huge value, so instead of going onto the queue, requests were getting processed as part of the same batch (leading to OOM issues). However, I find it weird that this gets surfaced as FlashCausalLMBatch.concatenate() takes 2 positional arguments but 4 were given.

Explicitly setting max-batch-total-tokens to a smaller value fixed it. Would recommend reducing that @seekeramento

EDIT: Setting max-batch-total-tokens to anything beyond 4096 leads to the same error. I am monitoring GPU memory usage as well and don't actually see it going anywhere close to the limit. This might be a bug.

@DhruvaBansal00
Contributor

Continuing to look into this:

  • With a longer prompt (~3k tokens), the concurrent requests get processed successfully with max-batch-total-tokens set to 4096. Note that with these settings, there is only 1 element in each batch.

  • With a smaller prompt (~30 tokens), the concurrent requests start erroring out again with the message FlashCausalLMBatch.concatenate() takes 2 positional arguments but 4 were given. With this input, the batch contains all the concurrent requests.

@DhruvaBansal00
Contributor

I think #254 is supposed to fix this ticket. I tried the new docker image but am getting a different error with the longer prompts and max-batch-total-tokens set to a number that leads to multiple elements in a batch:

2024-02-20T00:11:45.942929Z ERROR lorax_launcher: interceptor.py:41 Method Decode encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 299, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 117, in Decode
    batch = self.model.batch_type.concatenate(batches)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 618, in concatenate
    next_token_chooser.schema_processor = HeterogeneousSchemaLogitsProcessor.concatenate(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/logits_process.py", line 467, in concatenate
    ret.sequence_processors.extend(p.sequence_processors)
AttributeError: 'NoneType' object has no attribute 'sequence_processors'
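
For context, the traceback shows the concatenation path assuming every batch carries a schema processor, so a batch whose schema_processor is None breaks the .extend call. Below is only an illustrative sketch of that failure mode and of a None-guard, using simplified stand-in names; it is not the actual LoRAX code or the #263 patch:

from typing import List, Optional

class SchemaProcessor:
    # Simplified stand-in for a per-batch heterogeneous schema logits processor.
    def __init__(self, sequence_processors: List[str]):
        self.sequence_processors = sequence_processors

def concatenate(processors: List[Optional[SchemaProcessor]]) -> Optional[SchemaProcessor]:
    # A naive merge that reads p.sequence_processors unconditionally raises
    # AttributeError: 'NoneType' object has no attribute 'sequence_processors'
    # whenever one of the batches has no schema-constrained requests.
    merged: List[str] = []
    for p in processors:
        if p is None:
            continue  # guard: this batch had no schema processor
        merged.extend(p.sequence_processors)
    return SchemaProcessor(merged) if merged else None

# Mixing a schema-constrained batch with a plain one no longer raises.
result = concatenate([SchemaProcessor(["json_schema"]), None])
print(result.sequence_processors if result else None)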

@tgaddair
Contributor

Hey @seekeramento @DhruvaBansal00, apologies for the bug introduced with the Outlines integration. This issue only arises under stress tests, which we don't run automatically on every commit, so sorry it slipped through the cracks!

I've put up #263 which addressed the issues for me when running a stress test locally. I was able to repro the issues observed above, so my hope is that we can close this out once this lands, but please let me know if you see any additional issues after this change.

@prd-tuong-nguyen

@tgaddair I just got this error when stress testing with long prompt inputs (around 1100 tokens, GPU A10G, Open-Orca/Mistral-7B-OpenOrca, --max-batch-prefill-tokens 16384):
{"timestamp":"2024-06-14T06:36:26.429192Z","level":"ERROR","message":"Server error: Out of available cache blocks: asked 1024, only 768 free blocks","target":"lorax_client","filename":"router/client/src/lib.rs","line_number":34,"span":{"id":2497,"size":4,"name":"prefill"},"spans":[{"batch_size":4,"name":"batch"},{"name":"prefill"},{"id":2497,"size":4,"name":"prefill"},{"id":2497,"size":4,"name":"prefill"}]}
Can you help me check?
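
One rough way to read that message, assuming the paged KV cache uses fixed-size blocks of 16 tokens (an assumption, not something stated in this thread), is to convert the block counts into token budgets:

BLOCK_SIZE = 16  # assumed tokens per KV-cache block; not confirmed in this thread

asked_blocks, free_blocks = 1024, 768  # values from the error message above
print("tokens requested ~", asked_blocks * BLOCK_SIZE)  # ~16384
print("tokens free      ~", free_blocks * BLOCK_SIZE)   # ~12288

Under that assumption, the prefill for the batch of 4 long prompts asked for roughly 16K tokens of KV cache while only about 12K tokens' worth of blocks were free, which would suggest the combined batch, rather than any single request, exceeded the available cache.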
