Long prompt with DeepSeek crashing with tensor size mismatch #101

Open
bitbottrap opened this issue Oct 14, 2024 · 11 comments
Comments

@bitbottrap
Using: KTransformers REST API
Model: DeepSeek Coder V2 236B Q8
I changed the following settings in args.py:
max_new_tokens to 16384
max_response_tokens to 16384
cache_q4 to False

I was attempting to increase the size of responses from the REST API for coding purposes. I also changed the cache quantization while I was at it; I don't think it makes much of a performance difference for me, and I have system RAM available.

(I have not tracked down the difference between, or the need for, both max_new_tokens and max_response_tokens; from their descriptions it sounds like they do the same thing.)
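
For reference, the edits amounted to something like the following in args.py. This is only a sketch: the names are the ones I changed, but the surrounding structure of the real file may differ.

# Sketch of the values I changed in ktransformers/server/backend/args.py
# (illustrative only; the actual layout of the file may differ).
max_new_tokens = 16384       # raised from the default
max_response_tokens = 16384  # raised from the default
cache_q4 = False             # disable 4-bit quantization of the KV cache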

What I did:
KTransformers successfully completed a significant prompt (the first prompt provided to the server) and produced a lengthy, complete response. When I attempted a follow-up prompt (the second prompt provided to the server), I got the following messages in my logs, and the KTransformers server became unresponsive:

Oct 14 11:38:20 server ktransformers[119769]: INFO:     10.0.100.108:45988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Oct 14 11:38:20 server ktransformers[119769]: 2024-10-14 11:38:20,453 DEBUG /home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/transformers.py[185]: get input ids of shape torch.Size([1, 4119])
Oct 14 11:38:20 server ktransformers[119769]: 2024-10-14 11:38:20,453 DEBUG /home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/transformers.py[240]: input_ids: torch.Size([1, 4119])
Oct 14 11:38:20 server ktransformers[119769]: 2024-10-14 11:38:20,454 DEBUG /home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/transformers.py[262]: cache position: 0 to 4119
Oct 14 11:38:20 server ktransformers[119769]: ERROR:    Exception in ASGI application
Oct 14 11:38:20 server ktransformers[119769]: Traceback (most recent call last):
Oct 14 11:38:20 server ktransformers[119769]:   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 259, in __call__
Oct 14 11:38:20 server ktransformers[119769]:     await wrap(partial(self.listen_for_disconnect, receive))
Oct 14 11:38:20 server ktransformers[119769]:   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 255, in wrap
Oct 14 11:38:20 server ktransformers[119769]:     await func()
Oct 14 11:38:20 server ktransformers[119769]:   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 232, in listen_for_disconnect
Oct 14 11:38:20 server ktransformers[119769]:     message = await receive()
Oct 14 11:38:20 server ktransformers[119769]:               ^^^^^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 534, in receive
Oct 14 11:38:20 server ktransformers[119769]:     await self.message_event.wait()
Oct 14 11:38:20 server ktransformers[119769]:   File "/usr/lib/python3.12/asyncio/locks.py", line 212, in wait
Oct 14 11:38:20 server ktransformers[119769]:     await fut
Oct 14 11:38:20 server ktransformers[119769]: asyncio.exceptions.CancelledError: Cancelled by cancel scope 7df167feed50
Oct 14 11:38:20 server ktransformers[119769]: During handling of the above exception, another exception occurred:
Oct 14 11:38:20 server ktransformers[119769]:   + Exception Group Traceback (most recent call last):
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
Oct 14 11:38:20 server ktransformers[119769]:   |     result = await app(  # type: ignore[func-returns-value]
Oct 14 11:38:20 server ktransformers[119769]:   |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
Oct 14 11:38:20 server ktransformers[119769]:   |     return await self.app(scope, receive, send)
Oct 14 11:38:20 server ktransformers[119769]:   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
Oct 14 11:38:20 server ktransformers[119769]:   |     await super().__call__(scope, receive, send)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
Oct 14 11:38:20 server ktransformers[119769]:   |     await self.middleware_stack(scope, receive, send)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
Oct 14 11:38:20 server ktransformers[119769]:   |     raise exc
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
Oct 14 11:38:20 server ktransformers[119769]:   |     await self.app(scope, receive, _send)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in __call__
Oct 14 11:38:20 server ktransformers[119769]:   |     await self.app(scope, receive, send)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
Oct 14 11:38:20 server ktransformers[119769]:   |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
Oct 14 11:38:20 server ktransformers[119769]:   |     raise exc
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
Oct 14 11:38:20 server ktransformers[119769]:   |     await app(scope, receive, sender)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/routing.py", line 715, in __call__
Oct 14 11:38:20 server ktransformers[119769]:   |     await self.middleware_stack(scope, receive, send)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
Oct 14 11:38:20 server ktransformers[119769]:   |     await route.handle(scope, receive, send)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
Oct 14 11:38:20 server ktransformers[119769]:   |     await self.app(scope, receive, send)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
Oct 14 11:38:20 server ktransformers[119769]:   |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
Oct 14 11:38:20 server ktransformers[119769]:   |     raise exc
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
Oct 14 11:38:20 server ktransformers[119769]:   |     await app(scope, receive, sender)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/routing.py", line 74, in app
Oct 14 11:38:20 server ktransformers[119769]:   |     await response(scope, receive, send)
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 252, in __call__
Oct 14 11:38:20 server ktransformers[119769]:   |     async with anyio.create_task_group() as task_group:
Oct 14 11:38:20 server ktransformers[119769]:   |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 736, in __aexit__
Oct 14 11:38:20 server ktransformers[119769]:   |     raise BaseExceptionGroup(
Oct 14 11:38:20 server ktransformers[119769]:   | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
Oct 14 11:38:20 server ktransformers[119769]:   +-+---------------- 1 ----------------
Oct 14 11:38:20 server ktransformers[119769]:     | Traceback (most recent call last):
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 255, in wrap
Oct 14 11:38:20 server ktransformers[119769]:     |     await func()
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 244, in stream_response
Oct 14 11:38:20 server ktransformers[119769]:     |     async for chunk in self.body_iterator:
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 80, in check_client_link
Oct 14 11:38:20 server ktransformers[119769]:     |     async for event in async_events:
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 93, in to_stream_reply
Oct 14 11:38:20 server ktransformers[119769]:     |     async for event in async_events:
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 87, in add_done
Oct 14 11:38:20 server ktransformers[119769]:     |     async for event in async_events:
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 107, in filter_chat_chunk
Oct 14 11:38:20 server ktransformers[119769]:     |     async for event in async_events:
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/api/openai/endpoints/chat.py", line 26, in inner
Oct 14 11:38:20 server ktransformers[119769]:     |     async for token in interface.inference(input_message,id):
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 323, in inference
Oct 14 11:38:20 server ktransformers[119769]:     |     for t in self.prefill(input_ids,self.check_is_new(thread_id)):
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
Oct 14 11:38:20 server ktransformers[119769]:     |     response = gen.send(None)
Oct 14 11:38:20 server ktransformers[119769]:     |                ^^^^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 272, in prefill
Oct 14 11:38:20 server ktransformers[119769]:     |     logits = self.model(
Oct 14 11:38:20 server ktransformers[119769]:     |              ^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
Oct 14 11:38:20 server ktransformers[119769]:     |     return self._call_impl(*args, **kwargs)
Oct 14 11:38:20 server ktransformers[119769]:     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
Oct 14 11:38:20 server ktransformers[119769]:     |     return forward_call(*args, **kwargs)
Oct 14 11:38:20 server ktransformers[119769]:     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/models/modeling_deepseek.py", line 1731, in forward
Oct 14 11:38:20 server ktransformers[119769]:     |     outputs = self.model(
Oct 14 11:38:20 server ktransformers[119769]:     |               ^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
Oct 14 11:38:20 server ktransformers[119769]:     |     return self._call_impl(*args, **kwargs)
Oct 14 11:38:20 server ktransformers[119769]:     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
Oct 14 11:38:20 server ktransformers[119769]:     |     return forward_call(*args, **kwargs)
Oct 14 11:38:20 server ktransformers[119769]:     |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/operators/models.py", line 651, in forward
Oct 14 11:38:20 server ktransformers[119769]:     |     causal_mask = self._update_causal_mask(
Oct 14 11:38:20 server ktransformers[119769]:     |                   ^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 14 11:38:20 server ktransformers[119769]:     |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/models/modeling_deepseek.py", line 1624, in _update_causal_mask
Oct 14 11:38:20 server ktransformers[119769]:     |     padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
Oct 14 11:38:20 server ktransformers[119769]:     |                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Oct 14 11:38:20 server ktransformers[119769]:     | RuntimeError: The size of tensor a (4096) must match the size of tensor b (4119) at non-singleton dimension 3
Oct 14 11:38:20 server ktransformers[119769]:     +------------------------------------
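
For what it's worth, the failing line adds a causal mask that appears to be preallocated to the cache length (4096) to an attention_mask built from the actual prompt (4119 tokens). Below is a minimal standalone sketch of the same broadcast, using shapes taken from the log above; this is not the ktransformers code, just an illustration of the mismatch.

import torch

# Hypothetical shapes from the log: cache preallocated to 4096 entries,
# prompt of 4119 tokens.
cache_len, prompt_len = 4096, 4119
causal_mask = torch.zeros(1, 1, prompt_len, cache_len)
attention_mask = torch.ones(1, prompt_len)
mask_length = attention_mask.shape[-1]
# Same expression as modeling_deepseek.py:1624; slicing :4119 on a
# 4096-wide dimension still yields 4096 columns, so the add fails with
# "The size of tensor a (4096) must match the size of tensor b (4119)".
padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]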
@bitbottrap
Author

bitbottrap commented Oct 14, 2024

I've reverted the changes one by one, and the crash happens every time. It's being fed a long prompt, but I don't think that should matter. Looks like a bug to me; it works on other engines.

@bitbottrap changed the title from "Crash after a long response" to "Long prompt with DeepSeek crashing with tensor size mismatch" on Oct 14, 2024
@qiyuxinlin
Contributor

It seems this is caused by a mismatch between the length of the attention_mask passed in and the input tokens. Can you share the command you use to start the program? Let me restate the problem you encountered: after calling the DeepSeek model through the REST API, the program reports an error on the second round of dialogue?

@bitbottrap
Author

The first time it crashed was on the second prompt. The first prompt was simple and had a relatively short response. The second prompt was long, and it crashed immediately when the request was made; no work was done.

I have found that the long prompt will crash KTransformers even if it is the first prompt after starting KTransformers. I am using the latest Continue.dev VSCode extension.

I have made no code changes and followed the build instructions precisely. The command line I'm starting KTransformers with is:
ktransformers.env/bin/ktransformers --port 8090 --cpu_infer 60 --model_path deepseek-ai/DeepSeek-Coder-V2-Instruct --gguf_path /mnt/models/deepseek-coder-v2:236b-instruct-q8_0

This is an 8-bit quantized GGUF model pulled from the Ollama library.

I will try to recreate the prompt again. It appears I accidentally deleted it when I periodically clean up my prompts.

@bitbottrap
Author

I had been using the quantized model available from Ollama. I have now downloaded the DeepSeek model and quantized it myself to q8_0 using llama.cpp code current as of today. I have not seen this error (yet), but I have opened another issue with a different error. I think there is probably some low-hanging fruit in debugging increased output token counts.

@qiyuxinlin
Contributor

We have located the problem. We did not expose the size of the KV cache as a configurable option; the default is 4096, so conversations longer than 4096 tokens will report an error. We are consolidating the configuration files and will fix this in a later version. If you want to work around the problem in the current version, you can modify cache_lens in /site-packages/ktransformers/ktransformers/server/backend/args.py.
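
As a rough rule of thumb (a sketch, not the exact check in the code), cache_lens has to cover every token of the conversation so far plus the tokens still to be generated:

# Illustrative sizing rule for cache_lens (assumed, not taken from the code).
prompt_tokens = 4119            # prompt length from the log above
planned_new_tokens = 16384      # max_new_tokens as configured in this issue
needed = prompt_tokens + planned_new_tokens
print(needed)                   # 20503, far beyond the default cache_lens of 4096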

@arthurv

arthurv commented Oct 28, 2024

We have located the problem. We did not expose the size of the KV cache as a configurable option; the default is 4096, so conversations longer than 4096 tokens will report an error. We are consolidating the configuration files and will fix this in a later version. If you want to work around the problem in the current version, you can modify cache_lens in /site-packages/ktransformers/ktransformers/server/backend/args.py.

I've managed to change the KV cache to 8192; however, changing it to 16384 results in CUDA out-of-memory errors: it tried to allocate on GPU 0 while there was free space on the other GPUs. Unifying the configuration would help, as it's a bit confusing whether changing the GPU split option or editing the YAML files would solve the issue.

@bitbottrap
Author

I've also managed to change cache_lens, max_new_tokens, and max_response_tokens to 8192 and achieved longer output without error. (This might be a better discussion for the other bug, though.) However, as arthurv reports above, my attempt at 16384 yields a CUDA out-of-memory error. I had to resort to the multi-GPU YAML optimization file (DeepSeek-V2-Chat-multi-gpu.yaml) for the 8192 configuration to complete. I have two 24 GB GPUs, and the memory usage is interesting.

At one of the stages prior to text generation memory usage spikes:
| 0 N/A N/A 24567 C ...sv-ai/ktransformers.env/bin/python3 19140MiB |
| 1 N/A N/A 24567 C ...sv-ai/ktransformers.env/bin/python3 20530MiB |

And when text is being generated memory utilization drops:
| 0 N/A N/A 24567 C ...sv-ai/ktransformers.env/bin/python3 5232MiB |
| 1 N/A N/A 24567 C ...sv-ai/ktransformers.env/bin/python3 8518MiB |

When these settings are changed there is definitely a significant temporary increase in memory requirements. If that could be reduced or eliminated, we would probably be good to go. There does, however, appear to be a significant performance impact for the 8192/multi-GPU configuration (around 50% slower).

@qiyuxinlin
Contributor

We used DeepSeek-V2-Chat-multi-gpu.yaml to test the 16k case in both local_chat and the server, and the phenomenon you mentioned did not occur. Moreover, our program allocates VRAM at startup, so GPU memory usage should not fluctuate significantly. Can you provide the YAML file and run command you used?

@arthurv

arthurv commented Oct 30, 2024

Not OP but here's my experience:
ktransformers commit id a6a1cc0

Changed: ktransformers/server/backend/args.py
-> changed max_new_tokens to 16384
-> changed cache_lens to 16384

Starting command:
ktransformers --model_path deepseek-ai/DeepSeek-V2.5 --gguf_path DeepSeek-V2.5-Q6 --port 10002 --optimize_config_path ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu-4.yaml

I start chatting through the API: I enter a prompt with 39 tokens and it generates 804 tokens as a reply. I then enter 804 more tokens as input, and it crashes with torch.OutOfMemoryError: CUDA out of memory.

Nvidia-smi gives this:

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     17382      C   ...nda3/envs/ktransformers2/bin/python    19120MiB |
|    1   N/A  N/A     17382      C   ...nda3/envs/ktransformers2/bin/python     3600MiB |
|    2   N/A  N/A     17382      C   ...nda3/envs/ktransformers2/bin/python     3600MiB |
|    3   N/A  N/A     17382      C   ...nda3/envs/ktransformers2/bin/python     4600MiB |
+---------------------------------------------------------------------------------------+

I have 4 GPUs, and you can see that it is mostly using only GPU 0. It crashed trying to allocate 6.23 GB on GPU 0 when only 5 GB was available. Is there a way to redistribute the memory use better?

@arthurv

arthurv commented Oct 30, 2024

Adding to the info above: I reverted max_new_tokens and cache_lens to 8192 and launched ktransformers again with the same command.

Submitted an 804-token prompt and it was OK. Memory usage was:

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     19283      C   ...nda3/envs/ktransformers2/bin/python    17258MiB |
|    1   N/A  N/A     19283      C   ...nda3/envs/ktransformers2/bin/python    17724MiB |
|    2   N/A  N/A     19283      C   ...nda3/envs/ktransformers2/bin/python     2932MiB |
|    3   N/A  N/A     19283      C   ...nda3/envs/ktransformers2/bin/python     3932MiB |
+---------------------------------------------------------------------------------------+

@bitbottrap
Author

bitbottrap commented Oct 30, 2024

The multi-GPU memory usage I displayed was from the following multi-GPU configuration command line:
ktransformers --port 8090 --cpu_infer 60 --model_path deepseek-ai/DeepSeek-Coder-V2-Instruct --gguf_path /mnt/models/DeepSeek-Coder-V2-Instruct --optimize_config_path /home/user/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu.yaml

Generating 8192 tokens with a single-GPU configuration is not possible.

I went through my logs and believe this is from a single-GPU configuration with DeepSeek V2 modified for increased output. What other details would you like me to provide?

INFO:     10.0.100.108:33434 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-10-28 15:31:52,048 DEBUG /home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/transformers.py[185]: get input ids of shape torch.Size([1, 2903])
2024-10-28 15:31:52,049 DEBUG /home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/transformers.py[240]: input_ids: torch.Size([1, 2903])
2024-10-28 15:31:52,050 DEBUG /home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/transformers.py[262]: cache position: 0 to 2903
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 259, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 255, in wrap
    await func()
  File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 232, in listen_for_disconnect
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 534, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.12/asyncio/locks.py", line 212, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7b18d3f63800
During handling of the above exception, another exception occurred:
  + Exception Group Traceback (most recent call last):
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
  |     return await self.app(scope, receive, send)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in __call__
  |     await self.app(scope, receive, send)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 252, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 736, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 255, in wrap
    |     await func()
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/starlette/responses.py", line 244, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 80, in check_client_link
    |     async for event in async_events:
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 93, in to_stream_reply
    |     async for event in async_events:
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 87, in add_done
    |     async for event in async_events:
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 107, in filter_chat_chunk
    |     async for event in async_events:
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/api/openai/endpoints/chat.py", line 26, in inner
    |     async for token in interface.inference(input_message,id):
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 323, in inference
    |     for t in self.prefill(input_ids,self.check_is_new(thread_id)):
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
    |     response = gen.send(None)
    |                ^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 272, in prefill
    |     logits = self.model(
    |              ^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/models/modeling_deepseek.py", line 1731, in forward
    |     outputs = self.model(
    |               ^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/operators/models.py", line 719, in forward
    |     layer_outputs = decoder_layer(
    |                     ^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/models/modeling_deepseek.py", line 1238, in forward
    |     hidden_states, self_attn_weights, present_key_value = self.self_attn(
    |                                                           ^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/operators/attention.py", line 200, in forward
    |     cur_output, _, _ = self.forward_chunck(
    |                        ^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/ktransformers/operators/attention.py", line 128, in forward_chunck
    |     attn_weights = nn.functional.softmax(
    |                    ^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/user/.env/server/ktransformers.env/lib/python3.12/site-packages/torch/nn/functional.py", line 1890, in softmax
    |     ret = input.softmax(dim, dtype=dtype)
    |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.81 GiB. GPU 0 has a total capacity of 23.58 GiB of which 1.27 GiB is free. Including non-PyTorch memory, this process has 22.29 GiB memory >
    +------------------------------------
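
For a rough sense of where that 7.81 GiB figure might come from (my own back-of-the-envelope estimate with assumed values, not something read from the code): prefill appears to materialize an attention-weight tensor of roughly heads x query-chunk x cache-length in fp32 before the softmax, e.g.:

# Back-of-the-envelope estimate of the softmax allocation above.
# Assumptions for illustration only: 128 attention heads (DeepSeek-V2),
# a ~1000-token prefill chunk, a 16384-entry KV cache, fp32 weights.
heads, chunk, cache_len, bytes_per_el = 128, 1000, 16384, 4
print(heads * chunk * cache_len * bytes_per_el / 2**30)  # ~7.81 GiB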
