[BUG] ExLlamaV2DynamicGenerator class does not support multiple threads #690
Comments
I'm unsure what you're trying to achieve. The generator is extremely stateful and inherently single-threaded. It utilizes batching to run multiple requests concurrently, not threading. If you want threads to be able to start requests at any time, you'll need a single server thread calling the generator and a dispatch mechanism to distribute results back to clients. The specific error you're getting would occur if the job queue is manipulated in the middle of the generation loop.
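For illustration, here is a minimal sketch of the single-server-thread pattern suggested above (not from the exllamav2 codebase; `request_queue`, `generator_worker`, and `submit` are hypothetical names): client threads submit prompts to a queue, one worker thread owns the generator, and results are dispatched back via futures.

```python
import queue
from concurrent.futures import Future
from threading import Thread

request_queue: "queue.Queue[tuple[str, Future]]" = queue.Queue()

def generator_worker(generator):
    # The only thread that ever touches the (stateful) generator.
    while True:
        prompt, future = request_queue.get()
        try:
            # generate() handles the whole request here; overlapping requests
            # would require the job-based API instead of this blocking call.
            output = generator.generate(prompt = prompt, max_new_tokens = 512)
            future.set_result(output)
        except Exception as e:
            future.set_exception(e)

def submit(prompt: str) -> Future:
    # Safe to call from any client thread.
    future: Future = Future()
    request_queue.put((prompt, future))
    return future

# Thread(target = generator_worker, args = (generator,), daemon = True).start()
# print(submit("Hello").result())
```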
I am attempting to generate three responses in parallel by calling the inner_model.generator.generate function. However, I observed that only one thread successfully returns a response, while the others result in errors. I have also created three instances of the ExLlamaV2DynamicGenerator class, as the generate function is being invoked three times.
The access_with_lock function is working as intended. I am not using the generate_response_open_usecase function anyway.
@turboderp The reason I am initializing 3 models with their own separate inference pipelines, rather than going with a batch-inference approach, comes down to a particular need. Each inference has a multi-stage structured-output requirement, and at every stage I supply the same large context as part of the prompt. Since the generator supports paged caching, I assume the inherent mechanism will reuse the previous state and skip most of the redundant processing even though the context is re-sent each time. Only after all the stages complete do I reset the cache explicitly.

The next issue is how to share the structured-output filter (ExLlamaV2TokenEnforcerFilter) across these 3 instances. If I launch the 3 instances as separate Python processes there is no problem, except that I then need 3 instances of the structured-output filter as well, which increases my system RAM requirement. To avoid that, I am building a pub/sub mechanism and launching the 3 inference instances inside a single process, which lets them share the same filter memory.

The issue I see now is that the parallel inferences somehow corrupt each other's output. I have printed the cached tensor memory pointers of each model and they do point to different locations, so I am still not sure what is implicitly being shared across the models that causes this. Could sharing the ExLlamaV2TokenEnforcerFilter itself be the culprit?
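For clarity, here is a hypothetical sketch (not the reporter's actual code) of the multi-stage pattern described above, assuming a single `generator` instance: the long shared context is re-sent at every stage, relying on cache-page reuse for the shared prefix to avoid redundant prefill.

```python
# Hypothetical illustration of the multi-stage flow: every stage re-sends the
# same long context plus a stage-specific instruction. Because the dynamic
# generator reuses cache pages for a shared prompt prefix, the long context
# should only need to be prefilled once rather than once per stage.
long_context = "<large call transcript / context ...>"   # identical across stages
stage_prompts = [
    "Stage 1 instructions ...",
    "Stage 2 instructions ...",
    "Stage 3 instructions ...",
]

stage_outputs = []
for stage_prompt in stage_prompts:
    output = generator.generate(
        prompt = long_context + "\n\n" + stage_prompt,    # shared prefix -> page reuse
        max_new_tokens = 768,
    )
    stage_outputs.append(output)
```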
You are correct that the generator will reuse keys/values from the previous context if the new context shares a prefix with it. However, it remembers more than one context, so if I'm understanding you correctly you'd still be better served by a single instance with a larger cache. Pages are only evicted from the cache when it fills up, and there should be no need to clear it manually.

Now, there still shouldn't be any global state causing the issues you're seeing, if I'm understanding correctly. I.e. two separate instances of model+cache+generator should be able to run across two separate threads (they may launch child threads, but still without any global state). If you had two generators referencing the same model instance from separate threads, that could perhaps cause issues.

The specific exception you're getting, though, happens because one generator started a new job while it was in a generation loop and not expecting any new jobs to start. So I'm not sure how to diagnose this, really. I'm a little curious how you're creating the model, cache and generator instances.

For structured generation, you should be able to reuse parts of the filter across threads, but the filter itself is a stateful object that needs an instance per job.
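A minimal sketch of the batching approach suggested above (not from the original report): one model/cache/generator instance, with several concurrent requests expressed as jobs rather than threads. The result keys shown ("stage", "identifier", "text") follow the dynamic generator's iterate() output; `make_filter()` is a placeholder for constructing a fresh, per-job ExLlamaV2TokenEnforcerFilter from shared read-only tokenizer data.

```python
from exllamav2.generator import ExLlamaV2DynamicJob

prompts = ["prompt A", "prompt B", "prompt C"]
outputs = {i: "" for i in range(len(prompts))}

for i, p in enumerate(prompts):
    job = ExLlamaV2DynamicJob(
        input_ids = tokenizer.encode(p),
        max_new_tokens = 512,
        filters = [make_filter()],     # one stateful filter instance per job
        identifier = i,
    )
    generator.enqueue(job)

# A single loop drives all jobs; the generator batches them internally.
while generator.num_remaining_jobs():
    for r in generator.iterate():
        if r["stage"] == "streaming":
            outputs[r["identifier"]] += r.get("text", "")
```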
OS
Linux
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
2.4.1
Model
No response
Describe the bug
For output generation I use the generate function of the ExLlamaV2DynamicGenerator class. I have implemented it so that this method is called from parallel threads, but when I run it in 3 parallel threads, a response is generated for only one thread and the other threads raise an error.
KeyError: 16
```
2024-11-26 05:57:16,799 Exception on /v1/via_prod_local_llm_in_request [POST]
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 1473, in wsgi_app
response = self.full_dispatch_request()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 882, in full_dispatch_request
rv = self.handle_user_exception(e)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 880, in full_dispatch_request
rv = self.dispatch_request()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 865, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/server_exllama.py", line 106, in run_via_local_llm_in_request
output = mihup_llm_module.run_mihup_llm_inference(call_transcript=call_transcript_str,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/exllama_inference.py", line 143, in run_mihup_llm_inference
outputs = self.models[model_index].response_generator(prompts, filters, use_case_ids,self.universal_filter_map)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/exllama_inference.py", line 53, in response_generator
outputs = self.generator.generate(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/exllamav2/generator/dynamic.py", line 664, in generate
idx = order[r["serial"]]
~~~~~^^^^^^^^^^^^^
KeyError: 16
```
Reproduction steps
I am adding the relevant code snippet below:

```python
from threading import Lock
from typing import Any, List

from exllamav2 import ExLlamaV2, ExLlamaV2Cache_Q4, ExLlamaV2Config, ExLlamaV2Tokenizer


class LockedList:
    def __init__(self, objects: List[Any]):
        self.objects = objects
        self.locks = [Lock() for _ in objects]
        self.last_used_index = -1
        self.access_lock = Lock()  # Global access lock

class ThreadSafeExLlamaV2:
    def __init__(self, model_path: str, max_seq_len: int = 256 * 96):
        self.config = ExLlamaV2Config(model_path)
        self.model = ExLlamaV2(self.config)
        self.cache = ExLlamaV2Cache_Q4(self.model, max_seq_len=max_seq_len, lazy=True)
        self.model.load_autosplit(self.cache)
        self.tokenizer = ExLlamaV2Tokenizer(self.config)

class MihupExLlamaLLM:
    def __init__(self, thread_count: int):
        thread_count = max(1, min(thread_count, 4))
        print(f"Initializing {thread_count} model instances")
```
Expected behavior
Previously, when I ran the server with gunicorn using a varying number of workers, it worked very well; but when I switched to a varying number of threads, it stopped working.
Logs
No response
Additional context
No response