I found an example of using Flask to serve API requests. I gave it a try, but when I make concurrent requests, the generated responses come back as garbled text. I suspect this is because two questions are being run through inference at the same time. Is it possible to generate answers concurrently?
There's no support for concurrency, no. You'd need a separate instance for each thread, with its own generator and cache, and some mechanism for sensibly splitting the work between threads, given that the implementation completely occupies the GPU.
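One minimal way to avoid the garbled output, short of running multiple instances, is to funnel every request through a single worker thread that owns the model, so generations never interleave. This is a hedged sketch, not part of the library: `generate` below is a hypothetical stand-in for your actual generator call.

```python
import queue
import threading

def generate(prompt):
    # Hypothetical placeholder for the real single-threaded generator;
    # the actual model call must never run on two threads at once.
    return "response to: " + prompt

request_q = queue.Queue()

def worker():
    # The one thread that owns the model; processes requests strictly in order.
    while True:
        prompt, result_q = request_q.get()
        if prompt is None:  # sentinel for shutdown
            break
        result_q.put(generate(prompt))

threading.Thread(target=worker, daemon=True).start()

def handle_request(prompt):
    # Safe to call from any Flask request thread; blocks until its turn.
    result_q = queue.Queue()
    request_q.put((prompt, result_q))
    return result_q.get()
```

Requests are serialized, so throughput doesn't improve, but concurrent callers each get a clean, uncorrupted response.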
You could possibly have a streaming API that dispatches to multiple generators when there are concurrent requests, but you'd need a lot of VRAM to accommodate that.
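The multi-generator idea above could be sketched as a small dispatcher that round-robins requests across a fixed pool of instances, each guarded by its own lock. Everything here is illustrative: `GeneratorInstance` is a hypothetical wrapper standing in for a real generator plus cache, and each instance would cost a full copy of the model's working VRAM.

```python
import itertools
import threading

class GeneratorInstance:
    # Hypothetical wrapper: in practice this would hold its own
    # generator and cache, each consuming its own VRAM.
    def __init__(self, idx):
        self.idx = idx
        self.lock = threading.Lock()

    def generate(self, prompt):
        with self.lock:  # one request at a time per instance
            return f"[gen{self.idx}] " + prompt

class Dispatcher:
    """Round-robin concurrent requests across a fixed pool of instances."""
    def __init__(self, num_instances):
        self.instances = [GeneratorInstance(i) for i in range(num_instances)]
        self._cycle = itertools.cycle(self.instances)
        self._pick = threading.Lock()  # make instance selection thread-safe

    def generate(self, prompt):
        with self._pick:
            instance = next(self._cycle)
        return instance.generate(prompt)
```

With two instances, two requests can genuinely run in parallel; a third waits on whichever instance's lock it was assigned to.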