Multiple workers do not share memory, which causes a full model reload for each message generation. #29
Comments
Hey @ZachZimm, thanks a lot! This makes a lot of sense. I thought having 2 workers would allow all users to have parallel calls enabled by default, but it seems I didn't account for this issue. I will make workers default to 1 and investigate how to have all workers talk to each other, or have the master node handle routing.
I'd like to implement a memory sharing solution using …
Hey @ZachZimm, thanks for the patience, I was unavailable this last week. That would be fantastic, no problem!
I left some comments.
So I tried implementing shared memory with …
In the current workers implementation, each worker creates its own ModelProvider, so workers do not share information about which models have been loaded into memory. This may be the cause of #26, although that report may instead just reflect the fact that MLX models grow their memory usage as the context grows (unlike GGUF).
To reproduce the issue (note: because uvicorn workers are separate processes, this becomes more likely as the number of workers increases), run from the fastmlx directory:

```shell
uvicorn fastmlx:app --workers 2
```
Suggested fix: change the default number of workers to 1 until a better approach to CPU parallelism is implemented.
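The suggested default could be wired into a CLI wrapper along these lines (a hypothetical sketch, not the project's actual entry point): a single worker unless the user explicitly opts into more.

```python
# Hypothetical CLI wrapper: default to one uvicorn worker so models are
# only ever loaded once per server.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Run the fastmlx server")
    parser.add_argument(
        "--workers",
        type=int,
        default=1,  # >1 duplicates model memory in every worker process
        help="number of uvicorn workers (each loads its own copy of models)",
    )
    return parser


args = build_parser().parse_args([])  # no flags -> the safe default
```

The parsed value would then be passed through to uvicorn, e.g. `uvicorn.run("fastmlx:app", workers=args.workers)`.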