
Multiple workers do not share memory, which causes a full model reload for each message generation. #29

Open · ZachZimm opened this issue Aug 27, 2024 · 5 comments


@ZachZimm (Contributor)

In the current workers implementation, each worker creates its own ModelProvider, so they do not share information about which models have been loaded into memory. This may be the cause of #26, but the reporter there may simply have been seeing MLX models' memory usage grow with context length (unlike GGUF models).
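For context, a rough sketch of the pattern at play (simplified, with assumed names rather than FastMLX's exact code): each uvicorn worker is a separate OS process, so a module-level cache like this exists once per worker, and a model loaded by one worker is invisible to the others.

```python
# Simplified sketch (assumed names): a per-process model cache.
from mlx_lm import load  # assuming mlx_lm as the model backend

class ModelProvider:
    def __init__(self):
        self.models = {}  # model path -> (model, tokenizer)

    def get_model(self, model_path: str):
        if model_path not in self.models:
            # Expensive: weights are read into *this* process's memory.
            self.models[model_path] = load(model_path)
        return self.models[model_path]

# Instantiated at import time, i.e. once per uvicorn worker process.
model_provider = ModelProvider()
```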

To reproduce the issue:
Note: because uvicorn distributes requests across separate worker processes, the following is more likely to occur the higher the number of workers.

  1. Start the server with uvicorn fastmlx:app --workers 2
  2. Send a response request (streaming or otherwise)
  3. Note the model load time in the server output and/or check memory usage
  4. Send another response request
  5. Notice another 'Model loaded in X seconds' message, as well as doubled memory usage.

Suggested:
Change the default number of workers to 1 until a better approach to CPU parallelism is implemented.
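As a sketch of that change, assuming FastMLX starts the server via uvicorn.run in a CLI entry point (hypothetical code, not the actual implementation):

```python
# Hypothetical sketch: make a single worker the default in the entry point.
import argparse
import uvicorn

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--workers", type=int, default=1,
        help="Worker processes (default 1: workers do not share loaded models)",
    )
    args = parser.parse_args()
    # uvicorn requires an import string, not an app object, when workers > 1.
    uvicorn.run("fastmlx:app", workers=args.workers)

if __name__ == "__main__":
    main()
```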

@ZachZimm ZachZimm changed the title Multiple workers do not share memory, which causes a full model reload for each message. Multiple workers do not share memory, which causes a full model reload for each message generation. Aug 27, 2024
@Blaizzy (Collaborator) commented Sep 4, 2024

Hey @ZachZimm

Thanks a lot!

This makes a lot of sense.

I thought having 2 workers would allow all users to have parallel calls enabled by default.

But it seems I didn't account for this issue.

I will make workers default to 1 and investigate how to have the workers communicate with each other, or have the master process handle routing.

@ZachZimm (Contributor, Author) commented Sep 5, 2024

I'd like to implement a memory-sharing solution using multiprocessing.Manager, essentially swapping out the ModelProvider.models dict for a shared multiprocessing.Manager().dict(). But I would appreciate it if you would merge my previous PR first, as I am not very familiar with git and would like to avoid having two working branches.
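A minimal sketch of that swap (simplified, assumed names): Manager() spawns a server process that owns the shared objects, and manager.dict() returns a proxy visible to all workers. One caveat: values stored in a Manager dict are pickled through the manager process, so they must be serializable.

```python
# Sketch of the proposed swap (simplified, assumed names).
from multiprocessing import Manager

manager = Manager()  # spawns a server process that owns shared objects

class ModelProvider:
    def __init__(self):
        # Proxy to a dict held by the manager process, shared across workers;
        # note that stored values are pickled on the way in and out.
        self.models = manager.dict()
```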

@Blaizzy (Collaborator) commented Sep 10, 2024

hey @ZachZimm

Thanks for your patience; I was unavailable this last week.

That would be fantastic, no problem!

@Blaizzy (Collaborator) commented Sep 10, 2024

I left some comments.

@ZachZimm (Contributor, Author)

So I tried implementing shared memory with multiprocessing.Manager, but I found that mlx_lm's load would hang and eventually complain that something was not serializable. I really don't know what the proper way to share the model in memory is if this approach doesn't work, but maybe building ModelProvider up into a process separate from the FastAPI app (exclusively for model management, and providing a pointer to the model object to the FastAPI workers) would be a workable approach?
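That failure is consistent with how Manager works: stored values are pickled over to the manager's server process, and a loaded MLX model isn't picklable. A pointer can't cross process boundaries directly, but a variant of the separate-process idea that avoids pickling the model is to have the dedicated process own the model and run generation itself, with workers passing only strings over queues. A rough sketch under those assumptions (all names hypothetical):

```python
# Rough sketch (hypothetical names): one process owns the model; FastAPI
# workers exchange only picklable strings with it, never the model itself.
import multiprocessing as mp
from mlx_lm import load, generate  # assuming mlx_lm as the backend

def model_server(requests: mp.Queue, responses: mp.Queue, model_path: str):
    model, tokenizer = load(model_path)  # loaded exactly once, in this process
    while True:
        request_id, prompt = requests.get()
        text = generate(model, tokenizer, prompt=prompt)
        responses.put((request_id, text))

if __name__ == "__main__":
    requests, responses = mp.Queue(), mp.Queue()
    mp.Process(
        target=model_server,
        args=(requests, responses, "mlx-community/Meta-Llama-3-8B-Instruct-4bit"),
        daemon=True,
    ).start()
    # A worker would then do something like:
    requests.put((1, "Hello"))
    print(responses.get())
```

Streaming responses would need per-request channels rather than a single shared response queue, but the overall shape is the same.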
