
Multiple workers do not share memory, which causes a full model reload for each message generation. #29

Open · ZachZimm opened this issue Aug 27, 2024 · 5 comments


@ZachZimm (Contributor)

In the current workers implementation, each worker creates its own ModelProvider, so they do not share information about which models have been loaded into memory. This may be the cause of #26, but the reporter there may simply have been seeing MLX models' memory usage grow with context length (unlike GGUF models).
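For context, a rough sketch of the pattern at play (simplified, with assumed names rather than FastMLX's exact code): each uvicorn worker is a separate OS process, so a module-level cache like this exists once per worker, and a model loaded by one worker is invisible to the others.

```python
# Simplified sketch (assumed names): a per-process model cache.
from mlx_lm import load  # assuming mlx_lm as the model backend

class ModelProvider:
    def __init__(self):
        self.models = {}  # model path -> (model, tokenizer)

    def get_model(self, model_path: str):
        if model_path not in self.models:
            # Expensive: weights are read into *this* process's memory.
            self.models[model_path] = load(model_path)
        return self.models[model_path]

# Instantiated at import time, i.e. once per uvicorn worker process.
model_provider = ModelProvider()
```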

To reproduce the issue:
Note: because uvicorn distributes requests across separate worker processes, the following is more likely to occur the higher the number of workers.

  1. Start the server with uvicorn fastmlx:app --workers 2
  2. Send a response request (streaming or otherwise)
  3. Note the model load time in the server output and/or check memory usage
  4. Send another response request
  5. Notice another 'Model loaded in X seconds' message, as well as doubled memory usage.

Suggested:
Change the default number of workers to 1 until a better approach to CPU parallelism is implemented.
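As a sketch of that change, assuming FastMLX starts the server via uvicorn.run in a CLI entry point (hypothetical code, not the actual implementation):

```python
# Hypothetical sketch: make a single worker the default in the entry point.
import argparse
import uvicorn

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--workers", type=int, default=1,
        help="Worker processes (default 1: workers do not share loaded models)",
    )
    args = parser.parse_args()
    # uvicorn requires an import string, not an app object, when workers > 1.
    uvicorn.run("fastmlx:app", workers=args.workers)

if __name__ == "__main__":
    main()
```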

@ZachZimm ZachZimm changed the title Multiple workers do not share memory, which causes a full model reload for each message. Multiple workers do not share memory, which causes a full model reload for each message generation. Aug 27, 2024
@Blaizzy (Collaborator) commented Sep 4, 2024

Hey @ZachZimm

Thanks a lot!

This makes a lot of sense.

I thought having 2 workers would allow all users to have parallel calls enabled by default.

But it seems I didn't account for this issue.

I will make workers default to 1 and investigate how to have the workers communicate with each other, or have the master process handle routing.

@ZachZimm (Contributor, Author) commented Sep 5, 2024

I'd like to implement a memory-sharing solution using multiprocessing.Manager, essentially swapping out the ModelProvider.models dict for a shared multiprocessing.Manager().dict(). But I would appreciate it if you would merge my previous PR first, as I am not very familiar with git and would like to avoid having two working branches.
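A minimal sketch of that swap (simplified, assumed names): Manager() spawns a server process that owns the shared objects, and manager.dict() returns a proxy visible to all workers. One caveat: values stored in a Manager dict are pickled through the manager process, so they must be serializable.

```python
# Sketch of the proposed swap (simplified, assumed names).
from multiprocessing import Manager

manager = Manager()  # spawns a server process that owns shared objects

class ModelProvider:
    def __init__(self):
        # Proxy to a dict held by the manager process, shared across workers;
        # note that stored values are pickled on the way in and out.
        self.models = manager.dict()
```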

@Blaizzy (Collaborator) commented Sep 10, 2024

hey @ZachZimm

Thanks for your patience; I was unavailable this last week.

That would be fantastic, no problem!

@Blaizzy (Collaborator) commented Sep 10, 2024

I left some comments.

@ZachZimm (Contributor, Author)

So I tried implementing shared memory with multiprocessing.Manager, but I found that mlx_lm's load would hang and eventually complain that something was not serializable. I really don't know what the proper way to share the model in memory is if this approach doesn't work, but maybe building ModelProvider up into a process separate from the FastAPI app (exclusively for model management, and providing a pointer to the model object to the FastAPI workers) would be a workable approach?
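That failure is consistent with how Manager works: stored values are pickled over to the manager's server process, and a loaded MLX model isn't picklable. A pointer can't cross process boundaries directly, but a variant of the separate-process idea that avoids pickling the model is to have the dedicated process own the model and run generation itself, with workers passing only strings over queues. A rough sketch under those assumptions (all names hypothetical):

```python
# Rough sketch (hypothetical names): one process owns the model; FastAPI
# workers exchange only picklable strings with it, never the model itself.
import multiprocessing as mp
from mlx_lm import load, generate  # assuming mlx_lm as the backend

def model_server(requests: mp.Queue, responses: mp.Queue, model_path: str):
    model, tokenizer = load(model_path)  # loaded exactly once, in this process
    while True:
        request_id, prompt = requests.get()
        text = generate(model, tokenizer, prompt=prompt)
        responses.put((request_id, text))

if __name__ == "__main__":
    requests, responses = mp.Queue(), mp.Queue()
    mp.Process(
        target=model_server,
        args=(requests, responses, "mlx-community/Meta-Llama-3-8B-Instruct-4bit"),
        daemon=True,
    ).start()
    # A worker would then do something like:
    requests.put((1, "Hello"))
    print(responses.get())
```

Streaming responses would need per-request channels rather than a single shared response queue, but the overall shape is the same.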
