0.3.5b9 Memory leak with MLX models #63

Open
coughmedicine2020 opened this issue Dec 23, 2024 · 7 comments

Comments

@coughmedicine2020

Using an MLX conversion of a Llama 3.3 70B model in 8-bit, each request seems to cause a huge memory leak. I have a 33k context, and each request uses around 10 GB of memory, which is roughly the size the KV cache should be. Unloading and reloading the model resets memory usage to 79 GB, which is what I would expect given the model size and quant level.
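For reference, ~10 GB per request is in the right ballpark for a full-precision KV cache at this context length. A rough back-of-the-envelope sketch, assuming Llama 3.3 70B's published config (80 layers, 8 KV heads, head dim 128) and an fp16 cache; the exact figure depends on how MLX actually allocates the cache:

```python
# Rough KV-cache size estimate -- standard formula with assumed
# Llama 3.3 70B config values, not LM Studio's actual allocator.
def kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                   context_len=33_000, bytes_per_elem=2):  # fp16 cache
    # 2x for the separate K and V tensors in every layer
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

print(f"{kv_cache_bytes() / 1e9:.1f} GB")  # ~10.8 GB, close to the observed ~10 GB
```

If that allocation is never freed between requests, usage would grow by roughly one full cache per request, which matches the numbers reported above.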

@orcinus commented Dec 23, 2024

Experiencing the same thing with a 90B model. Deleting history from the context does not reduce the memory footprint.
The moment the first inference starts, memory usage starts to balloon and leak until it reaches 200+ GB (on a Mac Studio with 192 GB) and the engine eventually crashes.

@yagil (Member) commented Dec 24, 2024

Relevant comment here from the author of mlx-vlm:

> Will be fixed

@coughmedicine2020 (Author)

@orcinus I got rid of all my MLX files except for a 12B and a 22B model, which don't seem to leak with b10. You should give it a try.

@orcinus commented Dec 31, 2024

@coughmedicine2020 Tried it; it no longer leaks or OOMs, but there's a new problem: inference keeps getting slower and slower until it reaches multiple minutes per token and then just stops. Tested on Llama 3.2 90B Vision Instruct. Also tried with iogpu.disable_wired_collector enabled and iogpu.wired_limit_mb set to roughly 180 GB (the model takes up about 160 GB with context and all).

Edit: GPU utilization is also only approx. 50%.
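For anyone reproducing this, those two knobs are macOS kernel parameters applied with sysctl. A minimal sketch of setting the values mentioned above, assuming Apple Silicon and that these keys exist on your macOS version (check with `sysctl iogpu` first); the helper below is just an illustration and requires admin rights:

```python
# Sketch: apply the iogpu sysctl tweaks mentioned in the comment above.
# Assumes macOS on Apple Silicon; values do not persist across reboots.
import subprocess

def set_sysctl(key: str, value: int) -> None:
    """Set a kernel parameter via `sudo sysctl -w key=value`."""
    subprocess.run(["sudo", "sysctl", "-w", f"{key}={value}"], check=True)

set_sysctl("iogpu.disable_wired_collector", 1)
set_sysctl("iogpu.wired_limit_mb", 180 * 1024)  # ~180 GB, expressed in MB
```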

@coughmedicine2020 (Author)

I tried a 123B model I hand-converted to 8-bit for around 30 minutes and got a consistent ~5 t/s when the context is under 16k and ~4 t/s when the context is around 32k.

@orcinus commented Jan 3, 2025

> I tried a 123B model I hand-converted to 8-bit for around 30 minutes and got a consistent ~5 t/s when the context is under 16k and ~4 t/s when the context is around 32k.

Hmmm. Was it mllama too? For me it starts at 2-3 t/s, and then rapidly degrades until it's tens of seconds per token. And that's even with very small contexts (4k).

@coughmedicine2020 (Author)

It is a Mistral Large Instruct finetune.
