0.3.5b9 Memory leak with MLX models #63
Comments
Experiencing the same thing with 90B. Deleting history from context does not reduce the memory footprint.
Relevant comment here from the author of mlx-vlm: will be fixed.
@orcinus I got rid of all my mlx files except for a 12B and a 22B model, which do not seem to leak with b10. You should give it a try.
@coughmedicine2020 Tried it; it no longer leaks or OOMs, but there is a new problem: inference keeps getting slower and slower until it reaches multiple minutes per token, and then just stops. Tested on Llama 3.2 90B Vision Instruct. Also tried with iogpu.disable_wired_collector enabled and iogpu.wired_limit_mb set to roughly 180 GB (the model takes up about 160 GB with context and all). Edit: GPU utilization is also only around 50%.
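For anyone trying to pin down whether the slowdown tracks MLX's buffer cache growing between requests, here is a minimal monitoring sketch. It assumes an mlx version that exposes the mx.metal.* memory helpers; the exact names vary between releases, so treat it as a starting point rather than a fix.

```python
# Sketch: watch MLX's Metal memory between requests to see whether the
# slowdown correlates with buffer-cache growth. Assumes mx.metal.* memory
# helpers are available (names may differ in newer mlx releases).
import mlx.core as mx

def report(tag: str) -> None:
    gib = 1024 ** 3
    print(f"[{tag}] active={mx.metal.get_active_memory() / gib:.2f} GiB "
          f"cache={mx.metal.get_cache_memory() / gib:.2f} GiB "
          f"peak={mx.metal.get_peak_memory() / gib:.2f} GiB")

report("before request")
# ... run one generation request here ...
report("after request")

# If the cache number keeps climbing, dropping it between requests is a
# cheap thing to try (it does not fix a true leak in the KV cache itself):
mx.metal.clear_cache()
report("after clear_cache")
```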
I tried with a 123B model I hand-converted to 8-bit for around 30 minutes and got a consistent ~5 t/s when the context is under 16k and ~4 t/s when the context is around 32k.
Hmmm. Was it mllama too? For me it starts at 2-3 t/s and then rapidly degrades until it's tens of seconds per token. And that's even with very small contexts (4k).
It is a Mistral Large Instruct finetune.
Using an mlx conversion of an L3.3 70B model in 8-bit, each request seems to cause a huge memory leak. I have a 33k context, and each request uses around 10G of memory, which is roughly the size the KV cache should be. Unloading and reloading the model resets memory usage to 79G, which is what I would expect given the model size and quant level.
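For a rough sanity check on that ~10G figure, here is a back-of-the-envelope KV-cache estimate. The architecture numbers are assumptions for a Llama-3.3-70B-style model (80 layers, 8 KV heads, head dim 128) with an fp16 cache; the real footprint depends on how the cache is actually stored.

```python
# Back-of-the-envelope KV-cache size for one request at 33k context.
# Assumed architecture (Llama-3.3-70B-style): 80 layers, 8 KV heads (GQA),
# head dim 128, fp16 cache entries (2 bytes each).
layers, kv_heads, head_dim = 80, 8, 128
context_tokens = 33_000
bytes_per_elem = 2  # fp16

# Factor of 2 for keys and values
kv_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem
print(f"KV cache ≈ {kv_bytes / 1024**3:.1f} GiB")  # ≈ 10.1 GiB
```

So growing by roughly one KV cache worth of memory per request is consistent with the cache not being released between requests.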