0.3.5b9 Memory leak with MLX models #63

Open
coughmedicine2020 opened this issue Dec 23, 2024 · 7 comments

Comments

@coughmedicine2020

Using an MLX conversion of a Llama 3.3 70B model in 8-bit, each request seems to cause a huge memory leak. I have a 33k context, and each request uses around 10 GB of memory, which is roughly the size the KV cache should be. Unloading and reloading the model resets memory usage to 79 GB, which is what I would expect given the model size and quant level.
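For reference, ~10 GB per request is in the right ballpark for a full-precision KV cache at this context length. A rough back-of-the-envelope sketch, assuming Llama 3.3 70B's published config (80 layers, 8 KV heads, head dim 128) and an fp16 cache; the exact figure depends on how MLX actually allocates the cache:

```python
# Rough KV-cache size estimate -- standard formula with assumed
# Llama 3.3 70B config values, not LM Studio's actual allocator.
def kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                   context_len=33_000, bytes_per_elem=2):  # fp16 cache
    # 2x for the separate K and V tensors in every layer
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

print(f"{kv_cache_bytes() / 1e9:.1f} GB")  # ~10.8 GB, close to the observed ~10 GB
```

If that allocation is never freed between requests, usage would grow by roughly one full cache per request, which matches the numbers reported above.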

@orcinus commented Dec 23, 2024

Experiencing the same thing with a 90B model. Deleting history from the context does not reduce the memory footprint.
The moment the first inference starts, memory usage starts to balloon and leak until it reaches 200+ GB (on a Mac Studio with 192 GB) and the engine eventually crashes.

@yagil (Member) commented Dec 24, 2024

Relevant comment here from the author of mlx-vlm:

> Will be fixed

@coughmedicine2020 (Author)

@orcinus I got rid of all my MLX files except for a 12B and a 22B model, which don't seem to leak with b10. You should give it a try.

@orcinus commented Dec 31, 2024

@coughmedicine2020 Tried it; it no longer leaks or OOMs, but there's a new problem: inference keeps getting slower and slower until it reaches multiple minutes per token and then just stops. Tested on Llama 3.2 90B Vision Instruct. Also tried with iogpu.disable_wired_collector enabled and iogpu.wired_limit_mb set to roughly 180 GB (the model takes up about 160 GB with context and all).

Edit: GPU utilization is also only approx. 50%.
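For anyone reproducing this, those two knobs are macOS kernel parameters applied with sysctl. A minimal sketch of setting the values mentioned above, assuming Apple Silicon and that these keys exist on your macOS version (check with `sysctl iogpu` first); the helper below is just an illustration and requires admin rights:

```python
# Sketch: apply the iogpu sysctl tweaks mentioned in the comment above.
# Assumes macOS on Apple Silicon; values do not persist across reboots.
import subprocess

def set_sysctl(key: str, value: int) -> None:
    """Set a kernel parameter via `sudo sysctl -w key=value`."""
    subprocess.run(["sudo", "sysctl", "-w", f"{key}={value}"], check=True)

set_sysctl("iogpu.disable_wired_collector", 1)
set_sysctl("iogpu.wired_limit_mb", 180 * 1024)  # ~180 GB, expressed in MB
```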

@coughmedicine2020 (Author)

I tried a 123B model I hand-converted to 8-bit for around 30 minutes and got a consistent ~5 t/s when the context is under 16k and ~4 t/s when the context is around 32k.

@orcinus commented Jan 3, 2025

> I tried a 123B model I hand-converted to 8-bit for around 30 minutes and got a consistent ~5 t/s when the context is under 16k and ~4 t/s when the context is around 32k.

Hmmm. Was it mllama too? For me it starts at 2-3 t/s, and then rapidly degrades until it's tens of seconds per token. And that's even with very small contexts (4k).

@coughmedicine2020 (Author)

It is a Mistral Large Instruct finetune.
