Metal Backend not properly loading large models at 16GB of RAM #1568

Open
bayedieng opened this issue Jan 11, 2024 · 15 comments

@bayedieng
Contributor

bayedieng commented Jan 11, 2024

Ran the Phi 2 model with the metal feature enabled and it seems to hang, with about 7% GPU usage in Activity Monitor. This seems to be recent, as it ran at an adequate speed with some earlier commit, though I'm not sure which. I also ran Stable Diffusion Turbo and I'm getting these results:

[Screenshot: 2024-01-11 at 5:55:56 PM]

The accelerate feature seems to be twice as fast as the metal one, which was not the case before.

@bayedieng bayedieng changed the title Metal Backend seems to lag when running on Phi 2 Metal Backend seems to be slower Jan 11, 2024
@snehmehta

+1 for Phi 2 on my MacBook Pro M1:

  • CPU: 2.29 tokens/s
  • Metal: 1.46 tokens/s

@ivarflakstad
Member

I have an M1 Pro with 32 GB. Metal: 7.30 tokens/s vs accelerate: 2.28 tokens/s. Is it still slow for you?

@okpatil4u

okpatil4u commented Jan 15, 2024 via email

@ivarflakstad
Member

I'm on the main branch. That's why I'm asking if it is still slow for you :)

@okpatil4u

I haven't tried it for a really long time. I was waiting for this issue to be closed as an indication of MPS usability. Is it ready to be used now?

@ivarflakstad
Member

There is experimental Metal support. We're not using MPS right now - we might add it in the future as a fallback for compatibility reasons at some point.
So yes, there is Mac GPU support, and you can definitely play around with it, but don't expect insane speeds yet as we haven't started optimizing. It'll get there though.

@bayedieng
Contributor Author

I have an M1 Pro with 32 GB. Metal: 7.30 tokens/s vs accelerate: 2.28 tokens/s. Is it still slow for you?

Just pulled the latest commit. I'm getting less than a token/s, if the phi model manages to fully load at all (sometimes it just hangs), on my base 16-inch M2 MacBook Pro. The accelerate feature still outperforms metal on my machine, with 1.38 tokens/s vs 0.64 when it managed to run. Perhaps there is some large loading into memory going on which favors your 32 GB of RAM? I also tried Stable Diffusion Turbo and accelerate is still faster.

@ivarflakstad
Member

Memory could be the issue, but then I would expect your computer to be showing signs of that as you are running the model. Is it?

For comparison, could you run a phi-1 and see if the issue persists?
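
As a rough back-of-envelope check of the memory hypothesis (the figures below are approximate and are not from the thread): Phi-2 has roughly 2.7B parameters, so the f16 weights alone come to about 5.4 GB before the KV cache and intermediate buffers, which leaves little headroom on a 16 GB machine that is also running the OS and other apps.

```rust
// Back-of-envelope estimate only; actual usage also includes the KV cache,
// intermediate buffers, and whatever else is resident on the machine.
fn main() {
    let params = 2.7e9_f64; // approximate Phi-2 parameter count
    let bytes_per_param = 2.0; // f16
    let weights_gb = params * bytes_per_param / 1e9;
    println!("approx f16 weight memory: {weights_gb:.1} GB"); // ~5.4 GB
}
```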

@Narsil
Collaborator

Narsil commented Jan 16, 2024

@bayedieng could you try out #1523 maybe?

I think it's possible for metal to be slower if there's not enough memory available to run the model; otherwise it doesn't make a lot of sense.

Potential culprits:
1/ Fences
2/ simd sizes

The PR I linked removes the fences. They were necessary (still are, technically) to avoid bugs where kernels would run in an unexpected order, leading to differences in logits. However, I tried removing them, and the model overall still behaves correctly on all platforms I could test on, with a ~2x speedup on M3, so I went for it.

2/ https://github.com/huggingface/candle/blob/main/candle-metal-kernels/src/lib.rs#L1314 Here is the code that chooses the simd sizes for the actual matmul. The choice of those values has a very high impact on the overall speed of models, and ideally we need to tune them for each machine (M1, M2, M3, and it also depends on the RAM size). However, I'm not sure how to do that generally enough for now. You could still play around with the numbers and see how well it performs. Also, we might need a specialized gemv implementation for those A.B^t matmuls (the current matmul is highly optimized for the general case; A.B^t can be optimized further on its own because all those fetches are aligned).
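
To make the tuning point concrete, here is a minimal, hypothetical sketch of the kind of shape-dependent tile-size selection the linked line performs. This is not candle's actual kernel code; the type, thresholds, and constants are placeholders that would need per-device benchmarking.

```rust
// Hypothetical sketch: pick simdgroup/block tile sizes for a Metal GEMM based on
// the problem shape (m x k) * (k x n). Skinny decode-time matmuls (m == 1) behave
// like a GEMV and want small tiles; large prompt-processing matmuls want bigger
// tiles to amortize memory traffic.
#[derive(Debug, Clone, Copy)]
struct GemmTiles {
    block_m: usize,
    block_n: usize,
    block_k: usize,
}

fn pick_tiles(m: usize, n: usize, k: usize) -> GemmTiles {
    if m == 1 || n == 1 {
        GemmTiles { block_m: 8, block_n: 8, block_k: 16 }
    } else if m * n * k < (1 << 24) {
        GemmTiles { block_m: 16, block_n: 16, block_k: 16 }
    } else {
        GemmTiles { block_m: 32, block_n: 32, block_k: 16 }
    }
}

fn main() {
    // Single-token decoding against a 2560-wide weight matrix (Phi-2 hidden size).
    println!("{:?}", pick_tiles(1, 2560, 2560));
    // Prompt processing with a 512-token batch.
    println!("{:?}", pick_tiles(512, 2560, 2560));
}
```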

@bayedieng
Contributor Author

I get a "quantized not covered" error when running quantized phi 2. However, it does seem to be an issue with memory, as phi 1.5 outperforms accelerate as expected. Perhaps some of the SIMD values @Narsil mentioned changed at some point, making them less optimal for lower memory, because both phi 2 and stable diffusion turbo ran faster on metal than accelerate in the past.

@Narsil
Collaborator

Narsil commented Jan 17, 2024

because both phi 2 and stable diffusion turbo ran faster on metal than accelerate in the past.

The SIMD line is this single one: https://github.com/huggingface/candle/blob/main/candle-metal-kernels/src/lib.rs#L1277

Otherwise it's because of MPS usage (which we can't use because it's bugged and doesn't support arbitrary striding, which is necessary for candle to work properly for all models).

in the past.

Do you know when, or which commit/branch/version? That might help narrow it down.

@bayedieng bayedieng changed the title Metal Backend seems to be slower Metal Backend not properly loading large models at 16GB of RAM Jan 19, 2024
@bayedieng
Contributor Author

I tried the quantized phi 2 model and the Metal backend outperforms the accelerate framework as intended, so it does indeed seem to be a memory issue. I might have been wrong about an older version being faster, as I just tried Stable Diffusion Turbo on commit hash 85e5680 and accelerate was still faster than metal. Both inference and loading don't seem optimal for models that use more memory, and Phi 2 only seems to complete inference occasionally when using Metal.

@ivarflakstad
Member

@bayedieng I recently refurbished the buffer allocator for metal, which is now merged in main - would you mind checking if it has improved the issue? :)
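
For readers unfamiliar with what the buffer allocator does here, below is a minimal, hypothetical sketch of the buffer-reuse idea: a size-bucketed pool that hands back freed buffers instead of allocating fresh ones on every kernel launch. It is not candle's actual implementation, and `Vec<u8>` stands in for a Metal buffer.

```rust
use std::collections::HashMap;

/// Toy buffer pool: freed buffers are kept in per-size buckets for reuse.
struct BufferPool {
    free: HashMap<usize, Vec<Vec<u8>>>,
}

impl BufferPool {
    fn new() -> Self {
        Self { free: HashMap::new() }
    }

    /// Reuse a freed buffer of the requested size if available, else allocate.
    fn acquire(&mut self, size: usize) -> Vec<u8> {
        self.free
            .get_mut(&size)
            .and_then(|bucket| bucket.pop())
            .unwrap_or_else(|| vec![0u8; size])
    }

    /// Return a buffer so later requests of the same size can reuse it.
    fn release(&mut self, buf: Vec<u8>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new();
    let a = pool.acquire(1024);
    pool.release(a);
    let b = pool.acquire(1024); // reuses the buffer released above
    assert_eq!(b.len(), 1024);
}
```

Reducing allocation churn like this can lower peak memory and allocator pressure, which is presumably why it is worth re-testing on a 16 GB machine.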

@bayedieng
Contributor Author

I've attempted to run it and I am still dealing with the same issue: the entire system lags and the model still runs inference slowly. The accelerate framework is still outperforming metal.

@ivarflakstad
Member

Ok thanks. Could you try using cargo-instruments -t Allocations and share what it looks like? :)
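
For anyone reproducing this, an invocation along the following lines should capture the allocation trace (a sketch only: the example name, feature flag, and prompt flag are assumptions based on candle's examples, and cargo-instruments must be installed and run on macOS with Xcode's Instruments available):

```
cargo instruments -t Allocations --release --features metal --example phi -- --prompt "Hello"
```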
