Metal Backend not properly loading large models at 16GB of RAM #1568

Open
bayedieng opened this issue Jan 11, 2024 · 15 comments

@bayedieng
Contributor

bayedieng commented Jan 11, 2024

Ran the Phi 2 model with the metal feature enabled and it seems to hang, with about 7% GPU usage in Activity Monitor. This seems to be recent, as it ran at an adequate speed with some earlier commit, though I'm not sure which. I also ran Stable Diffusion Turbo and I'm getting these results:

[Screenshot: 2024-01-11 at 5:55:56 PM]

The accelerate feature seems to be twice as fast as the metal one, which was not the case before.

@bayedieng bayedieng changed the title Metal Backend seems to lag when running on Phi 2 Metal Backend seems to be slower Jan 11, 2024
@snehmehta

+1 for Phi 2 on my MacBook Pro M1:

  • CPU: 2.29 tokens/s
  • Metal: 1.46 tokens/s

@ivarflakstad
Member

I have an M1 Pro with 32 GB. Metal: 7.30 tokens/s vs accelerate: 2.28 tokens/s. Is it still slow for you?

@okpatil4u

okpatil4u commented Jan 15, 2024 via email

@ivarflakstad
Member

I'm on the main branch. That's why I'm asking if it is still slow for you :)

@okpatil4u

I haven't tried it for a really long time. I was waiting for this issue to be closed as an indication of MPS usability. Is it ready to be used now?

@ivarflakstad
Member

There is experimental Metal support. We're not using MPS right now - we might add it in the future as a fallback for compatibility reasons at some point.
So yes, there is Mac GPU support, and you can definitely play around with it, but don't expect insane speeds yet as we haven't started optimizing. It'll get there though.

@bayedieng
Contributor Author

I have an M1 Pro with 32 GB. Metal: 7.30 tokens/s vs accelerate: 2.28 tokens/s. Is it still slow for you?

Just pulled the latest commit. I'm getting less than a token/s, if the phi model manages to fully load at all (sometimes it just hangs), on my base 16-inch M2 MacBook Pro. The accelerate feature still outperforms metal on my machine, with 1.38 tokens/s vs 0.64 when it managed to run. Perhaps there is some large loading into memory going on which favors your 32 GB of RAM? I also tried Stable Diffusion Turbo and accelerate is still faster.

@ivarflakstad
Member

Memory could be the issue, but then I would expect your computer to be showing signs of that as you are running the model. Is it?

For comparison, could you run a phi-1 and see if the issue persists?
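
As a rough back-of-envelope check of the memory hypothesis (the figures below are approximate and are not from the thread): Phi-2 has roughly 2.7B parameters, so the f16 weights alone come to about 5.4 GB before the KV cache and intermediate buffers, which leaves little headroom on a 16 GB machine that is also running the OS and other apps.

```rust
// Back-of-envelope estimate only; actual usage also includes the KV cache,
// intermediate buffers, and whatever else is resident on the machine.
fn main() {
    let params = 2.7e9_f64; // approximate Phi-2 parameter count
    let bytes_per_param = 2.0; // f16
    let weights_gb = params * bytes_per_param / 1e9;
    println!("approx f16 weight memory: {weights_gb:.1} GB"); // ~5.4 GB
}
```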

@Narsil
Collaborator

Narsil commented Jan 16, 2024

@bayedieng could you try out #1523 maybe?

I think it's possible for metal to be slower if there's not enough memory available to run the model; otherwise it doesn't make a lot of sense.

Potential culprits:
1/ Fences
2/ simd sizes

The PR I linked removes the fences. They were necessary (still are, technically) to avoid bugs where kernels would run in an unexpected order, leading to differences in logits. However, I tried removing them, and the model overall still behaves correctly on all platforms I could test on, with a ~2x speedup on M3, so I went for it.

2/ https://github.com/huggingface/candle/blob/main/candle-metal-kernels/src/lib.rs#L1314 Here is the code that chooses the simd sizes for the actual matmul. The choice of those values has a very high impact on the overall speed of models, and ideally we need to tune them for each machine (M1, M2, M3, and it also depends on the RAM size). However, I'm not sure how to do that generally enough for now. You could still play around with the numbers and see how well it performs. Also, we might need a specialized gemv implementation for those A.B^t matmuls (the current matmul is highly optimized for the general case; A.B^t can be optimized further on its own because all those fetches are aligned).
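
To make the tuning point concrete, here is a minimal, hypothetical sketch of the kind of shape-dependent tile-size selection the linked line performs. This is not candle's actual kernel code; the type, thresholds, and constants are placeholders that would need per-device benchmarking.

```rust
// Hypothetical sketch: pick simdgroup/block tile sizes for a Metal GEMM based on
// the problem shape (m x k) * (k x n). Skinny decode-time matmuls (m == 1) behave
// like a GEMV and want small tiles; large prompt-processing matmuls want bigger
// tiles to amortize memory traffic.
#[derive(Debug, Clone, Copy)]
struct GemmTiles {
    block_m: usize,
    block_n: usize,
    block_k: usize,
}

fn pick_tiles(m: usize, n: usize, k: usize) -> GemmTiles {
    if m == 1 || n == 1 {
        GemmTiles { block_m: 8, block_n: 8, block_k: 16 }
    } else if m * n * k < (1 << 24) {
        GemmTiles { block_m: 16, block_n: 16, block_k: 16 }
    } else {
        GemmTiles { block_m: 32, block_n: 32, block_k: 16 }
    }
}

fn main() {
    // Single-token decoding against a 2560-wide weight matrix (Phi-2 hidden size).
    println!("{:?}", pick_tiles(1, 2560, 2560));
    // Prompt processing with a 512-token batch.
    println!("{:?}", pick_tiles(512, 2560, 2560));
}
```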

@bayedieng
Contributor Author

I get a "quantized not covered" error when running quantized phi 2. However, it does seem to be an issue with memory, as phi 1.5 outperforms accelerate as expected. Perhaps some of the SIMD values @Narsil mentioned changed at some point, making them less optimal for lower memory, because both phi 2 and stable diffusion turbo ran faster on metal than accelerate in the past.

@Narsil
Collaborator

Narsil commented Jan 17, 2024

because both phi 2 and stable diffusion turbo ran faster on metal than accelerate in the past.

The SIMD line is this single one: https://github.com/huggingface/candle/blob/main/candle-metal-kernels/src/lib.rs#L1277

Otherwise it's because of MPS usage (which we can't use because it's bugged and doesn't support arbitrary striding, which is necessary for candle to work properly for all models).

in the past.

Do you know when, or which commit/branch/version? That might help narrow it down.

@bayedieng bayedieng changed the title Metal Backend seems to be slower Metal Backend not properly loading large models at 16GB of RAM Jan 19, 2024
@bayedieng
Contributor Author

I tried the quantized phi 2 model and the Metal backend outperforms the accelerate framework as intended, so it does indeed seem to be a memory issue. I might have been wrong about an older version being faster, as I just tried Stable Diffusion Turbo on commit hash 85e5680 and accelerate was still faster than metal. Both inference and loading don't seem optimal for models that use more memory, and Phi 2 only seems to complete inference occasionally when using Metal.

@ivarflakstad
Member

@bayedieng I recently refurbished the buffer allocator for metal, which is now merged in main - would you mind checking if it has improved the issue? :)
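
For readers unfamiliar with what the buffer allocator does here, below is a minimal, hypothetical sketch of the buffer-reuse idea: a size-bucketed pool that hands back freed buffers instead of allocating fresh ones on every kernel launch. It is not candle's actual implementation, and `Vec<u8>` stands in for a Metal buffer.

```rust
use std::collections::HashMap;

/// Toy buffer pool: freed buffers are kept in per-size buckets for reuse.
struct BufferPool {
    free: HashMap<usize, Vec<Vec<u8>>>,
}

impl BufferPool {
    fn new() -> Self {
        Self { free: HashMap::new() }
    }

    /// Reuse a freed buffer of the requested size if available, else allocate.
    fn acquire(&mut self, size: usize) -> Vec<u8> {
        self.free
            .get_mut(&size)
            .and_then(|bucket| bucket.pop())
            .unwrap_or_else(|| vec![0u8; size])
    }

    /// Return a buffer so later requests of the same size can reuse it.
    fn release(&mut self, buf: Vec<u8>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new();
    let a = pool.acquire(1024);
    pool.release(a);
    let b = pool.acquire(1024); // reuses the buffer released above
    assert_eq!(b.len(), 1024);
}
```

Reducing allocation churn like this can lower peak memory and allocator pressure, which is presumably why it is worth re-testing on a 16 GB machine.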

@bayedieng
Contributor Author

I've attempted to run it and I am still dealing with the same issue: the entire system lags and the model still runs inference slowly. The accelerate framework is still outperforming metal.

@ivarflakstad
Member

Ok thanks. Could you try using cargo-instruments -t Allocations and share what it looks like? :)
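
For anyone reproducing this, an invocation along the following lines should capture the allocation trace (a sketch only: the example name, feature flag, and prompt flag are assumptions based on candle's examples, and cargo-instruments must be installed and run on macOS with Xcode's Instruments available):

```
cargo instruments -t Allocations --release --features metal --example phi -- --prompt "Hello"
```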
