Metal Backend not properly loading large models at 16GB of RAM #1568
+1 for Phi 2 on my MacBook Pro M1
I have an M1 Pro 32GB. Metal: 7.30 tokens/s vs accelerate: 2.28 tokens/s. Is it still slow for you?
Thanks! Which branch are you referring to? When would you be merging it with the main branch?
I'm on the main branch. That's why I'm asking if it is still slow for you :)
I haven't tried for a really long time. I was waiting for this issue to be closed as an indication of MPS usability. Is it ready to be used now?
There is experimental Metal support. It isn't using MPS right now; we might add that in the future as a fallback, for compatibility reasons.
Just pulled the latest commit. I'm getting less than a token/s, if the Phi model manages to fully load at all (sometimes it just hangs), on my base 16-inch M2 MacBook Pro. The accelerate feature still outperforms it on my machine, with 1.38 tokens/s vs 0.64 on Metal when it managed to run. Perhaps there is some large memory load going on which favors your 32GB of RAM? I also tried Stable Diffusion Turbo, and accelerate is still faster there too.
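As an aside, the tokens/s figures being compared in this thread are just generated-token count over wall-clock decode time. A minimal Rust sketch of that measurement (a hypothetical helper, not part of candle):

```rust
use std::time::Instant;

/// Hypothetical helper (not candle code): decoding throughput in tokens/s.
fn tokens_per_second(n_tokens: usize, elapsed_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 {
        return 0.0;
    }
    n_tokens as f64 / elapsed_secs
}

fn main() {
    let start = Instant::now();
    // ... the generation loop would run here ...
    let _elapsed = start.elapsed().as_secs_f64();
    // Using the accelerate figure above as an example: ~1.38 tokens/s.
    println!("{:.2}", tokens_per_second(100, 72.5)); // prints "1.38"
}
```

Timing only the decode loop (excluding model load) matters here, since the reports above suggest loading itself may be the slow part on 16GB machines.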
Memory could be the issue, but then I would expect your computer to show signs of that while you are running the model. Is it? For comparison, could you run Phi-1 and see if the issue persists?
@bayedieng could you try out #1523 maybe? It's possible for Metal to be slower if there's not enough memory available to run the model, I think; otherwise it doesn't make a lot of sense. Potential culprits:

1. The PR I linked removes the fences. They were necessary (still are, technically) to avoid bugs where kernels would run in an unexpected order, leading to differences in logits. However, I tried removing them, and the model overall still behaves correctly on all platforms I could test on, with a ~2x speedup on M3, so I went for it.
2. https://github.com/huggingface/candle/blob/main/candle-metal-kernels/src/lib.rs#L1314 — here is the code that chooses the simd sizes for the actual matmul. The choice of those values has a very high impact on overall model speed, and ideally we need to tune them for each machine (M1, M2, M3, and it depends on the RAM size too). However, I'm not sure how to do that generally enough for now. You could still play around with the numbers and see how well it performs. Also, we might need a specialized gemv implementation for those A.B^T matmuls (the current matmul is highly optimized for the general case; A.B^T can be optimized further on its own because all those fetches are aligned).
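To illustrate the kind of shape-dependent tuning described above, here is a hypothetical Rust heuristic. This is not candle's actual kernel-selection code (the real hand-tuned values live in `candle-metal-kernels/src/lib.rs`); it only sketches the idea that the best simdgroup tile depends on the matmul's output shape:

```rust
/// Hypothetical heuristic, NOT candle's actual code: choose a simdgroup tile
/// (rows, cols) for a matmul from the output shape (m, n).
fn pick_tile(m: usize, n: usize) -> (usize, usize) {
    if m == 1 {
        // Decode-time A.B^T is effectively a gemv: go wide along n.
        (1, 32)
    } else if m * n >= 4096 {
        // Large problems: big balanced tiles amortize memory fetches.
        (32, 32)
    } else {
        // Small problems: smaller tiles avoid idle lanes.
        (16, 16)
    }
}

fn main() {
    assert_eq!(pick_tile(1, 4096), (1, 32));
    assert_eq!(pick_tile(256, 256), (32, 32));
    assert_eq!(pick_tile(8, 8), (16, 16));
    println!("ok");
}
```

The thresholds and tile sizes here are invented for illustration; as the comment above says, good values differ per chip generation and memory size, which is why a single hard-coded choice can regress on some machines.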
I get a "quantized not covered" error when running quantized Phi 2. However, it does seem to be an issue with memory, as Phi 1.5, as expected, outperforms accelerate. Perhaps some of the SIMD values @Narsil mentioned changed at some point, making them less optimal for lower memory, because both Phi 2 and Stable Diffusion Turbo ran faster on Metal than on accelerate in the past.
The SIMD line is this single one: https://github.com/huggingface/candle/blob/main/candle-metal-kernels/src/lib.rs#L1277. Otherwise it's because of MPS usage (which we can't use because it's buggy and doesn't support the arbitrary striding that candle needs to work properly for all models).
Do you know when, or which commit/branch/version? That might help narrow it down.
I tried the quantized Phi 2 model and the Metal backend outperforms the accelerate framework as intended, so it does indeed seem to be a memory issue. I might have been wrong about an older version being faster: I have just tried Stable Diffusion Turbo on commit hash 85e5680 and accelerate was still faster than Metal. Both inference and loading don't seem optimal for models that use more memory, and Phi 2 only seems to fully run inference occasionally when using Metal.
@bayedieng I recently refurbished the buffer allocator for Metal, which is now merged into main. Would you mind checking whether it has improved the issue? :)
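For readers unfamiliar with what a buffer allocator buys you here: the general idea is to reuse GPU buffers across ops instead of allocating a fresh one per kernel launch, which reduces allocation pressure on memory-constrained machines. A hypothetical sketch of a size-bucketed pool (this is NOT candle's actual Metal allocator; a plain `Vec<u8>` stands in for an `MTLBuffer`):

```rust
use std::collections::HashMap;

/// Hypothetical size-bucketed buffer pool, illustrating buffer reuse.
struct BufferPool {
    free: HashMap<usize, Vec<Vec<u8>>>, // byte size -> reusable buffers
}

impl BufferPool {
    fn new() -> Self {
        Self { free: HashMap::new() }
    }

    /// Hand out a free buffer of exactly `size` bytes, or allocate a new one.
    fn get(&mut self, size: usize) -> Vec<u8> {
        if let Some(bufs) = self.free.get_mut(&size) {
            if let Some(buf) = bufs.pop() {
                return buf;
            }
        }
        vec![0u8; size]
    }

    /// Return a buffer so a later `get` of the same size can reuse it.
    fn put(&mut self, buf: Vec<u8>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new();
    let a = pool.get(1024);
    let addr = a.as_ptr() as usize;
    pool.put(a);
    let b = pool.get(1024); // reuses the allocation we just returned
    assert_eq!(b.as_ptr() as usize, addr);
    println!("reused");
}
```

One caveat a real allocator has to handle that this sketch ignores: a pool like this never frees, so on a 16GB machine it also needs a release policy (size caps, LRU eviction, or freeing when the device reports memory pressure).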
I've attempted to run it and I'm still dealing with the same issue: the entire system lags and the model still runs inference slowly. The accelerate framework is still outperforming Metal.
Ok, thanks. Could you try using
Ran the Phi 2 model with the Metal features enabled and it seems to hang, with about 7% GPU usage in Activity Monitor. This seems to be recent, as it ran at an adequate speed with some earlier commit, though I'm not sure which. I also ran Stable Diffusion Turbo and I'm getting these results:
The accelerate feature seems to be twice as fast as the Metal one, which was not the case before.