Apple silicon (MPS backends) support? #313
It should work out of the box! Performance is not optimized, and we haven't even started a Metal backend yet, but it's on the roadmap! LaurentMazare/gemm#5 Original repo and credits for the actual performance tricks: https://github.com/sarah-ek/gemm (we forked with sarah-ek's blessing and hopefully will merge upstream not too far in the future). |
This would be a huge addition. Have you thought about https://github.com/philipturner/metal-flash-attention as a flash attention alternative? |
I wasn't aware of that project. In general, for adding backends, the biggest thing we can do is add support for custom ops from day 1, whether it's Metal, WebGPU, or ROCm. That's because we cannot build ALL the ops for all the backends in a timely fashion. However, having used quite a few frameworks, usually you need only one op for your particular use case (like f16 conv2d, or flash attention, or GPTQ), and you don't care about making sure it works with backprop and 200 other ops and op orders. |
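To make the custom-op idea concrete, here is a minimal illustrative sketch in Rust. This is not candle's actual API; the trait, the method names, and the `Gelu` example are hypothetical, and a real implementation would dispatch on backend storage types rather than plain slices.

```rust
/// Illustrative only -- not candle's real trait. The point: a custom op brings
/// its own kernel(s), so a new backend only needs the handful of ops a given
/// model actually requires instead of the full op set.
trait CustomOp {
    fn name(&self) -> &'static str;

    /// CPU reference implementation, always available.
    fn cpu_fwd(&self, input: &[f32]) -> Vec<f32>;

    /// Optional backend-specific implementation; `None` means "fall back to CPU".
    fn metal_fwd(&self, _input: &[f32]) -> Option<Vec<f32>> {
        None
    }
}

/// Hypothetical op that only needs a forward pass (no backprop wiring).
struct Gelu;

impl CustomOp for Gelu {
    fn name(&self) -> &'static str {
        "gelu"
    }
    fn cpu_fwd(&self, input: &[f32]) -> Vec<f32> {
        // tanh approximation of GELU
        input
            .iter()
            .map(|&x| 0.5 * x * (1.0 + (0.797_884_56 * (x + 0.044_715 * x * x * x)).tanh()))
            .collect()
    }
}
```

A Metal or WebGPU backend would then override just `metal_fwd` (or its equivalent) for the one or two ops a model needs, falling back to the CPU path everywhere else.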
Thanks, that makes sense. Are you planning to add Apple MPS support in version 2.0? |
Here are the current ideas floating around, so not really. |
I am also for MPS support. |
Any idea when MPS support will be launched? |
Would a Vulkan compute shader backend get performance close to MPS? |
Someone like @philipturner could comment on this. |
Both Vulkan and MPS would have terrible compute performance. Both less than 50% of the max utilization for some important use cases. Vulkan like 20%, as slow as just running on CPU/Accelerate. GEMM and related ops are very difficult to get right with custom shaders. LLaMA.cpp is an exception, they can make custom Metal shaders because the compute bottleneck isn't the type of operations that require a lot of expertise. To use Vulkan shaders, you either need GLSL (archaic) or WGSL. The latter is more advanced, but still lacks some very important features to reach maximum GPU performance. We're talking a factor of 3 times slower in the parts that matter in some instances. You need SIMD reductions, SIMD matrix multiplications, etc. None of those are available in any API except Metal (though I did backdoor the macOS-only OpenCL framework for a portion of those, but OpenCL doesn't work on iOS). For MPS, the lack of control over CPU-side overhead can cause major issues for small data sets. Also, people generally don't consider CPU-GPU communication latency or GPU command encoding overhead as something to optimize for. They write a CPU-side wrapper in a host language that eagerly dispatches the GPU commands one-at-a-time. CUDA was optimized for this, but Metal was optimized for a completely different usage pattern. However, Apple's OpenCL driver interestingly does have the automatic batching present in CUDA and ROCm. |
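To illustrate the command-encoding point above in Rust, here is a minimal sketch (assuming the `metal` crate, i.e. metal-rs; the buffer sizes and the blit workload are placeholders). It batches many GPU commands into a single command buffer and commits once, instead of paying the CPU-side dispatch cost per operation the way an eagerly dispatching wrapper does.

```rust
use metal::{Device, MTLResourceOptions};

fn main() {
    let device = Device::system_default().expect("no Metal device found");
    let queue = device.new_command_queue();

    // Placeholder workload: two 1 MiB buffers and 100 copy commands.
    let len: u64 = 1 << 20;
    let src = device.new_buffer(len, MTLResourceOptions::StorageModeShared);
    let dst = device.new_buffer(len, MTLResourceOptions::StorageModeShared);

    // One command buffer, one encoder, one commit for all 100 commands --
    // versus the "eager" pattern of one commit (and CPU round trip) per command.
    let cmd_buf = queue.new_command_buffer();
    let blit = cmd_buf.new_blit_command_encoder();
    for _ in 0..100 {
        blit.copy_from_buffer(&src, 0, &dst, 0, len);
    }
    blit.end_encoding();
    cmd_buf.commit();
    cmd_buf.wait_until_completed();
}
```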
Does Accelerate run on the ANE? I have observed that performance does not scale beyond a single rayon thread when using Accelerate. Wouldn't MPS be better in this case, since the ANE is smaller than the Apple GPU?
|
I put in some performance metrics for the M1 Max chip, to give some context.
Accelerate runs on the CPU. It uses the AMX coprocessor, a hardware-accelerated matrix multiplier for the CPU. Its primary benefit is low latency and usage in very difficult-to-parallelize tasks, like matrix factorization. The way this hardware works, every group of 4-6 CPU cores comes with a single accelerator. In many cases, multithreading or using multiple accelerators (e.g. M3 Max, with two P-blocks) actually harms performance. Accelerate automatically promotes supermassive tasks to 2 threads when appropriate.
Metal runs on the GPU. MPS is a proprietary closed-source library with some GPU kernels. MPSGraph is a domain-specific language for Swift, which automatically uses the ANE on certain devices. On iPhone and M1 (non-Pro), you can sometimes activate ANE by using FP16 and a supermassive matrix (4000 x 4000 x 4000). Generally, it's only 2x as fast as the GPU for GEMM. I wrote a library that's an alternative to MPS, and only uses the GPU. But it gives you more programmatic access to the GPU and implements GEMM algorithms better than Apple's MPS team.
ANE is very hard to get full programmatic access to. Behind the scenes, it's a programmable fused multiply-add engine. You have to hard-code the weights into the assembly language, and it only supports FP16, so it can't be used for AI training. Not very useful for general-purpose compute, just for AI inferencing through CoreML.
|
Thanks. This is super useful. Can MFA support quantized operations as well (4-bit, 5-bit)? If yes, what kind of benchmarks should one expect? What would be a good starting point for a Rust developer? FYI, significant Metal development is already in progress for the candle framework. |
I just realized that PyTorch has some Metal shaders (https://github.com/pytorch/pytorch/tree/c233cef8fd648157fbaf85e1f2493562316a1ec4/aten/src/ATen/native/metal/ops). Might be useful for the candle Metal backend. |
This is not really Metal, just some CPU code that calls into MPS.
PyTorch's shaders are very basic elementwise ones. Similar to LLaMA.cpp in scope. The difficult/important one is GEMM-like operations, which are non-trivial to optimize.
I made MFA so that you can modify it yourself. You can make a fork, tweak the shaders, support different quantizations if you want. Although at least for SDXL, it's advantageous to dequantize into a separate buffer before running GEMM. Less effort to write a new shader, potentially faster (due to less redundant compute). |
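As a concrete sketch of the "dequantize into a separate buffer, then run a regular GEMM" approach mentioned above (the 4-bit layout here, two values per byte with one scale per block of 32, is illustrative and not any specific gguf/GPTQ format):

```rust
/// Expand 4-bit packed weights into f32 so a standard GEMM kernel can consume
/// them. `scales` holds one scale per block of `block` dequantized values.
fn dequantize_q4(packed: &[u8], scales: &[f32], block: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for (i, &byte) in packed.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8; // low nibble, re-centered around zero
        let hi = (byte >> 4) as i32 - 8;   // high nibble
        let scale = scales[(i * 2) / block];
        out.push(lo as f32 * scale);
        out.push(hi as f32 * scale);
    }
    out
}
```

The trade-off is extra memory traffic for the temporary buffer in exchange for reusing a well-tuned GEMM kernel instead of writing and optimizing a new fused one.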
I have had a chance to look into MFA earlier. To be honest the project is intimidating to get started with. The documentation is scarce. A quick comparison would be https://github.com/Dao-AILab/flash-attention. If this entry barrier is removed, I am sure I could introduce MFA to my group of developers. |
It's ~500-1000 lines of shader code for GEMM and another ~500-1000 for Attention. The source files are mostly self-documenting, with a lot of Metal function constants. You can enable or disable each function constant while creating the pipeline. The matrix dimensions need to be known at pipeline creation time; that way, some clever constant-folding magic happens in the backend compiler. It's extremely simple, but extremely powerful code generation, executed much more effectively and robustly than MPS. The function constants describing matrix dimensions are usually capital letters (e.g. M, N, K). I pre-compiled some Metal binaries and hosted them on the GitHub releases page. That removes the need to download esoteric Xcode versions and go through the complex build process. Just copy the binary file and load the compiled library at runtime. |
@ivarflakstad any interest in commenting? This is a Rust framework trying to use Metal for matrix multiplication. |
Yes! I'm actually involved in this work already, working with @Narsil. We've also been working on adding MPS support. I began working on a wrapper around metal-flash-attention here, but haven't had time to complete it. GEMM works though 😊 |
As a comment on documentation, dauntingness, etc.: I found the Metal code itself fairly straightforward (if we take into consideration that we're talking about GPGPU GEMM, flash attention, and pushing the limits of what is possible with Metal). It was actually the Swift tests orchestrating the Metal execution that took me a while to understand: calculating the correct values for function constants, the async pipeline, cached operations, etc. (Perhaps most importantly, @philipturner kindly answered all my dumb questions 😉) |
Also, MFA is currently not running optimally on the new A17/M3 architecture. It requires some major rewriting, which should get funded next summer. In the meantime, use MPS for FP32 GEMM and MFA for Attention (on
My primary interest is in a different field. I do document stuff very well when I want to (example). Just doing MFA as work during the summers.
When you write your own codebase, and write both the Swift and C++ impl, this is the result. I do try to make the code as legible as possible, using modern coding practices. The unit tests got really unwieldy, just enough to "do the job" given time constraints. The tensor library, I wrote from scratch. I needed to bypass issues with CPU overhead to properly benchmark performance. Batching almost 100 GPU commands into a single command buffer, yet using an API that's semantically eager. A bit of compiler/DSL engineering.
I remember boasting to Tri Dao about having the shortest, most elegant implementation on his list (inspired by Tinygrad). Other codebases are at least tens of thousands of lines and took >1 month to fully comprehend. Also, I came up with FlashAttention-2 independently before Tri released the algorithm. |
Thanks for clearing that up. This looks very impressive from where I stand. I come from a molecular simulations background as well; I worked on OpenMM in 2011-15, in its early days. Finding a minimal working example was the issue for me when I briefly went through MFA. @philipturner, a working 100-line Swift example would be great if you could spare some time. Thank you. Edit: Never mind, I just saw this: https://github.com/ivarflakstad/metal-flash-attention-rs. This is a good starting point. |
If you're interested contributions are very welcome, @okpatil4u 😊 |
It may be an older iteration of the Metal function constants API, but here it is (200 lines): https://gist.github.com/philipturner/60c9b196a2e9361f1ec15a99a9267268 Edit: Yeah, this seems old because there are no function constants for explicitly setting the block sizes. |
Thanks @philipturner @ivarflakstad, I will look into it. |
Is there anything I can do? I want to work with you. |
This PR implements some basic Metal support. It seems to work OK (tried M1 + M3), although speedups are only available on M3 for Phi and larger models. |
I wonder if Apple fixed the sequential throughput bottleneck in their drivers with the M3 generation. I'll have to benchmark my A17 Pro when I have the time. |
Quantized models are still using the CPU, as the CPU device is hardcoded for now. |
Hey @Narsil, just checking if there has been an update on this. |
On quantized or metal support? Can't comment on quantized. |
I was checking on Metal support. Apple recently released MLX; not sure if you have had a chance to look at it. The Mistral 16-bit example is pretty fast, and prompt evaluation time is almost nonexistent, which removes the need for flash attention, something that even llama.cpp lacked on the Apple silicon architecture. It even has a C++ API that could be readily used. Would this be useful for what you are working on? |
Yes, I saw it :) Very cool stuff. |
Their GEMM kernels are slower than MPS because they don't use SIMD-group async copy instructions. This is the reason MFA was so finicky. You had to use an older Xcode toolchain to access the hardware instructions that Apple hid from the official shading language. I doubt MLX did rigorous benchmarks of how well their code performs across all possible matrix sizes. |
Metal/MPS support has been added for a couple months now, let's close this issue and open new ones if new problems arise. |
Please point me to what I should do to use candle with Mac M1 GPU support; I was unable to find it in the README. In my case, it utilizes only the CPU and maybe 10% of the GPU. |
It already utilizes the GPU, but GPU utilization depends on several factors. Note that not all operations can make use of the GPU; this goes for all computations in general, not just AI. To utilize the GPU well, the ratio of computation to memory transfers must be extremely high. For example, LLMs make poor use of GPUs and excellent use of CPUs because they are memory-bandwidth bound. |
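To make the compute-to-memory ratio concrete (a back-of-the-envelope figure, not a measurement): during single-token decoding, each weight is read from memory once and used in roughly one multiply-add, so the arithmetic intensity is on the order of 2 FLOPs per weight read, i.e. about 1 FLOP per byte at FP16 and about 4 FLOPs per byte at 4-bit. Keeping a GPU's ALUs busy typically requires tens of FLOPs per byte, so the limiter is memory bandwidth and reported GPU utilization stays low even when the hardware is doing as much useful work as the memory system allows.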
My goal is to load MADLAD-400 on my MacBook M1 8GB (the only hardware I have), so I'm forced to use a quantized .gguf version of this model. The only way to achieve this is via candle (as described in the official Google MADLAD-400 HF repo). Normally I use llama.cpp to run quantized models, and in the case of 4-bit quantization of 7B and even 8B models (the latest Llama 3), it performs quite fast on the GPU. Meanwhile, when I try to run MADLAD-400 7B or even 3B with candle, I see that GPU resources are not used at all; it looks like even CPU resources are not used at full power, and I see only 1 token/sec maximum, which makes it useless for me. I suppose that I need to use some additional parameter to run it or compile it another way. I'm using this approach from the documentation:
But I was unable to find any other extra parameters or whatever to run this package with GPU support. |
You will want to use the `metal` feature, i.e. build/run with `--features metal`. |
Can you please suggest the exact command I need to execute? From a video I saw on YouTube, this parameter should be used during the cargo build. I'm not experienced with Rust at all, and I was unable to find any written instructions that mention this parameter (--features metal); the only result I see on Google when searching for this issue is this discussion. |
I reinstalled cargo, candle, and everything, then executed `cargo build --features metal` in my candle directory. No luck: candle definitely still uses the CPU instead of the GPU; it doesn't utilize the GPU at all. Please point out what I am doing wrong. |
In other words, the feature has to be enabled for the binary you actually run (i.e. passed to `cargo run`), not just to a separate `cargo build`. In summary, you want something like the sketch below.
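Here is a sketch of what that looks like end to end (assuming the `candle_core` crate; the example name and arguments on the command line are placeholders, and the device selection mirrors the `Device::new_metal(0)` / `Device::Cpu` snippet quoted later in this thread). Run with something like `cargo run --release --features metal --example <your-example> -- <args>`, and make sure the code requests the Metal device rather than a hard-coded CPU one:

```rust
use candle_core::Device;

/// Pick the Metal device when available, otherwise fall back to the CPU.
/// `Device::new_metal(0)` only succeeds when candle was built with the
/// `metal` feature and a Metal-capable GPU is present.
fn pick_device(force_cpu: bool) -> Device {
    if force_cpu {
        return Device::Cpu;
    }
    Device::new_metal(0).unwrap_or(Device::Cpu)
}
```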
|
Looks like CPU is hard-coded?
|
Hah. Nice catch. |
Yes |
I take it candle is using MFA for the GEMM kernel (not the attention kernel)? If so, you can obsolete the current one with this: https://gist.github.com/philipturner/84f613a5cc745460a914d2c6ad226131 (optimized for M3 and BF16). |
Hey folks, I made 2 small changes as suggested above to the code and used:
for both load and build methods and when I run
I still receive a weird error on M1 Ultra: |
Use `let device = Device::new_metal(0)?;` to replace `let device = Device::Cpu;`. But even with Metal, the rate on an M2 is 2-3 tokens/sec, which is still a little slow.
|
“a little slow” is a bit subjective. Not a lot of context to evaluate how to improve on the speed. Have you made a roofline model of the minimum latency per token, by dividing the model size (in GB) by the hardware bandwidth (in GB/s)? Does the model use speculative execution to amplify sequential throughput? Are there any major compute-bound components? |
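As a worked example of that roofline calculation, with round, assumed numbers rather than measurements: a 4-bit 7B-parameter model is roughly 4 GB of weights, and a base M2 has on the order of 100 GB/s of unified memory bandwidth, so the best case for plain sequential decoding is about 4 GB / 100 GB/s = 40 ms per token, i.e. roughly 25 tokens/s. Rates far below that bound suggest the limiter is something other than memory bandwidth, such as command-encoding overhead, CPU-side work, or operations falling back to the CPU.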
Support running on Macbook?