Status of MPS Support #181
I implemented the benchmarks in #182 now. They are implemented in https://github.com/computational-cell-analytics/micro-sam/blob/dev/development/benchmark.py. @GenevieveBuckley @Marei33, could you please run these benchmarks on your laptops with the following four settings and then paste the results here? To run them you will need to check out the latest dev branch. For reference, here are the results for ...
Link to gist with the results from yesterday (before we had the benchmark script) that caused us to start this investigation: https://gist.github.com/GenevieveBuckley/874b6b282b388524b6d62f25b3b9bb1c
Note: to run the benchmark script, pandas requires the (usually optional) dependency
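For context, a small hedged illustration of why an extra dependency comes into play: the benchmark results are printed as markdown tables, and pandas' `DataFrame.to_markdown` relies on the optional `tabulate` package for that formatting, which is presumably the dependency referred to above. The snippet below is illustrative only, not taken from the benchmark script.

```python
# Illustrative only (not the benchmark script itself): pandas prints markdown
# tables via DataFrame.to_markdown, which needs the optional tabulate package
# and raises an ImportError if it is missing.
import pandas as pd

df = pd.DataFrame(
    {"model": ["vit_h"], "device": ["cpu"], "benchmark": ["embeddings"], "runtime": [14.7882]}
)
print(df.to_markdown(index=False))
```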
Summary: something about AMG is really killing the performance for the MPS backend. We'll need to do some line profiling on that part of the code to get more information. I am suspicious that the fallback to CPU might somehow involve a lot of transferring large tensors back and forth between the CPU and MPS, and perhaps that is a large part of why it is so much slower than the CPU-only computation.
I also see a conversion to int32 happening automatically (it's possible the int64 tensor is being created in segment-anything, not in the micro-sam code).
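A hedged sketch of what avoiding that implicit conversion could look like (illustrative only, not micro-sam's actual code): cast to int32 explicitly before moving tensors to the MPS device, instead of relying on a silent downcast.

```python
# Illustrative sketch only: cast prompt coordinates to int32 explicitly before
# moving them to MPS, rather than relying on an automatic int64 -> int32 conversion.
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
points = torch.tensor([[10, 20], [30, 40]])           # int64 by default on CPU
points = points.to(dtype=torch.int32, device=device)  # explicit, no silent downcast
print(points.dtype, points.device)
```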
Line profiling with line_profiler: the slowest part is in ...

CPU (profiler output not reproduced here)

MPS (profiler output not reproduced here)
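For reference, a minimal sketch of the kind of line_profiler usage involved here; the profiled function below is a placeholder, not the actual micro-sam code that was profiled.

```python
# Hypothetical example: the profiled function is a stand-in, not micro-sam code.
# A LineProfiler instance can be used as a decorator to collect per-line timings.
from line_profiler import LineProfiler

def suspected_hot_function(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = LineProfiler()
profiled = profiler(suspected_hot_function)  # wrap the function for line timing
profiled(100_000)
profiler.print_stats()
```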
The speed is not better with pytorch-nightly, sadly. But the warning messages about CPU fallback are gone with pytorch-nightly, so it seems my suspicion about that being the reason things are slow is not correct.

EDIT: it might still be possible that there is somehow a lot of transfer of information between CPU and MPS regardless. But it is very slow. It would be good to have some CUDA benchmarks to compare the CPU performance to as well.

Details

```
(test-micro-sam-mps-nightly) genevieb@dyn-130-194-109-212 development % python benchmark.py -m vit_h -d cpu
Running benchmarks for vit_h with device: cpu
Running benchmark_embeddings ...
Running benchmark_prompts ...
Running benchmark_amg ...
| model   | device   | benchmark             |    runtime |
|:--------|:---------|:----------------------|-----------:|
| vit_h   | cpu      | embeddings            | 14.7882    |
| vit_h   | cpu      | prompt-p1n0           |  0.0370929 |
| vit_h   | cpu      | prompt-p2n4           |  0.0354571 |
| vit_h   | cpu      | prompt-box            |  0.0344546 |
| vit_h   | cpu      | prompt-box-and-points |  0.0367029 |
| vit_h   | cpu      | amg                   | 41.8242    |
(test-micro-sam-mps-nightly) genevieb@dyn-130-194-109-212 development % python benchmark.py -m vit_h -d mps
Running benchmarks for vit_h with device: mps
Running benchmark_embeddings ...
Running benchmark_prompts ...
Running benchmark_amg ...
| model   | device   | benchmark             |     runtime |
|:--------|:---------|:----------------------|------------:|
| vit_h   | mps      | embeddings            |   9.93057   |
| vit_h   | mps      | prompt-p1n0           |   0.032023  |
| vit_h   | mps      | prompt-p2n4           |   0.030647  |
| vit_h   | mps      | prompt-box            |   0.0267961 |
| vit_h   | mps      | prompt-box-and-points |   0.0303428 |
| vit_h   | mps      | amg                   | 282.099     |
(test-micro-sam-mps-nightly) genevieb@dyn-130-194-109-212 development % conda list | grep torch
pytorch                    2.2.0.dev20230907    py3.10_0     pytorch-nightly
torchvision                0.17.0.dev20230907   py310_cpu    pytorch-nightly
```
Hi @GenevieveBuckley,
No, not really 😢 (n=1 iterations, because this benchmark takes a long time to run).
However, your theory about memory management problems with MPS may still be correct. It might just be happening in a different place. The run length encoding function seems to be part of the problem; I found this Kaggle notebook on faster RLE implementations. Next steps:
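As an aside, to make the RLE point above concrete: a minimal sketch of a vectorized run-length encoder for binary masks, in the spirit of the faster implementations discussed in such notebooks. This is illustrative only, not the code used by micro-sam or segment-anything.

```python
# Illustrative sketch of a vectorized RLE encoder for a binary mask (not the
# implementation used by micro-sam or segment-anything).
import numpy as np

def rle_encode(mask: np.ndarray) -> list:
    """Encode a binary mask (H, W) as COCO-style run-length counts,
    starting with the length of the initial run of zeros."""
    flat = mask.flatten(order="F").astype(np.uint8)     # column-major, as pycocotools does
    change = np.flatnonzero(flat[1:] != flat[:-1]) + 1  # indices where the value flips
    boundaries = np.concatenate(([0], change, [flat.size]))
    counts = np.diff(boundaries).tolist()
    if flat[0] == 1:                                    # counts must start with a zero-run
        counts = [0] + counts
    return counts

mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(rle_encode(mask))  # [5, 2, 2, 2, 5]
```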
I would say "partially" ;). We see the runtime decrease from 465.819 to 158.788 when reducing the batch_size from 64 to 16. That's a huge improvement already! But you're of course right that we are not close to the performance on CPU yet, and indeed the reason seems to be that the RLE encoding is also significantly slower on MPS. I am not sure if this is due to memory or just due to some operation used there not being as efficient on MPS yet. (And in general, thanks for running the line profiler; I only skimmed it, but it's quite interesting to see where the runtime is spent!) My suggestion would be:
And re contributions to Segment Anything: I agree that it would be nice to contribute this upstream, but given that they are currently not really responding in the repo, opening a PR there directly is probably not worth it right now. I will try to get in contact with them at some point in the next weeks and see if there is some way to get them to review "sensible" community contributions. (I know the SAM first author a tiny bit, and there's always the alternative of twitter shaming Yann LeCun ;).)
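For reference, a hedged sketch of what the batch-size change discussed above looks like with the upstream segment-anything API (micro-sam duplicates parts of this code, so its own interface may differ); the checkpoint path is an assumption.

```python
# Hedged sketch using the upstream segment-anything API; the checkpoint path is
# an assumption. points_per_batch controls how many point prompts go through the
# mask decoder at once; lowering it from the default of 64 trades throughput for
# lower peak memory, which appears to matter a lot on MPS.
import torch
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

device = "mps" if torch.backends.mps.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device)
amg = SamAutomaticMaskGenerator(sam, points_per_batch=16)  # default is 64
# masks = amg.generate(image)  # image: (H, W, 3) uint8 numpy array
```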
@GenevieveBuckley: I think we have a consistent speed-up with MPS now, after merging #190. Can you confirm? @Marei33, could you also benchmark this on your Mac?
(Let me know if anything is still unclear.)
Here are the benchmarks from the updated dev branch
It's really striking how much faster the embeddings are with a smaller model. I should really consider using a smaller model wherever possible. The advantage is not just the embedding time, but also the time taken to load the model (kinda obvious, but it's a huge chunk of the time taken in the example scripts).

The micro-sam project should probably also consider whether the default model can be changed to a smaller model without a substantial decrease in segmentation quality. That's not entirely straightforward: (a) someone would need to assess the difference in segmentation quality, and (b) there are no ...

Should we make an issue for this discussion? It seems like a good long term goal for the project, and aligned with some of the other current work around adding MobileSAM / FastSAM models too.
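For context, a hedged sketch of comparing just the model-load step across model sizes, using the upstream segment-anything registry; the checkpoint filenames are assumptions and must point at downloaded weights.

```python
# Hedged sketch: time only the model-load step for the two model sizes, to
# illustrate the point about load time above. Checkpoint paths are assumptions.
import time
from segment_anything import sam_model_registry

for model_type, checkpoint in [
    ("vit_h", "sam_vit_h_4b8939.pth"),
    ("vit_b", "sam_vit_b_01ec64.pth"),
]:
    start = time.perf_counter()
    sam = sam_model_registry[model_type](checkpoint=checkpoint)
    print(f"{model_type}: loaded in {time.perf_counter() - start:.1f} s")
```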
Yes on both fronts:
And good to see that we now have a consistent advantage for MPS here. I will wait for @Marei33 to run the benchmarks as well and then close the issue (unless we see something unexpected).
Hi @Marei33,
P.S.: you can copy-paste the text of the table here and it gets displayed as a nice table. (But you can also keep posting screenshots if you prefer ;).)
Yes, sure. Here are the results when nothing else is running. But they look quite similar.
For vit_h I had to run it three times, because there was still the memory error.
Ok, thanks for checking again! @GenevieveBuckley @Marei33, how much RAM do your Macs have?
I will go ahead and close this for now; I think we have done everything that is currently possible for MPS support. We will expose more control over the devices to the users in any case (I will create a separate issue on this).
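A minimal sketch of what that user-facing device control could look like (the helper name `get_device` and its behavior are assumptions, not necessarily micro-sam's actual API): honor an explicit choice, otherwise prefer CUDA, then MPS, then fall back to CPU.

```python
# Hypothetical helper, not micro-sam's actual API: pick an explicit device if the
# user passes one, otherwise prefer CUDA, then MPS, then CPU.
from typing import Optional

import torch

def get_device(device: Optional[str] = None) -> torch.device:
    if device is not None:
        return torch.device(device)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

print(get_device())        # auto-detected
print(get_device("cpu"))   # user override
```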
I have a 16 GB MacBook Pro (from the first M1 generation).
Keeping track of the MPS-related things @GenevieveBuckley and I discovered while working on MPS support:

AMG

..., and in code that we duplicate in order to make using the mask generation interactively more convenient. So whether this PR gets merged or not does not make a real difference for us, but it's still good to keep track of it.