ci: Properly install rocwmma for hip builds #16305
We require rocwmma to be properly installed for #16221; see #16221 (comment). Unfortunately I don't have a Windows environment to work in, and I have been having a hard time getting the rocwmma build system to work on the Windows CI; see cb59db2 for the attempt. |
rocwmma is also used in the binary releases in |
ROCm 7.0 has been released. Disabling rocwmma on Windows will reduce performance, yes, but the way rocwmma was used on Windows previously was just plain wrong: you can't use the library from its source directory without installing it, even though it is implemented in headers. |
Ok, so I guess the question for the binary releases is whether using ROCM 7 or rocwmma is more important. |
Ideally someone with a Windows machine would come and fix the install, but as I say, I don't have such a machine. |
The changes in PR #16221 are what make llama.cpp build against ROCm 7 at all, and they break the Windows CI (because the rocwmma installation there is incomplete). So this is not just about the binary releases. |
Ok, but we still cannot break or degrade the performance of the windows binaries just for the sake of supporting ROCM 7. This needs to be fixed. |
I find this a bit difficult. I explicitly merged rocwmma support with it disabled by default, which it still is, as I knew that supporting it across architectures and OSes is quite difficult because it is very unstable. That has often proven true: RDNA4 is broken on rocwmma < 2 and > 1.4, CDNA is broken on rocwmma 2.0, and rocwmma has no official Windows support at all, being excluded from the Windows distributions of ROCm. Others then decided to enable rocwmma on CI; I would not have done so. |
I will give it another try, but ultimately this should not block #16221 |
I have been giving it a try, but it seems more trouble than it is worth. Considering that this is a header-only library and the only thing that the CMake script does (for us) is generate The rocwmma version seems to be solely stored in $env:ROCWMMA_VERSION = if ((Get-Content -Path "CMakeLists.txt" -Raw) -match 'set \( VERSION_STRING\s+"?([0-9.]+)[^0-9.]+') { $matches[1] } |
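For readers who do not use PowerShell, the version extraction in the one-liner above can be sketched in Python. This is only an illustration of the same regular expression applied to the text of CMakeLists.txt; the function name `extract_rocwmma_version` and the sample input are hypothetical and not part of the PR.

```python
import re
from typing import Optional

# Same pattern as the PowerShell one-liner in the comment above.
# PowerShell's -match is case-insensitive by default, so IGNORECASE
# is set here to mirror that behaviour.
VERSION_PATTERN = re.compile(
    r'set \( VERSION_STRING\s+"?([0-9.]+)[^0-9.]+',
    re.IGNORECASE,
)


def extract_rocwmma_version(cmake_text: str) -> Optional[str]:
    """Return the version captured by the pattern, or None if absent.

    Hypothetical helper: the name and signature are assumptions made
    for this sketch, not something defined by rocwmma or llama.cpp.
    """
    match = VERSION_PATTERN.search(cmake_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Made-up sample line; a real CMakeLists.txt may be laid out differently.
    sample = 'set ( VERSION_STRING "2.0.0" )\n'
    print(extract_rocwmma_version(sample))
```

Returning None when nothing matches makes a missing or renamed variable visible to the caller, whereas the PowerShell form silently leaves the environment variable empty.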
I don't really like this hack much at all. Even though it solves the immediate problem, ultimately what we are doing here on CI is just plain wrong and there is no guarantee that problem remains confined to the missing IMO we should just accept that rocwmma is not supported on Windows by AMD at the moment, disable it for now, and then petition AMD to support rocwmma on Windows and add it to the ROCm distribution. |
IMO we cannot make suboptimal releases for Windows just because it is not convenient, because otherwise the result will be that people will be forced to look for the releases elsewhere. It would fracture the community and damage the reputation of llama.cpp. I don't think it is too concerning to add this hack to the HIP backend, and in any case my understanding is that @JohannesGaessler plans to replace WMMA entirely with mma, so at worst it would be temporary. |
I have already purchased and installed an RX 9060 XT, I've also ordered a MI100 that is set to arrive in a few weeks. I intend to add support for the AMD "WMMA" instructions to the kernel in On CDNA (MI100) there are "MFMA" instructions which work a bit differently but the end result should be the same as with the AMD WMMA instructions. So yes, if you give me a few weeks I will likely be able to make it so that rocWMMA can be removed entirely. The WMMA kernel will still be needed for V100s, I'm in the process of purchasing one of those for development. I don't think it would make sense to maintain the WMMA kernel only for MUSA. |
@JohannesGaessler neat that you also acquired an MI100, I look forward to collaborating on that :) |
@slaren OK, how about this (see latest change): let's just unpack the Ubuntu .deb package on Windows; that way we get a pre-built rocwmma with the correct version header on the Windows CI. That's still a hack, but at least this way the hack is contained to just the CI. |
I will still petition AMD to support rocwmma on Windows |
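As background on why unpacking a .deb on Windows is feasible at all: a .deb file is an `ar` archive whose `data.tar.*` member holds the installed file tree, so no Debian tooling is strictly needed. The sketch below shows the idea in Python; it is not the CI step from this PR (which may well use a different tool), and the helper names `iter_ar_members` and `extract_deb_payload` are hypothetical. It assumes the payload uses a compression that Python's `tarfile` can open (for example gzip or xz); a zstd-compressed payload may need a newer Python or an external tool.

```python
import io
import tarfile
from typing import Iterator, List, Tuple

AR_MAGIC = b"!<arch>\n"
AR_HEADER_SIZE = 60  # fixed-size header in front of every ar member


def iter_ar_members(blob: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, payload) for each member of an in-memory ar archive."""
    if not blob.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    pos = len(AR_MAGIC)
    while pos + AR_HEADER_SIZE <= len(blob):
        header = blob[pos:pos + AR_HEADER_SIZE]
        # Bytes 0-15 hold the name (GNU ar appends '/'); bytes 48-57 the size.
        name = header[:16].decode("ascii").strip().rstrip("/")
        size = int(header[48:58].decode("ascii").strip())
        start = pos + AR_HEADER_SIZE
        yield name, blob[start:start + size]
        # Member data is padded to an even length.
        pos = start + size + (size % 2)


def extract_deb_payload(deb_bytes: bytes, dest: str) -> List[str]:
    """Extract the data.tar.* member of a .deb into dest; return member names."""
    for name, payload in iter_ar_members(deb_bytes):
        if name.startswith("data.tar"):
            with tarfile.open(fileobj=io.BytesIO(payload), mode="r:*") as tar:
                names = tar.getnames()
                tar.extractall(dest)
                return names
    raise ValueError("no data.tar member found")
```

Only run something like this on packages from a trusted source, since extracting an archive writes whatever paths the archive contains.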
The ubuntu packages are not likely to be updated. The reason we are using the |
@slaren that is incorrect, the version previously used on Windows on CI was We can use rocwmma from ROCm 7.0, which is released and which does include support for Strix Halo and gfx12, but which is broken on CDNA. |
That is the CI, but the releases are built separately in |
If that's available for Windows that could work, but I don't see it in https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html |
No version of rocwmma is available for Windows; it's not supported on Windows at all. However, we can use the Ubuntu 7.0 package. |
on windows we now install rocwmma from ubuntu packages
@slaren I believe it's in a state ready to be merged now. |
You have to test on your own fork's master branch. |
That worked, so the release workflow is also tested now. |
rocwmma is also used in the docker container llama.cpp/.devops/rocm.Dockerfile Line 39 in 2a9b633
|
OK, this should be solvable by simply updating the ROCm version we build the Docker image against to 7.0. In the future, if we are completely unwilling to accept regressions in released versions, we cannot fly this close to the sun. You were building for an architecture not supported by the ROCm version used (6.4 + gfx1152), using the development branch of a notoriously unstable library (not even pinned to a commit) that is fairly tightly coupled to the underlying platform, on an OS that this library is not supported on. That's just way too close to the sun to then insist on stability. |
Did you mean gfx1151? As far as I can tell, that is officially supported by ROCm on Windows. Generally, it is the responsibility of whoever wants to merge a change to fix anything broken by that change. I know it can be frustrating to have to fix things that you don't care about, but we can't just leave things knowingly in a broken state. |
It is not for the Linux Docker build, where we also build for it. It's not about fixing things that I don't care about; it's about maintaining unsupported combinations of dependencies. On CUDA this would not even be a question. If CUDA doesn't support X on Y, we would not go around building development branches of things (since it's closed source) to make it work. While we can do that sort of thing with ROCm, we absolutely should not if the goal is being stable and regression free. It's also about the understanding that the rocwmma code was merged under the condition that I accept maintaining it, which I in turn accepted only with it disabled by default, and then others going around that decision by enabling it on CI. Which is fine as long as we keep the understanding that rocwmma working or not shall not be a release blocker. |
The Docker build works fine when built against 7.0. Note that I disabled the CDNA targets for now, as we have a chicken-and-egg problem here with the other PR. After the other PR gets merged I will open another PR to re-enable the CDNA targets. |