Windows Support #12
Unfortunately we've never tested windows paths, and it's not on the roadmap right now. |
Sorry if this is something you've already checked/covered @Phylliida, but have you checked that you are building the code as C++20? (I'm just guessing from the way constexpr and lambdas are used that it'll need that version of the language.)
EDIT: also, the comment you link to, which links to a Stack Overflow post, appears to be unrelated to either issue thread; it's talking about something completely different. (I'd hazard a guess the commenter remembered a #define being useful for array declaration and was sharing it, even though it did not relate to the specific defines you mentioned there.)
EDIT2: per https://learn.microsoft.com/en-us/cpp/c-runtime-library/math-constants it might be better to define _USE_MATH_DEFINES so that constants like M_LOG2E are defined.
EDIT3: actually, it looks like the code was updated a day or so ago to ask that it be compiled with C++17 (not 20 as I had guessed). Maybe check whether you have this recent commit too? 023c25d |
Nice, adding
to nvcc flags is a better alternative. Compiling with C++17 isn't enough; I still get the errors listed above. Right now I'm trying to get C++20 working, no success yet.
Edit: OK, looks like triton is a dependency. I'm trying out wheels prebuilt from here (scroll down to the bottom, extract the Windows build, then pip install ___.whl for your version of Python). I'm using 3.10 and CUDA 12.1. |
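Since the prebuilt wheels above have to match your Python version, here is a minimal sketch of checking a wheel's filename tags against the running interpreter before installing it. The helper names and the example filename are made up for illustration; the only thing assumed is the standard wheel naming scheme, where the last three dash-separated fields are the Python, ABI, and platform tags.

```python
import sys
from typing import Tuple


def interpreter_tag() -> str:
    """Return the CPython tag for the running interpreter, e.g. 'cp310'."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def wheel_tags(filename: str) -> Tuple[str, str, str]:
    """Split a wheel filename into (python tag, ABI tag, platform tag).

    Wheel names end in '-{python}-{abi}-{platform}.whl', so the last three
    dash-separated fields are the tags even when a build tag is present.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    python_tag, abi_tag, platform_tag = filename[: -len(".whl")].split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def matches_interpreter(filename: str) -> bool:
    """True if the wheel's python tag names the running CPython version."""
    python_tag, _, _ = wheel_tags(filename)
    # A python tag may list several versions separated by dots.
    return interpreter_tag() in python_tag.split(".")


# Hypothetical filename, used only to show the shape of the tags.
print(wheel_tags("triton-2.1.0-cp310-cp310-win_amd64.whl"))
```

This only compares the Python tag; it does not say anything about which CUDA or torch version the wheel was built against, which still has to be read from the release notes.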
Okay, I've successfully run inference on Windows, with Python 3.9 and CUDA 12.1. I had to do the following (all in the x64 Native Tools Command Prompt for VS 2019). First, compile causal-conv1d by adding
to the nvcc flags in setup.py (you may also need to run
). Next, install triton: download the triton wheel from here (scroll down to the bottom and download triton-dist windows-latest), extract it, then run
If you have a different version of Python and CUDA 11.8, you can use one from here instead, though I haven't tested that. Next, you need the compiled libraries triton needs. You can download them from here and add the bin directory to your PATH. If you prefer to compile them yourself, you can see the command here, but be wary: it'll take about 1-2 hours. Finally, I just modified
with
It would be better to use the kernel, but until we can get it compiling on Windows, we can use the reference implementation in pure Python instead. With this setup I'm able to run inference using the 2.8b model (at fp16 or fp32) on a 3090. For example: Prompt:
Answer:
|
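The "use the reference implementation until the kernel compiles" idea can be sketched as an optional import with a fallback. This is not the actual modification made in the comment above; `load_optional`, `make_scan`, and `toy_scan_ref` are hypothetical names, and the toy recurrence is only a stand-in for a real pure-Python reference. The module name `selective_scan_cuda` and its `fwd` entry point are the ones mentioned elsewhere in this thread.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Covers a missing module as well as an extension that fails to load.
        return None


def toy_scan_ref(xs, decay):
    """Toy pure-Python recurrence h_t = decay * h_{t-1} + x_t (stand-in only)."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def make_scan(kernel_module, reference_fn):
    """Pick the compiled kernel when it imported, else the reference function."""
    if kernel_module is None:
        return reference_fn
    return kernel_module.fwd


# On a machine without the compiled extension this resolves to the reference.
selective_scan_cuda = load_optional("selective_scan_cuda")
backend = "reference" if selective_scan_cuda is None else "compiled kernel"
print("selective scan backend:", backend)
```

The design choice here is to decide once at import time, so the rest of the code does not need to know which path is in use; the reference path will be slower than a compiled kernel.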
I think I found a workaround for compiling this package for Windows (however, I have not tested the impact on performance). MSVC has a problem with
diff --git a/csrc/selective_scan/selective_scan_fwd_kernel.cuh b/csrc/selective_scan/selective_scan_fwd_kernel.cuh
index 440a209..b3ef2a8 100644
--- a/csrc/selective_scan/selective_scan_fwd_kernel.cuh
+++ b/csrc/selective_scan/selective_scan_fwd_kernel.cuh
@@ -306,14 +306,14 @@ template<int kNThreads, int kNItems, typename input_t, typename weight_t>
void selective_scan_fwd_launch(SSMParamsBase &params, cudaStream_t stream) {
// Only kNRows == 1 is tested for now, which ofc doesn't differ from previously when we had each block
// processing 1 row.
- constexpr int kNRows = 1;
+ const static int kNRows = 1;
BOOL_SWITCH(params.seqlen % (kNThreads * kNItems) == 0, kIsEvenLen, [&] {
BOOL_SWITCH(params.is_variable_B, kIsVariableB, [&] {
BOOL_SWITCH(params.is_variable_C, kIsVariableC, [&] {
BOOL_SWITCH(params.z_ptr != nullptr , kHasZ, [&] {
using Ktraits = Selective_Scan_fwd_kernel_traits<kNThreads, kNItems, kNRows, kIsEvenLen, kIsVariableB, kIsVariableC, kHasZ, input_t, weight_t>;
- // constexpr int kSmemSize = Ktraits::kSmemSize;
- constexpr int kSmemSize = Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t);
+ // const static int kSmemSize = Ktraits::kSmemSize;
+ const static int kSmemSize = Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t);
// printf("smem_size = %d\n", kSmemSize);
dim3 grid(params.batch, params.dim / kNRows);
auto kernel = &selective_scan_fwd_kernel<Ktraits>;
diff --git a/csrc/selective_scan/static_switch.h b/csrc/selective_scan/static_switch.h
index 7920ac0..87493ef 100644
--- a/csrc/selective_scan/static_switch.h
+++ b/csrc/selective_scan/static_switch.h
@@ -16,10 +16,10 @@
#define BOOL_SWITCH(COND, CONST_NAME, ...) \
[&] { \
if (COND) { \
- constexpr bool CONST_NAME = true; \
+ const static bool CONST_NAME = true; \
return __VA_ARGS__(); \
} else { \
- constexpr bool CONST_NAME = false; \
+ const static bool CONST_NAME = false; \
return __VA_ARGS__(); \
} \
}()
With those changes I can compile the package. It seems to work in PyTorch, but like I mentioned, I haven't tested performance or correctness. 😅 |
working solution. (compiled but haven't trained) python 3.11.7 |
@Phylliida Hello, thanks for your method. But I don't understand what should be added after removing "import selective_scan_cuda". In the class SelectiveScanFn, there are "out, x, *rest = selective_scan_cuda.fwd(u, delta, A, B, C, D, delta_bias, delta_softplus)" and "du, ddelta, dA, dB, dC, dD, ddelta_bias, *rest = selective_scan_cuda.bwd(u, delta, A, B, C, D, delta_bias, dout, x, None, ctx.delta_softplus, )" in the forward and backward functions. |
@RiceBunny1990 You can skip any modifications to |
Is there a simple way to get the training and inference (without recompiling the CUDA kernels) working on Windows without using WSL? |
@Phylliida @Grzego Thank you for your information. I have compiled causal_conv1d 1.1.3.post1 and mamba 1.1.3.post1 successfully with Python 3.10 + Windows 11 x64 + torch 2.2 + CUDA 12.1. However, when I try to import mamba, it crashes on
I have checked |
I'm wondering if the import error is because somewhere in the py code, it's specifically looking for a .so instead of a .dll? Haven't gotten to trying to compile this yet, still working on causal_conv1d. :P |
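One way to look into the .so versus .dll question is to ask Python which extension suffixes it accepts and which file a module name resolves to. On Windows, CPython loads compiled extension modules from `.pyd` files, so that is the suffix to look for in the build output. A minimal sketch using only the standard library (`describe_module` is a hypothetical helper name):

```python
import importlib.machinery
import importlib.util
import sys

# Suffixes this interpreter accepts for compiled extension modules.
suffixes = importlib.machinery.EXTENSION_SUFFIXES
print(sys.platform, suffixes)


def describe_module(name):
    """Return the file a module would load from, or None if it is not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return None if spec is None else spec.origin


# Module name taken from this thread; prints None where it is not installed.
print(describe_module("selective_scan_cuda"))
```

If the module resolves to a file with an accepted suffix and the import still fails, the cause is more likely a missing dependent DLL than the file extension, though that is a guess rather than something confirmed in this thread.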
Hello, is there any precompiled wheel for this? I need one for Python 3.10, please. Thank you. This is now required because the newly published Zyphra/Zonos depends on this library. |
I have wheels in the releases section of my audiolab project. Mileage may vary.
|
https://github.com/d8ahazard/AudioLab/releases/tag/1.0.0 And forks with fixes here: https://github.com/d8ahazard/mamba Still can't get inference to run, getting issues with the triton version I found. Everything is compiled for CU124, I feel like we're close here. |
awesome i hope works with python 3.10, triton 3.2 - https://github.com/woct0rdho/triton-windows/releases/tag/v3.2.0-windows.post9 cuda 12.4 ty so much @d8ahazard |
Update: Submitted pull requests to this repo and causal-conv1d to add windows support. |
Legend |
Awesome, thanks, just came across this thread and it's what I was looking for. I'm running python 3.11, upgraded to cuda 12.6, using torch 2.6, any wheels for that? I'm familiar with the dependencies somewhat, let me know if I can help. See example here: |
+1 Hope Windows gets full support, wanting to experiment :) |
true legend |
I'm able to compile causal-conv1d by adding
to the nvcc flags.
When compiling mamba, after adding
-DWIN32_LEAN_AND_MEAN
to the nvcc flags, I find I need to add
to selective_scan_common.h.
Then it gets a little further, but it raises the following errors:
This might be related to this issue, something about the Windows compiler being more strict. However, the fix is probably going to be a little more involved, and I haven't had much luck yet.
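For anyone editing setup.py, one way to add the Windows-only define without affecting Linux builds is to append it conditionally. A minimal sketch, assuming the flags live in a plain Python list; `nvcc_flags_for` and the base flag list are hypothetical and not taken from the project's actual setup.py:

```python
import sys


def nvcc_flags_for(platform, base_flags):
    """Return a copy of base_flags with Windows-only defines appended."""
    flags = list(base_flags)
    if platform == "win32":
        # The define mentioned above for compiling mamba on Windows.
        flags.append("-DWIN32_LEAN_AND_MEAN")
    return flags


# Placeholder base flags for illustration only.
base = ["-O3", "-std=c++17"]
print(nvcc_flags_for(sys.platform, base))
```

Passing the platform as an argument rather than reading `sys.platform` inside the function makes the Windows branch easy to check from any machine.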