Windows Support #12

Phylliida opened this issue Dec 5, 2023 · 22 comments

Comments

@Phylliida

Phylliida commented Dec 5, 2023

I'm able to compile causal-conv1d by adding

                        "-DWIN32_LEAN_AND_MEAN",

To the nvcc flags.

When compiling mamba, after adding -DWIN32_LEAN_AND_MEAN to nvcc flags, I find I need to add

#ifndef M_LOG2E
#define M_LOG2E 1.4426950408889634074
#endif

To selective_scan_common.h
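As a quick sanity check on that fallback value (a sketch using only the Python standard library, not part of the mamba sources): M_LOG2E is log2(e), which equals 1/ln(2), so the literal can be compared against both.

```python
import math

# The fallback value used in the #define above.
M_LOG2E = 1.4426950408889634074

# log2(e), computed two independent ways.
log2_e = math.log2(math.e)
inv_ln2 = 1.0 / math.log(2.0)

# Both agree with the literal to within double-precision rounding.
assert math.isclose(M_LOG2E, log2_e, rel_tol=1e-12)
assert math.isclose(M_LOG2E, inv_ln2, rel_tol=1e-12)
```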

Then it can get a little further, however it raises the following errors:

Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(493): error C2975: 'kIsEvenLen_': invalid template argument for 'Selective_Scan_bwd_kernel_traits', expected compile-time constant expression
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(26): note: see declaration of 'kIsEvenLen_'
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(521): note: see reference to function template instantiation 'void selective_scan_bwd_launch<32,4,input_t,weight_t>(SSMParamsBwd &,cudaStream_t)' being compiled
        with
        [
            input_t=c10::BFloat16,
            weight_t=c10::complex<float>
        ]
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_bf16_complex.cu(9): note: see reference to function template instantiation 'void selective_scan_bwd_cuda<c10::BFloat16,c10::complex<float>>(SSMParamsBwd &,cudaStream_t)' being compiled
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(493): error C2975: 'kIsVariableB_': invalid template argument for 'Selective_Scan_bwd_kernel_traits', expected compile-time constant expression
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(26): note: see declaration of 'kIsVariableB_'
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(493): error C2975: 'kIsVariableC_': invalid template argument for 'Selective_Scan_bwd_kernel_traits', expected compile-time constant expression
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(26): note: see declaration of 'kIsVariableC_'
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(493): error C2975: 'kDeltaSoftplus_': invalid template argument for 'Selective_Scan_bwd_kernel_traits', expected compile-time constant expression
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(27): note: see declaration of 'kDeltaSoftplus_'
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(493): error C2975: 'kHasZ_': invalid template argument for 'Selective_Scan_bwd_kernel_traits', expected compile-time constant expression
Y:\prog\python\thirdparty\mamba\csrc\selective_scan\selective_scan_bwd_kernel.cuh(27): note: see declaration of 'kHasZ_'

This might be related to this issue, something about the windows compiler being more strict. However the intervention is probably gonna be a little more involved and I haven't had much luck yet

@albertfgu
Contributor

Unfortunately we've never tested windows paths, and it's not on the roadmap right now.

@nat42

nat42 commented Dec 6, 2023

Sorry if this is something you've already checked/covered @Phylliida, but have you checked that you are building the code as C++20? (I'm just guessing, from the way constexpr and lambdas are used, that it'll need that version of the language.)

EDIT: also that comment you link to, that links to a Stack Overflow post appears to be unrelated to either issue thread; it's talking about something completely different (I'd hazard a guess the commenter remembered a #define being useful for array declaration and was sharing it, even though it did not relate to the specific defines you mentioned there)

EDIT2: per https://learn.microsoft.com/en-us/cpp/c-runtime-library/math-constants, it might be better to define _USE_MATH_DEFINES so that constants like M_LOG2E are defined.

EDIT3: actually, it looks like the code was updated a day or so ago to ask that it be compiled with C++17 (not 20 as I had guessed). Maybe check whether you have this recent commit as well? 023c25d

@Phylliida
Author

Phylliida commented Dec 6, 2023

Nice, adding

                        "-D_USE_MATH_DEFINES",

to nvcc flags is a better alternative
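A minimal sketch of how the two defines mentioned in this thread could be added only on Windows, so that other platforms keep their existing flags. The helper name `nvcc_flags_for` and the base flags are hypothetical and not taken from either project's setup.py.

```python
import sys

# Defines reported in this thread as needed when building with MSVC.
WINDOWS_NVCC_DEFINES = ["-DWIN32_LEAN_AND_MEAN", "-D_USE_MATH_DEFINES"]


def nvcc_flags_for(platform, base_flags):
    """Return a copy of base_flags, plus the Windows-only defines on win32."""
    flags = list(base_flags)
    if platform == "win32":
        # Avoid adding a define twice if it is already present.
        flags += [f for f in WINDOWS_NVCC_DEFINES if f not in flags]
    return flags


# Hypothetical base flags, for illustration only.
base = ["-O3", "-std=c++17"]
print(nvcc_flags_for(sys.platform, base))
```

The resulting list would then be passed wherever setup.py currently builds its nvcc argument list.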

Compiling with c++17 isn't enough, I get the errors listed above. Rn I'm trying to get c++20 working, no success yet

Edit: OK, looks like triton is a dependency. I'm trying out wheels prebuilt from here (scroll down to the bottom, extract the windows build, then pip install ___.whl for your version of python). I'm using 3.10 and Cuda 12.1.

@Phylliida
Author

Phylliida commented Dec 6, 2023

Okay, I've successfully run inference on Windows. I'm on python 3.9 and cuda 12.1. I had to do the following things:

(do all of the following in x64 Native Tools Command Prompt for VS 2019)

compile causal-conv1d by adding

                        "-DWIN32_LEAN_AND_MEAN",

To the nvcc flags in setup.py

(you may also need to run

SET DISTUTILS_USE_SDK=1

)

Next, we need to install triton.

Download the triton wheel from here: scroll down to the bottom and download triton-dist windows-latest

extract it then run

pip3 install triton-2.1.0-cp39-cp39-win_amd64.whl

If you have a different version of python and cuda 11.8 you can use one from here instead though I haven't tested that
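To pick the right wheel for a different interpreter, the `cp39-cp39-win_amd64` part of the filename has to match your Python version. The sketch below builds that tag and checks a filename against it. It assumes CPython on 64-bit Windows, and the function names are made up for illustration. pip performs the real compatibility check at install time.

```python
import sys


def cpython_win_tag(version_info=None):
    """Build the 'cpXY-cpXY-win_amd64' tag for a CPython version on 64-bit Windows."""
    major, minor = (version_info or sys.version_info)[:2]
    py = f"cp{major}{minor}"
    return f"{py}-{py}-win_amd64"


def wheel_matches(filename, version_info=None):
    """True if the wheel filename ends with the tag for the given interpreter."""
    return filename.endswith(cpython_win_tag(version_info) + ".whl")


print(cpython_win_tag((3, 9)))
print(wheel_matches("triton-2.1.0-cp39-cp39-win_amd64.whl", (3, 9)))
```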

Next, you need to get the compiled libraries triton needs. You can download them from here, then add the bin directory to your PATH.

If you prefer to compile it yourself you can see the command here, but be warned that it'll take about 1-2 hours.

Finally, I just modified ops/selective_scan_interface.py to:

  1. Remove this line:
import selective_scan_cuda
  2. Replace
def selective_scan_fn(u, delta, A, B, C, D=None, z=None, delta_bias=None, delta_softplus=False,
                     return_last_state=False):
    """if return_last_state is True, returns (out, last_state)
    last_state has shape (batch, dim, dstate). Note that the gradient of the last state is
    not considered in the backward pass.
    """
    return SelectiveScanFn.apply(u, delta, A, B, C, D, z, delta_bias, delta_softplus, return_last_state)

with

def selective_scan_fn(u, delta, A, B, C, D=None, z=None, delta_bias=None, delta_softplus=False,
                     return_last_state=False):
    """if return_last_state is True, returns (out, last_state)
    last_state has shape (batch, dim, dstate). Note that the gradient of the last state is
    not considered in the backward pass.
    """
    return selective_scan_ref(u, delta, A, B, C, D, z, delta_bias, delta_softplus, return_last_state)

It would be better to use the kernel, but until we can get it compiling on Windows we can use the pure-Python reference implementation instead.
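Instead of deleting the import outright, the same fallback could be made conditional, so the compiled kernel is used whenever it loads and the reference path is used otherwise. This is only a sketch: `load_optional_extension` and `choose_scan` are hypothetical helpers, and the two stub functions stand in for the `SelectiveScanFn.apply` and `selective_scan_ref` paths.

```python
import importlib


def load_optional_extension(name):
    """Import a compiled extension by name; return None if it cannot be loaded."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # Covers both "not installed" and Windows "DLL load failed" errors.
        return None


def choose_scan(kernel_module, fused_fn, reference_fn):
    """Use the fused path only when the compiled kernel module is available."""
    return fused_fn if kernel_module is not None else reference_fn


# Placeholders standing in for the fused and reference implementations.
def fused_scan(*args, **kwargs):
    return "fused"


def reference_scan(*args, **kwargs):
    return "reference"


selective_scan_cuda = load_optional_extension("selective_scan_cuda")
selective_scan_fn = choose_scan(selective_scan_cuda, fused_scan, reference_scan)
```

With this shape, the same file works unchanged once the kernels do compile on Windows.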

With this setup I'm able to run inference using the 2.8b model (at fp16 or fp32) on a 3090.

For example:

Prompt:

User: What is the answer to life the universe and everything? Oracle:

Answer:

I don't know. I'm just a computer.

@Grzego

Grzego commented Dec 10, 2023

I think I found a workaround for compiling this package for windows (however, I have not tested the impact on performance). MSVC has a problem with constexpr variables and can't handle passing them to templates as arguments (see this and this). The workaround is to replace constexpr with const static.

diff --git a/csrc/selective_scan/selective_scan_fwd_kernel.cuh b/csrc/selective_scan/selective_scan_fwd_kernel.cuh
index 440a209..b3ef2a8 100644
--- a/csrc/selective_scan/selective_scan_fwd_kernel.cuh
+++ b/csrc/selective_scan/selective_scan_fwd_kernel.cuh
@@ -306,14 +306,14 @@ template<int kNThreads, int kNItems, typename input_t, typename weight_t>
 void selective_scan_fwd_launch(SSMParamsBase &params, cudaStream_t stream) {
     // Only kNRows == 1 is tested for now, which ofc doesn't differ from previously when we had each block
     // processing 1 row.
-    constexpr int kNRows = 1;
+    const static int kNRows = 1;
     BOOL_SWITCH(params.seqlen % (kNThreads * kNItems) == 0, kIsEvenLen, [&] {
         BOOL_SWITCH(params.is_variable_B, kIsVariableB, [&] {
             BOOL_SWITCH(params.is_variable_C, kIsVariableC, [&] {
                 BOOL_SWITCH(params.z_ptr != nullptr , kHasZ, [&] {
                     using Ktraits = Selective_Scan_fwd_kernel_traits<kNThreads, kNItems, kNRows, kIsEvenLen, kIsVariableB, kIsVariableC, kHasZ, input_t, weight_t>;
-                    // constexpr int kSmemSize = Ktraits::kSmemSize;
-                    constexpr int kSmemSize = Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t);
+                    // const static int kSmemSize = Ktraits::kSmemSize;
+                    const static int kSmemSize = Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t);
                     // printf("smem_size = %d\n", kSmemSize);
                     dim3 grid(params.batch, params.dim / kNRows);
                     auto kernel = &selective_scan_fwd_kernel<Ktraits>;
diff --git a/csrc/selective_scan/static_switch.h b/csrc/selective_scan/static_switch.h
index 7920ac0..87493ef 100644
--- a/csrc/selective_scan/static_switch.h
+++ b/csrc/selective_scan/static_switch.h
@@ -16,10 +16,10 @@
 #define BOOL_SWITCH(COND, CONST_NAME, ...)                                           \
     [&] {                                                                            \
         if (COND) {                                                                  \
-            constexpr bool CONST_NAME = true;                                        \
+            const static bool CONST_NAME = true;                                     \
             return __VA_ARGS__();                                                    \
         } else {                                                                     \
-            constexpr bool CONST_NAME = false;                                       \
+            const static bool CONST_NAME = false;                                    \
             return __VA_ARGS__();                                                    \
         }                                                                            \
     }()

With those changes I can compile the package. It seems to work in PyTorch, but like I mentioned, I haven't tested performance or correctness. 😅

@Jacky56

Jacky56 commented Feb 1, 2024

I think I found a workaround for compiling this package for windows (however, I have not tested the impact on performance). MSVC has a problem with constexpr and can't handle passing them to templates as arguments (see this and this). The workaround is to replace constexpr with const static.

Working solution (compiled, but I haven't trained with it).

python 3.11.7
windows 10

@RiceBunny1990

@Phylliida Hello, thanks for your method. But I don't understand what should be added after removing "import selective_scan_cuda". In the class SelectiveScanFn, there are "out, x, *rest = selective_scan_cuda.fwd(u, delta, A, B, C, D, delta_bias, delta_softplus)" and "du, ddelta, dA, dB, dC, dD, ddelta_bias, *rest = selective_scan_cuda.bwd(u, delta, A, B, C, D, delta_bias, dout, x, None, ctx.delta_softplus)" in the forward and backward functions.
Please help me.

@Grzego

Grzego commented Feb 3, 2024

@RiceBunny1990 You can skip any modifications to ops/selective_scan_interface.py once you successfully compile the mamba kernels on Windows, which should be possible after making the changes I posted previously.

@F286

F286 commented Feb 4, 2024

Is there a simple way to get the training and inference (without recompiling the CUDA kernels) working on Windows without using WSL?

@lyhyl

lyhyl commented Feb 5, 2024

@Phylliida @Grzego Thank you for the information. I have compiled causal_conv1d 1.1.3.post1 and mamba 1.1.3.post1 successfully with python 3.10 + windows 11 x64 + torch 2.2 + cuda 12.1. However, when I try to import mamba, it crashes on import causal_conv1d_cuda, which gives:

ImportError: DLL load failed while importing causal_conv1d_cuda: The specified module could not be found.

I have checked the dependencies of causal_conv1d_cuda.cp310-win_amd64.pyd (AFAIK a pyd is a dll on Windows), and all of them exist.
Any idea what causes it to fail?
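One thing worth ruling out (an assumption, not something confirmed in this thread): since Python 3.8, Windows no longer searches PATH when resolving the DLLs an extension module depends on, so a dependency can exist on disk and still not be found. Directories have to be registered with `os.add_dll_directory`. The helper below is a hypothetical sketch of doing that before the import.

```python
import importlib
import os
import sys

# Keep the returned handles alive so the directories stay registered.
_dll_handles = []


def import_with_dll_dirs(module_name, dll_dirs=()):
    """Register extra DLL search directories (Windows only), then import the module."""
    if sys.platform == "win32":
        for d in dll_dirs:
            d = os.path.abspath(d)  # add_dll_directory requires an absolute path
            if os.path.isdir(d):
                _dll_handles.append(os.add_dll_directory(d))
    return importlib.import_module(module_name)


# Works on any platform; on non-Windows systems it is a plain import.
mod = import_with_dll_dirs("json", dll_dirs=["."])
print(mod.__name__)
```

For the extension in question, the directories to try would be the ones holding the CUDA runtime and torch DLLs. Importing torch before the extension is another thing to try, since the extension links against torch's libraries.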

@megumitagaki

Thank you for your information. I have successfully compiled causal_conv1d 1.1.3.post1 and mamba 1.1.3.post1 with python 3.10 + windows 11 x64 + torch 2.2 + cuda 12.1. However, when I try to import mamba, it crashes on import causal_conv1d_cuda, giving:

ImportError: DLL load failed while importing causal_conv1d_cuda: The specified module could not be found.

I have checked the dependencies of causal_conv1d_cuda.cp310-win_amd64.pyd (AFAIK a pyd is a dll on Windows), and all of its dependencies exist. Any idea what causes it to fail?

Hello. Have you solved this problem?

@d8ahazard

I'm wondering if the import error is because somewhere in the py code, it's specifically looking for a .so instead of a .dll?

Haven't gotten to trying to compile this yet, still working on causal_conv1d. :P

@FurkanGozukara

Hello, is there any precompiled wheel for this? I need one for python 3.10, please. Thank you.

This is now required because the newly published Zyphra/Zonos depends on this library.

@d8ahazard

d8ahazard commented Feb 11, 2025 via email

@auto1111fan

Hello, is there any precompiled wheel for this? I need one for python 3.10, please. Thank you.

This is now required because the newly published Zyphra/Zonos depends on this library.

https://github.com/d8ahazard/AudioLab/releases/tag/1.0.0

And forks with fixes here: https://github.com/d8ahazard/mamba
https://github.com/d8ahazard/causal-conv1d

Still can't get inference to run, getting issues with the triton version I found. Everything is compiled for CU124, I feel like we're close here.

@FurkanGozukara

I have wheels in the releases section of my audiolab project. Mileage may vary.


Awesome, I hope it works with

python 3.10, triton 3.2 - https://github.com/woct0rdho/triton-windows/releases/tag/v3.2.0-windows.post9

cuda 12.4

ty so much @d8ahazard

@d8ahazard

Update: Submitted pull requests to this repo and causal-conv1d to add windows support.

Dao-AILab/causal-conv1d#46
#692

@FurkanGozukara

Update: Submitted pull requests to this repo and causal-conv1d to add windows support.

Dao-AILab/causal-conv1d#46 #692

Legend

@BBC-Esq

BBC-Esq commented Feb 12, 2025

Awesome, thanks, just came across this thread and it's what I was looking for. I'm running python 3.11, upgraded to cuda 12.6, using torch 2.6, any wheels for that? I'm familiar with the dependencies somewhat, let me know if I can help. See example here:

woct0rdho/triton-windows#43

@SPOOKEXE

+1 Hope Windows gets full support, wanting to experiment :)

@wakinpang

Update: Submitted pull requests to this repo and causal-conv1d to add windows support.

Dao-AILab/causal-conv1d#46 #692

true legend
