[Issue] Triton Compilation Error in Unsloth Fine-Tuning Script on Kernel 5.4.0 #1336

Open
gityeop opened this issue Nov 25, 2024 · 14 comments
Labels
fixed - pending confirmation Fixed, waiting for confirmation from poster

Comments

@gityeop

gityeop commented Nov 25, 2024

Description

When running the Unsloth fine-tuning script, I encounter a Triton compilation error: an assertion failure in ReduceOpToLLVM.cpp.

Error Message

python /data/ephemeral/home/unsloth_example.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.9: Fast Llama patching. Transformers = 4.46.3.
   \\   /|    GPU: Tesla V100-SXM2-32GB. Max memory: 31.739 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth 2024.11.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 210,289 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040
  0%|                                               | 0/60 [00:00<?, ?it/s]
python: /project/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp:31: virtual mlir::LogicalResult {anonymous}::ReduceOpConversion::matchAndRewrite(mlir::triton::ReduceOp, mlir::ConvertOpToLLVMPattern<mlir::triton::ReduceOp>::OpAdaptor, mlir::ConversionPatternRewriter&) const: Assertion `helper.isSupportedLayout() && "Unexpected srcLayout in ReduceOpConversion"' failed.
Aborted (core dumped)

System Information

  • OS Kernel: 5.4.0-99-generic
  • GPU: Tesla V100-SXM2-32GB
  • CUDA Version: 12.2
  • Driver Version: 535.161.08
  • GPU Memory: 32GB

Code

from unsloth import FastLanguageModel 
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates

Additional Context

  • The error occurs during model initialization
  • Kernel version (5.4.0) is below the recommended minimum of 5.5.0
  • Using Unsloth's pre-quantized 4-bit model
  • Attempting to run on a single GPU setup

Steps to Reproduce

  1. Set up conda environment with PyTorch and CUDA
  2. Install Unsloth
  3. Run the example script for fine-tuning
  4. Error occurs during model initialization phase

Questions

  1. Is this error related to the kernel version being below the recommended minimum (5.4.0 < 5.5.0)?
  2. Are there any specific version requirements or compatibility issues with Triton that need to be addressed?
  3. Are there any workarounds available for systems that cannot upgrade their kernel version?
@chengju-zhou

Same issue on V100, but it works fine on T4.

@ergosumdre

I also have a V100 and I'm getting this error too.

@LiaoPan

LiaoPan commented Nov 26, 2024

I also encountered the same error on v100.

@LiaoPan

LiaoPan commented Nov 26, 2024

I also encountered the same error on v100.

Temporary solution:

Changing the Triton version may help, though it raises some warnings.
$ pip install triton==2.3.0

Waiting for a final solution.

@hykilpikonna

I also encountered the same error on v100.

Temporary solution:

Perhaps change the version of triton, but it will raise some warnings.
$ pip install triton==2.3.0

wait for the final solution.

Which torch version did you use? It seems that torch 2.5.1 isn't compatible with that Triton version:

  File "unsloth_env/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 195, in get_system
    from triton.compiler.compiler import triton_key
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (unsloth_env/lib/python3.11/site-packages/triton/compiler/compiler.py)
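
One way to spot this kind of torch/Triton mismatch before running anything is to compare the installed Triton version with the one the installed torch distribution declares as a dependency. The sketch below uses only the standard library's `importlib.metadata`. The helper names are hypothetical, and it assumes the torch package ships a `triton==<version>` requirement in its metadata, which may not hold for every build (conda packages, for example).

```python
import re
from importlib import metadata


def pinned_version(requirements, package):
    """Return the '=='-pinned version of `package` from requirement strings, or None."""
    pattern = re.compile(
        rf"^{re.escape(package)}\s*\(?\s*==\s*([^\s;,)]+)", re.IGNORECASE
    )
    for requirement in requirements or []:
        match = pattern.match(requirement.strip())
        if match:
            return match.group(1)
    return None


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def triton_report():
    """Collect the installed torch/triton versions and torch's declared triton pin."""
    try:
        torch_requires = metadata.requires("torch")
    except metadata.PackageNotFoundError:
        torch_requires = None
    return {
        "torch": installed_version("torch"),
        "triton_installed": installed_version("triton"),
        "triton_pinned_by_torch": pinned_version(torch_requires, "triton"),
    }


print(triton_report())
```

If `triton_installed` and `triton_pinned_by_torch` differ, the environment has a Triton version that torch was not built against, which is consistent with import errors like the one above.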

@LiaoPan

LiaoPan commented Nov 26, 2024

@hykilpikonna PyTorch '2.4.0+cu121'

@danielhanchen
Contributor

Apologies everyone! @LiaoPan @hykilpikonna @ergosumdre @gityeop @chengju-zhou I added a flag to disable some other kernels, though I'm unsure whether it worked.

Torch 2.5 and Torch 2.4 should now be supported. Sadly, Colab got rid of V100s, so I can't test on them. I'm assuming a specific kernel from Apple's Cut Cross Entropy package is the one causing the issues.

Please try updating Unsloth without dependencies and see if that works!

pip uninstall unsloth unsloth-zoo
pip install --upgrade --no-cache-dir --no-deps unsloth unsloth-zoo

@danielhanchen
Contributor

By the way, to get Torch 2.4, simply run the following to get the optimal installation command:

wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

@danielhanchen danielhanchen added the fixed - pending confirmation Fixed, waiting for confirmation from poster label Nov 26, 2024
@ergosumdre

Confirmed working. Thanks!


@hykilpikonna

hykilpikonna commented Nov 26, 2024

Please try updating Unsloth without dependencies if that works!

pip uninstall unsloth unsloth-zoo
pip install --upgrade --no-cache-dir --no-deps unsloth unsloth-zoo

Hmm... unsloth 2024.11.11 + unsloth-zoo 2024.11.8 is showing the same error for me.

[INFO|trainer.py:698] 2024-11-27 03:22:10,567 >> Using auto half precision backend
[WARNING|<string>:208] 2024-11-27 03:22:10,856 >> ==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 5,700 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 1,068
 "-____-"     Number of trainable parameters = 21,472,256
  0%|                                               | 0/1068 [00:00<?, ?it/s]
python3.11: /project/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp:31: virtual mlir::LogicalResult {anonymous}::ReduceOpConversion::matchAndRewrite(mlir::triton::ReduceOp, mlir::ConvertOpToLLVMPattern<mlir::triton::ReduceOp>::OpAdaptor, mlir::ConversionPatternRewriter&) const: Assertion `helper.isSupportedLayout() && "Unexpected srcLayout in ReduceOpConversion"' failed.

Install log:

Wed 11-27 03:21 Camellia 🐱 /d/sekai/llm/unsloth
> pip install --upgrade --no-cache-dir --no-deps unsloth unsloth-zoo
Looking in indexes: https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
Collecting unsloth
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/5a/e2/b4a6f88c6cb6a5293f0c3111e2736149236c3062d2f3bc61ad60c7549a98/unsloth-2024.11.11-py3-none-any.whl (167 kB)
Collecting unsloth-zoo
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/bb/a7/b142491c673140d1d62c28d4720d85b94d9577d7d6fb8534a4e6bb5e1909/unsloth_zoo-2024.11.8-py3-none-any.whl (59 kB)
Installing collected packages: unsloth-zoo, unsloth
Successfully installed unsloth-2024.11.11 unsloth-zoo-2024.11.8

pip freeze:

accelerate==1.0.1
aiofiles==23.2.1
aiohappyeyeballs==2.4.3
aiohttp==3.11.7
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.6.2.post1
attrs==24.2.0
av==13.1.0
bitsandbytes==0.44.1
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
contourpy==1.3.1
cut-cross-entropy==24.11.4
cycler==0.12.1
datasets==3.1.0
dill==0.3.8
docstring_parser==0.16
einops==0.8.0
fastapi==0.115.5
ffmpy==0.4.0
filelock @ file:///home/conda/feedstock_root/build_artifacts/filelock_1726613473834/work
fire==0.7.0
fonttools==4.55.0
frozenlist==1.5.0
fsspec==2024.9.0
gmpy2 @ file:///home/conda/feedstock_root/build_artifacts/gmpy2_1725379831219/work
gradio==4.44.1
gradio_client==1.3.0
h11==0.14.0
hf_transfer==0.1.8
httpcore==1.0.7
httpx==0.27.2
huggingface-hub==0.26.2
idna==3.10
importlib_resources==6.4.5
jieba==0.42.1
Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1715127149914/work
joblib==1.4.2
kiwisolver==1.4.7
-e git+https://github.com/hiyouga/LLaMA-Factory.git@00031b1a66ade1c2665ce7a069a756cccbcb07f1#egg=llamafactory
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
mdurl==0.1.2
modelscope==1.20.1
mpmath @ file:///home/conda/feedstock_root/build_artifacts/mpmath_1678228039184/work
multidict==6.1.0
multiprocess==0.70.16
networkx @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_networkx_1731521053/work
nltk==3.9.1
numpy==1.26.4
orjson==3.10.12
packaging==24.2
pandas==2.2.3
peft==0.12.0
pillow==10.4.0
propcache==0.2.0
protobuf==3.20.3
psutil==6.1.0
pyarrow==18.1.0
pydantic==2.10.1
pydantic_core==2.27.1
pydub==0.25.1
Pygments==2.18.0
pyparsing==3.2.0
python-dateutil==2.9.0.post0
python-multipart==0.0.17
pytz==2024.2
PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1725456139051/work
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rouge-chinese==1.0.3
ruff==0.8.0
safetensors==0.4.5
scipy==1.14.1
semantic-version==2.10.0
sentencepiece==0.2.0
shellingham==1.5.4
shtab==1.7.1
six==1.16.0
sniffio==1.3.1
sse-starlette==2.1.3
starlette==0.41.3
sympy==1.13.1
termcolor==2.5.0
tiktoken==0.8.0
tokenizers==0.20.3
tomlkit==0.12.0
torch==2.5.1
tqdm==4.67.1
transformers==4.46.1
triton==3.1.0
trl==0.8.6
typeguard==4.4.1
typer==0.13.1
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1717802530399/work
tyro==0.8.14
tzdata==2024.2
unsloth==2024.11.11
unsloth_zoo==2024.11.8
urllib3==2.2.3
uvicorn==0.32.1
websockets==12.0
xformers==0.0.28.post3
xxhash==3.5.0
yarl==1.18.0
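
Most of the listing above is unrelated to the failure. A short sketch for pulling out only the packages discussed in this thread is shown below. The `RELEVANT` tuple and the helper names are illustrative assumptions, and the sample reuses a few lines from the listing.

```python
# Packages mentioned in this thread; the selection is an assumption, adjust as needed.
RELEVANT = (
    "torch", "triton", "xformers", "unsloth", "unsloth_zoo",
    "cut-cross-entropy", "bitsandbytes", "transformers", "trl", "peft",
)


def normalise(name):
    """Lower-case a package name and treat '_' and '-' as equivalent."""
    return name.lower().replace("_", "-")


def pick_versions(freeze_text, wanted=RELEVANT):
    """Return {normalised name: version} for 'name==version' lines naming a wanted package."""
    wanted_names = {normalise(name) for name in wanted}
    found = {}
    for line in freeze_text.splitlines():
        name, separator, version = line.strip().partition("==")
        # Lines such as 'pkg @ file:///...' or '-e git+...' contain no '==' and are skipped.
        if separator and normalise(name) in wanted_names:
            found[normalise(name)] = version
    return found


sample = """\
cut-cross-entropy==24.11.4
filelock @ file:///home/conda/feedstock_root/build_artifacts/filelock_1726613473834/work
torch==2.5.1
triton==3.1.0
unsloth==2024.11.11
unsloth_zoo==2024.11.8
xformers==0.0.28.post3
"""

for package, version in pick_versions(sample).items():
    print(f"{package}: {version}")
```

In practice the text could come from `pip freeze` piped into a file; that part is left out to keep the sketch self-contained.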

@hykilpikonna

Using Torch 2.4.0 did not solve the issue. Complete installation:

conda create --name unsloth_240 python=3.10 pytorch=2.4.0 pytorch-cuda=12.1 \
    cudatoolkit xformers -c pytorch -c nvidia -c xformers -y
conda activate unsloth_240
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes
# Then installed LLaMA-Factory
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

Using Torch 2.3.0 with Triton 2.3.0 led to a different error:

  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/graph.py", line 612, in run
    return super().run(*args)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/fx/interpreter.py", line 145, in run
    self.env[node] = self.run_node(node)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/graph.py", line 957, in run_node
    result = super().run_node(n)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/fx/interpreter.py", line 202, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/graph.py", line 819, in call_function
    raise LoweringException(e, target, args, kwargs).with_traceback(
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/graph.py", line 816, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/lowering.py", line 296, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/kernel/bmm.py", line 103, in tuned_bmm
    return autotune_select_algorithm("bmm", choices, [mat1, mat2], layout)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 1146, in autotune_select_algorithm
    return _ALGORITHM_SELECTOR_CACHE(*args, **kwargs)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 896, in __call__
    timings = self.lookup(
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 296, in lookup
    timings = benchmark(choices)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 887, in autotune
    return make_benchmark_fn()(choices)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 997, in benchmark_in_current_process
    timing = benchmark_choice_in_current_process(choice)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 987, in benchmark_choice_in_current_process
    result = choice.benchmark(*example_inputs, out=out)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 687, in benchmark
    return self.bmreq.benchmark(*args, output_tensor=out)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/autotune_process.py", line 455, in benchmark
    out = do_bench(fn)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/torch/_inductor/utils.py", line 170, in do_bench
    return triton_do_bench(*args, **kwargs)[0]
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/triton/testing.py", line 102, in do_bench
    fn()
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
    self.cache[device][key] = compile(
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/triton/compiler/compiler.py", line 193, in compile
    next_module = compile_ir(module, metadata)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 199, in <lambda>
    stages["llir"] = lambda src, metadata: self.make_llir(src, metadata, options, self.capability)
  File "/d/sekai/mamba/envs/unsloth_230/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 173, in make_llir
    ret = translate_triton_gpu_to_llvmir(src, capability, tma_infos, runtime.TARGET.NVVM)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: IndexError: map::at
  target: aten.bmm.default
  args[0]: TensorBox(
    View(
      StorageBox(
        ComputedBuffer(name='buf5', layout=FixedLayout('cuda', torch.float16, size=[s0, 8, s1, s1], stride=[8*s1**2, s1**2, s1, 1]), data=Pointwise(
          'cuda',
          torch.float16,
          def inner_fn(index):
              i0, i1, i2, i3 = index
              tmp0 = ops.load(buf2, i3 + i1 * s1**2 + i2 * s1 + 8 * i0 * s1**2)
              tmp1 = ops.constant(50.0, torch.float16)
              tmp2 = tmp0 / tmp1
              tmp3 = ops.tanh(tmp2)
              tmp4 = ops.constant(50.0, torch.float16)
              tmp5 = tmp3 * tmp4
              tmp6 = ops.load(arg8_1, i3 + i2 * s9)
              tmp7 = tmp5 + tmp6
              tmp8 = ops.to_dtype(tmp7, torch.float32, src_dtype=torch.float16)
              tmp9 = ops.load(buf3, i2 + i1 * s1 + 8 * i0 * s1)
              tmp10 = tmp8 - tmp9
              tmp11 = ops.exp(tmp10)
              tmp12 = ops.load(buf4, i2 + i1 * s1 + 8 * i0 * s1)
              tmp13 = tmp11 / tmp12
              tmp14 = ops.to_dtype(tmp13, torch.float16, src_dtype=torch.float32)
              return tmp14
          ,
          ranges=[s0, 8, s1, s1],
          origin_node=expand_4,
          origins={convert_element_type_3, tanh, div_1, exp, convert_el...
        ))
      ),
      size=[8*s0, s1, s1],
      reindex=lambda i0, i1, i2: [ModularIndexing(i0, 8, s0), ModularIndexing(i0, 1, 8), i1, i2],
      origins={convert_element_type_3, tanh, div_1, exp, convert_el...
    )
  )
  args[1]: TensorBox(
    View(
      View(
        StorageBox(
          ComputedBuffer(name='buf6', layout=FixedLayout('cuda', torch.float16, size=[s0, 4, 2, s1, 256], stride=[2048*s1, 512*s1, 256*s1, 256, 1]), data=Pointwise(
            'cuda',
            torch.float16,
            def inner_fn(index):
                i0, i1, i2, i3, i4 = index
                tmp0 = ops.load(arg5_1, i4 + 256 * i1 + 1024 * i3 + 1024 * i0 * s1)
                return tmp0
            ,
            ranges=[s0, 4, 2, s1, 256],
            origin_node=clone_1,
            origins={clone_1}
          ))
        ),
        size=[s0, 8, s1, 256],
        reindex=lambda i0, i1, i2, i3: [i0, ModularIndexing(i1, 2, 4), ModularIndexing(i1, 1, 2), i2, i3],
        origins={clone_1, view_1}
      ),
      size=[8*s0, s1, 256],
      reindex=lambda i0, i1, i2: [ModularIndexing(i0, 8, s0), ModularIndexing(i0, 1, 8), i1, i2],
      origins={view_6}
    )
  )

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
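
The last line of the traceback suggests two environment variables for more detail. A minimal sketch of rerunning the script with both set is shown below. The helper names are hypothetical, and `unsloth_example.py` is the script name used earlier in this thread.

```python
import os
import subprocess
import sys


def dynamo_debug_env(base=None):
    """Return a copy of the environment with the two variables from the traceback set."""
    env = dict(os.environ if base is None else base)
    env["TORCH_LOGS"] = "+dynamo"
    env["TORCHDYNAMO_VERBOSE"] = "1"
    return env


def rerun_with_debug(script):
    """Run `script` with the current interpreter and the debug environment; return its exit code."""
    completed = subprocess.run([sys.executable, script], env=dynamo_debug_env())
    return completed.returncode


# Example call, not executed here:
# rerun_with_debug("unsloth_example.py")
```

Setting the variables in a child process keeps the current shell unchanged, and the variables are in place before torch is imported.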

@danielhanchen
Contributor

@hykilpikonna Sorry about that. Do you know your GPU, CUDA version, etc.? Could you take a screenshot of the stats section Unsloth prints out? Thanks!
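
When a screenshot is inconvenient, the same details can be gathered as text. The sketch below is not Unsloth's own stats code; the package list and function names are assumptions. It reports installed versions plus the GPU name and compute capability, which for a V100 is 7.0 (the value shown as "CUDA" in the Unsloth banner in this thread).

```python
from importlib import metadata

# Packages discussed in this thread; the list is an assumption.
PACKAGES = ("unsloth", "unsloth_zoo", "torch", "triton", "xformers", "transformers")


def package_versions(names=PACKAGES):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def gpu_summary():
    """Describe GPU 0 through torch, or explain why that is not possible."""
    try:
        import torch
    except Exception:  # torch missing or broken in this environment
        return "torch not importable"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    return f"{name}, compute capability {major}.{minor}, CUDA toolkit {torch.version.cuda}"


def format_report(versions, gpu):
    """Render the collected details as plain text, one item per line."""
    lines = [f"{name}: {version}" for name, version in versions.items()]
    lines.append(f"gpu: {gpu}")
    return "\n".join(lines)


print(format_report(package_versions(), gpu_summary()))
```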

@ergosumdre

This is my setup:
triton = 2.1.0
torch = 2.5.1+cu118

@hykilpikonna

hykilpikonna commented Nov 27, 2024

@hykilpikonna Sorry about that - do you know GPU, CUDA version etc - could you take a screenshot of the stats section Unsloth prints out thanks

My GPU and CUDA version:

> nvidia-smi
Wed Nov 27 13:09:44 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           Off |   00000000:89:00.0 Off |                    0 |
| N/A   38C    P0             54W /  300W |     465MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           Off |   00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0             53W /  300W |     309MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1082      C   /d/sekai/mamba/envs/vits/bin/python3          462MiB |
|    1   N/A  N/A    668530      C   ...i/mamba/envs/unsloth_240/bin/python        306MiB |
+-----------------------------------------------------------------------------------------+

Torch 2.4.0 Stats section:

> python unsloth_example.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.11: Fast Llama patching. Transformers:4.46.1.
   \\   /|    GPU: Tesla V100-SXM2-16GB. Max memory: 15.766 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.0. CUDA: 7.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth 2024.11.11 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 210,289 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040
  0%|                                                                                                                                                                                                                            | 0/60 [00:00<?, ?it/s]
python: /project/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp:31: virtual mlir::LogicalResult {anonymous}::ReduceOpConversion::matchAndRewrite(mlir::triton::ReduceOp, mlir::ConvertOpToLLVMPattern<mlir::triton::ReduceOp>::OpAdaptor, mlir::ConversionPatternRewriter&) const: Assertion `helper.isSupportedLayout() && "Unexpected srcLayout in ReduceOpConversion"' failed.
zsh: IOT instruction (core dumped)  python unsloth_example.py

Torch 2.5.1:

> python unsloth_example.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.11: Fast Llama patching. Transformers:4.46.1.
   \\   /|    GPU: Tesla V100-SXM2-16GB. Max memory: 15.766 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1. CUDA: 7.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth 2024.11.11 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 210289/210289 [00:15<00:00, 13213.35 examples/s]
max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 210,289 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040
  0%|                                                                                                                                                                                                                            | 0/60 [00:00<?, ?it/s]
python: /project/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp:31: virtual mlir::LogicalResult {anonymous}::ReduceOpConversion::matchAndRewrite(mlir::triton::ReduceOp, mlir::ConvertOpToLLVMPattern<mlir::triton::ReduceOp>::OpAdaptor, mlir::ConversionPatternRewriter&) const: Assertion `helper.isSupportedLayout() && "Unexpected srcLayout in ReduceOpConversion"' failed.
zsh: IOT instruction (core dumped)  python unsloth_example.py

Since you said you don't have access to a V100 to test things, I would be happy to give you access to my system if you want to debug anything. If you can send me an SSH public key, I'll give you SSH access.
