RuntimeError: stack expects each tensor to be equal size, but got... #740

Closed · 7 of 8 tasks
dachenlian opened this issue on Oct 18, 2023 · 21 comments
Labels: bug (Something isn't working)

Comments

@dachenlian

dachenlian commented Oct 18, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training should continue after evaluation + logging to wandb

Current behaviour

I'm running WSL2 on Windows 11 and I get the following traceback:

[2023-10-18 03:25:44,364] [INFO] [axolotl.monkeypatch.mistral._prepare_decoder_attention_mask:113] [PID:2711681] [RANK:0] skipping sliding window mask, not broadcastable with attention mask
  0%|                                                                                                                         | 0/184 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 51, in <module>
    fire.Fire(do_cli)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 47, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/richard/github/axolotl/src/axolotl/train.py", line 118, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3094, in evaluate
    self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer_callback.py", line 388, in on_evaluate
    return self.call_event("on_evaluate", args, state, control, metrics=metrics)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer_callback.py", line 406, in call_event
    result = getattr(callback, event)(
  File "/home/richard/github/axolotl/src/axolotl/utils/callbacks.py", line 512, in on_evaluate
    log_table_from_dataloader("Eval", eval_dataloader)
  File "/home/richard/github/axolotl/src/axolotl/utils/callbacks.py", line 472, in log_table_from_dataloader
    predictions = trainer.model.generate(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/peft/peft_model.py", line 1022, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1606, in generate
    return self.greedy_search(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2454, in greedy_search
    outputs = self(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1045, in forward
    outputs = self.model(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 536, in mistral_model_forward
    layer_outputs = decoder_layer(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 614, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 206, in flashattn_forward
    qkv = torch.stack(
RuntimeError: stack expects each tensor to be equal size, but got [1, 32, 1, 128] at entry 0 and [1, 32, 4096, 128] at entry 1
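
A minimal standalone reproduction of the same torch.stack failure, using the shapes from the message above (just an illustration of the failure mode, not axolotl's code; presumably the query covers only the newest generated token while the cached key/value still span the full sequence):

import torch

# Shapes copied from the error: entry 0 is a single-token query, entry 1 is a
# key/value tensor spanning the full 4096-token sequence, so they cannot be stacked.
q = torch.randn(1, 32, 1, 128)     # [batch, n_heads, q_len=1, head_dim]
k = torch.randn(1, 32, 4096, 128)  # [batch, n_heads, kv_len=4096, head_dim]
v = torch.randn(1, 32, 4096, 128)

torch.stack([q, k, v], dim=2)
# RuntimeError: stack expects each tensor to be equal size,
# but got [1, 32, 1, 128] at entry 0 and [1, 32, 4096, 128] at entry 1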

Steps to reproduce

Enable wandb logging and set val_set_size? I am unsure; I thought the most recent commit was supposed to fix this error.

Config yaml

base_model: Open-Orca/Mistral-7B-OpenOrca
base_model_config: Open-Orca/Mistral-7B-OpenOrca
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: json
    data_files: data/sharegpt/oa-train-32k-axolotl-sharegpt.json
    type: sharegpt
dataset_prepared_path: data/sharegpt/last_run_prepared
val_set_size: 0.01
output_dir: ./oa-mistral-openorca-ckpt

torch_compile: true

sequence_len: 32768
sample_packing: true
pad_to_sequence_len: true
group_by_length: true

adapter: qlora

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project: OA-Mistral-7B-OpenOrca
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00005

# Augmentation techniques
# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
# currently only supported on Llama and Mistral
noisy_embedding_alpha: 5

train_on_inputs: false
group_by_length: true
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

save_steps: 20
eval_batch_size: 1
warmup_steps: 10
eval_steps: 20
eval_table_size: 5
eval_table_max_new_tokens: 128
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
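
As a side note on the noisy_embedding_alpha option above: a rough sketch of what NEFTune-style noise does, based on the linked paper (uniform noise scaled by alpha over sqrt(seq_len * hidden_dim), applied to the input embeddings during training only; this is just an approximation, not axolotl's implementation):

import torch

def neftune_noise(embeds: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    # embeds: [batch, seq_len, hidden_dim]; perturb only during training.
    seq_len, hidden_dim = embeds.shape[1], embeds.shape[2]
    scale = alpha / (seq_len * hidden_dim) ** 0.5
    noise = torch.empty_like(embeds).uniform_(-1, 1) * scale
    return embeds + noise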

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10.12

axolotl branch-commit

main/a045db0

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
dachenlian added the bug label on Oct 18, 2023
@NanoCode012
Collaborator

Could you try pulling the latest? This seems to be a duplicate of #737.

@dachenlian
Author

I closed that one after I noticed there was a newer commit, assuming it would have fixed it. Turns out I'm still running into the same problem after pulling the latest.

I did notice the other things that were mentioned in a045db0, such as the loss being lower and less VRAM usage.

@winglian
Collaborator

winglian commented Oct 19, 2023

@dachenlian Can you try this with eval_sample_packing: false?

@winglian
Collaborator

@dachenlian also, how large/small is your dataset?

@winglian
Collaborator

@NanoCode012 I think we may need to add a validation check: if sample_packing is true and the eval table is enabled, that particular configuration is likely invalid.
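
Something along these lines, roughly (a hypothetical sketch only; field names mirror the YAML keys above, not necessarily the actual check that gets merged):

def validate_eval_table_config(cfg: dict) -> None:
    # Hypothetical sketch of the proposed validation, not axolotl's actual code.
    if cfg.get("sample_packing") and cfg.get("eval_table_size"):
        raise ValueError(
            "eval_table_size is likely invalid when sample_packing is enabled; "
            "disable the eval table or set eval_sample_packing: false"
        )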

@NanoCode012
Collaborator

Hello @dachenlian! May I ask if the above comments solved it for you? We've just added validation for this in #769 so the error is clearer in the future.

@dachenlian
Author

Hi! Sorry, my machine was busy training a model so I couldn't get back to you right away. My dataset has about 26000 training examples, with the max being 32K tokens.

I tried a quick debug using a smaller 8K training set of 2 samples and an evaluation set of 2 samples, running evaluation after 1 step.

The strange thing is that it worked in this case.
My config:

base_model: Open-Orca/Mistral-7B-OpenOrca
base_model_config: Open-Orca/Mistral-7B-OpenOrca
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: json
    data_files: data/sharegpt/oa-debug-8k-axolotl-sharegpt.json
    type: sharegpt
dataset_prepared_path: data/sharegpt/debug-last_run_prepared
val_set_size: .5
output_dir: ./debug-oa-mistral-openorca-ckpt

torch_compile: true

sequence_len: 32768
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
group_by_length: true

adapter: qlora

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project: OA-Mistral-7B-OpenOrca
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00005

# Augmentation techniques
# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
# currently only supported on Llama and Mistral
noisy_embedding_alpha: 5

train_on_inputs: false
group_by_length: true
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

auto_resume_from_checkpoint: true

eval_batch_size: 1
eval_steps: 1
save_steps: 20
warmup_steps: 10
eval_table_size: 5
eval_table_max_new_tokens: 128
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

After switching to eval_sample_packing: true, I get an OutOfMemoryError:

  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 51, in <module>
    fire.Fire(do_cli)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 47, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/richard/github/axolotl/src/axolotl/train.py", line 118, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3066, in evaluate
    output = eval_loop(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3255, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3474, in prediction_step
    loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
  File "/home/richard/github/axolotl/src/axolotl/utils/trainer.py", line 311, in compute_loss
    return super().compute_loss(model, inputs, return_outputs=return_outputs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2801, in compute_loss
    outputs = model(**inputs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/operations.py", line 636, in forward
    return model_forward(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/operations.py", line 624, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/peft/peft_model.py", line 965, in forward
    return self.base_model(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 106, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1045, in forward
    outputs = self.model(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 536, in mistral_model_forward
    layer_outputs = decoder_layer(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 614, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 226, in flashattn_forward
    qkv_unpad, cu_seqlens_q, max_seqlen_q, _, output_pad_fn = generate_qkv(
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 337, in generate_qkv
    q_unpad, indices_q, cu_seqlens_q, max_seqlen_q = unpad_input(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/flash_attn/bert_padding.py", line 118, in unpad_input
    index_first_axis(rearrange(hidden_states, "b s ... -> (b s) ..."), indices),
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/flash_attn/bert_padding.py", line 17, in forward
    return torch.gather(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7325.08 GiB (GPU 0; 47.99 GiB total capacity; 15.01 GiB already allocated; 28.03 GiB free; 17.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

After switching to longer training samples (token lengths: [22574, 16895]) and eval samples ([8070, 17368]), I still get an OutOfMemoryError with eval_sample_packing: false.

With eval_sample_packing: true, I get the following error:

[2023-10-24 09:18:05,167] [INFO] [axolotl.utils.dataloader.generate_batches:187] [PID:1059841] [RANK:0] 83b97b859aa5f81b2f0f86ba2a675efaf515ad2d5e2b8652cf2de7e1c2267350
[2023-10-24 09:18:05,167] [INFO] [axolotl.utils.dataloader._len_est:264] [PID:1059841] [RANK:0] packing_efficiency_estimate: 0.61 total_num_tokens per device: 25352
[2023-10-24 09:18:13,490] [INFO] [axolotl.monkeypatch.mistral._prepare_decoder_attention_mask:113] [PID:1059841] [RANK:0] skipping sliding window mask, not broadcastable with attention mask
[2023-10-24 09:18:22,842] [INFO] [axolotl.monkeypatch.mistral._prepare_decoder_attention_mask:113] [PID:1059841] [RANK:0] skipping sliding window mask, not broadcastable with attention mask
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65540,0,0], thread: [64,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65540,0,0], thread: [65,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65540,0,0], thread: [66,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65540,0,0], thread: [67,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.

This goes on for a while...

0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65569,0,0], thread: [28,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65569,0,0], thread: [29,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65569,0,0], thread: [30,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65569,0,0], thread: [31,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
  0%|                                                                                                                                                  | 0/1 [00:18<?, ?it/s]
Traceback (most recent call last):
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 51, in <module>
    fire.Fire(do_cli)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 47, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/richard/github/axolotl/src/axolotl/train.py", line 118, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3094, in evaluate
    self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer_callback.py", line 388, in on_evaluate
    return self.call_event("on_evaluate", args, state, control, metrics=metrics)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer_callback.py", line 406, in call_event
    result = getattr(callback, event)(
  File "/home/richard/github/axolotl/src/axolotl/utils/callbacks.py", line 512, in on_evaluate
    log_table_from_dataloader("Eval", eval_dataloader)
  File "/home/richard/github/axolotl/src/axolotl/utils/callbacks.py", line 472, in log_table_from_dataloader
    predictions = trainer.model.generate(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/peft/peft_model.py", line 1022, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1606, in generate
    return self.greedy_search(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2454, in greedy_search
    outputs = self(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1045, in forward
    outputs = self.model(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 536, in mistral_model_forward
    layer_outputs = decoder_layer(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 614, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 270, in flashattn_forward
    ) = generate_qkv(
  File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 362, in generate_qkv
    v_unpad, _, _, _ = unpad_input(v, key_padding_mask)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/flash_attn/bert_padding.py", line 109, in unpad_input
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
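
For the device-side assert above, the usual way to localize it is to make CUDA kernel launches synchronous before any CUDA work is done (just a debugging aid, as the error message suggests; the exact failing index will vary):

import os

# Must be set before torch initializes CUDA so the failing kernel is reported
# at its real call site instead of a later, unrelated API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"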

@dachenlian
Author

I pulled the latest commit 20aa4b5 and tried training again.

This is my config:

base_model: ehartford/dolphin-2.1-mistral-7b
base_model_config: ehartford/dolphin-2.1-mistral-7b
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: json
    data_files: data/sharegpt/oa-train-32k-axolotl-sharegpt.json
    type: sharegpt
    conversation: chatml
dataset_prepared_path: data/sharegpt/32k-last_run_prepared
val_set_size: 0.001
output_dir: ./oa-mistral-dolphin-ckpt

torch_compile: false

adapter: lora

sequence_len: 32768
sample_packing: true
pad_to_sequence_len: true

lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project: OA-Mistral-7B-Dolphin
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00004

# Augmentation techniques
# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
# currently only supported on Llama and Mistral
noisy_embedding_alpha: 5

train_on_inputs: false
group_by_length: true
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
flash_attn_fuse_qkv: false # Whether to fuse QKV into a single operation
flash_attn_fuse_mlp: false # Whether to fuse part of the MLP into a single operation
# Whether to use scaled-dot-product attention
# https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
sdp_attention: false


auto_resume_from_checkpoint: true

eval_batch_size: 1
eval_steps: 1
eval_sample_packing: false
save_steps: 20
warmup_steps: 10
eval_table_size: 5
eval_table_max_new_tokens: 128
debug: false
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

I set eval_steps: 1, val_set_size: 1, and eval_batch_size: 1 just to see if it would work.

This is my output:

{'loss': 1.6353, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.0}
  0%|                                                                                                     | 1/17729 [00:24<121:00:23, 24.57s/it]Traceback (most recent call last):
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 51, in <module>
    fire.Fire(do_cli)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 47, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/richard/github/axolotl/src/axolotl/train.py", line 116, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3062, in evaluate
    eval_dataloader = self.get_eval_dataloader(eval_dataset)
  File "/home/richard/github/axolotl/src/axolotl/core/trainer_builder.py", line 199, in get_eval_dataloader
    MultipackDistributedDataloader(
  File "/home/richard/github/axolotl/src/axolotl/utils/dataloader.py", line 156, in __init__
    dataset.data.column("position_ids")
  File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/datasets/table.py", line 390, in column
    return self.table.column(*args, **kwargs)
  File "pyarrow/table.pxi", line 1611, in pyarrow.lib._Tabular.column
  File "pyarrow/table.pxi", line 1547, in pyarrow.lib._Tabular._ensure_integer_index
KeyError: 'Field "position_ids" does not exist in schema'

@winglian
Collaborator

Can you add eval_sample_packing: false?

@viantirreau

Hi! I got the same error, also with eval_sample_packing: false. What seems to work to fix the evaluation part is to set sample_packing: false, although I don't know the negative repercussions of applying this to the training data as well.

My config.yml
base_model: PY007/TinyLlama-1.1B-intermediate-step-480k-1T

model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true

load_in_8bit: true
load_in_4bit: false
strict: false

datasets:
- path: json
  type: alpaca
  data_files: ./tiny/alpaca_soft_and_manual_dataset.jsonl
dataset_prepared_path:
val_set_size: 0.1
output_dir: ./out

sequence_len: 8192
sample_packing: false # usually true, seems to fix the evaluation part
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: tiny
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model: end # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 25
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 2 # just to fail earlier, usually 20
eval_table_size: 5
eval_table_max_new_tokens: 500
eval_sample_packing: false
save_steps: 20
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

@akjindal53244

akjindal53244 commented Oct 25, 2023

I am also facing the same error with latest master branch! Can we please fix it?

@winglian
Collaborator

I set eval_steps: 1, val_set_size: 1, and eval_batch_size: 1 just to see if it would work.

Setting val_set_size to 1 would never be expected to work, as this implies that there is no training dataset.

@winglian
Collaborator

@dachenlian also don't set group_by_length: true when using sample packing

@winglian
Collaborator

I am also facing the same error with latest master branch! Can we please fix it?

@akjindal53244 Can you please provide your YAML config so we can help you?

@dachenlian
Author

dachenlian commented Oct 28, 2023

@dachenlian also don't set group_by_length: true when using sample packing

@winglian
May I ask why?

@dachenlian
Author

I set eval_steps: 1, val_set_size: 1, and eval_batch_size: 1 just to see if it would work.

Setting val_set_size to 1 would never be expected to work, as this implies that there is no training dataset.

Yes, sorry, that was a typo. My val_set_size was set to .5 with a training set size of 2, so there would be one validation sample.

@winglian
Collaborator

@dachenlian also don't set group_by_length: true when using sample packing

@winglian May I ask why?

They are incompatible features, and I expect using both might lead to unexpected behavior.

@manishiitg

The same issue is happening for me as well.

Hi! I got the same error, also with eval_sample_packing: false. What seems to work to fix the evaluation part is to set sample_packing: false

Only this seems to fix it.

@NanoCode012
Collaborator

Does this occur with any of the example configs, so that we can reproduce it?

@LeonardoEmili
Contributor

Yes @NanoCode012, it happens as long as sample_packing=True, and it is triggered only during evaluation because of the way the model's forward function is monkey-patched (if sample_packing is active, it will run both train and eval steps using the custom multi-packed forward).

However, it looks like this has been fixed in main thanks to padding before calling torch.stack. I have been testing it over the past two days, and training was never interrupted by a failed evaluation step. Is there an ETA for the next release? Many small issues have been fixed now, and it would be great to have the changes delivered in a stable release.
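
Conceptually the fix pads the query up to the key/value length along the sequence dimension before stacking; a rough sketch of the idea (not the exact code in main):

import torch
import torch.nn.functional as F

# Sketch only: pad the single-token query along the sequence dimension so that
# q/k/v share a shape before torch.stack is called.
q = torch.randn(1, 32, 1, 128)
k = torch.randn(1, 32, 4096, 128)
v = torch.randn(1, 32, 4096, 128)

if q.size(2) != k.size(2):
    q = F.pad(q, (0, 0, 0, k.size(2) - q.size(2)))  # pad dim 2 (sequence) on the right

qkv = torch.stack([q, k, v], dim=2)  # each entry is now [1, 32, 4096, 128]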

cc @winglian

@bursteratom
Collaborator

Closing this as testing with the latest main gives no error. This is the config yaml I used for reference:

base_model: Open-Orca/Mistral-7B-OpenOrca
base_model_config: Open-Orca/Mistral-7B-OpenOrca
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/out
torch_compile: true

sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
group_by_length: true

adapter: qlora

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project: OA-Mistral-7B-OpenOrca
wandb_entity: axolotl-ai
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00005

# Augmentation techniques
# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
# currently only supported on Llama and Mistral
noisy_embedding_alpha: 5

train_on_inputs: false
group_by_length: true
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

save_steps: 20
eval_batch_size: 1
warmup_steps: 10
eval_steps: 20
eval_table_size: 5
eval_table_max_new_tokens: 128
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
