# RuntimeError: stack expects each tensor to be equal size, but got... #740
## Comments
Could you try to pull the latest? Seems like a duplicate of #737 |
I closed that one after I noticed there was a newer commit, assuming it would have fixed it. Turns out I'm still running into the same problem after pulling the latest. I did notice the other things that were mentioned in a045db0, such as the loss being lower and less VRAM usage. |
@dachenlian can you try this with |
@dachenlian also, how large/small is your dataset? |
@NanoCode012 I think we may have to add a validation: if sample_packing is true and the eval table is enabled, that particular configuration is likely invalid. |
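A sketch of what such a check could look like (illustrative only; the actual validation landed later in #769, and the cfg keys here are just the YAML options used in this thread):

```python
# Illustrative sketch of the proposed validation, not the code from #769.
# The cfg keys mirror the YAML options used in this thread.
def validate_config(cfg: dict) -> None:
    if cfg.get("sample_packing") and cfg.get("eval_table_size"):
        raise ValueError(
            "sample_packing together with eval_table_size is likely an "
            "invalid combination; disable one of the two."
        )
```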
Hello @dachenlian ! May I ask if the above comments solved it for you? We've just added validation for this in #769 for future clarification. |
Hi! Sorry, my machine was busy training a model, so I couldn't get back to you right away. My dataset has about 26,000 training examples, with the max being 32K tokens. I tried a quick debug by using a smaller 8K training set of 2 samples and an evaluation set of 2 samples and running evaluation after 1 step. The strange thing is that it worked in this case.

```yaml
base_model: Open-Orca/Mistral-7B-OpenOrca
base_model_config: Open-Orca/Mistral-7B-OpenOrca
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: true
strict: false
datasets:
- path: json
data_files: data/sharegpt/oa-debug-8k-axolotl-sharegpt.json
type: sharegpt
dataset_prepared_path: data/sharegpt/debug-last_run_prepared
val_set_size: .5
output_dir: ./debug-oa-mistral-openorca-ckpt
torch_compile: true
sequence_len: 32768
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
group_by_length: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
wandb_project: OA-Mistral-7B-OpenOrca
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:
gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00005
# Augmentation techniques
# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
# currently only supported on Llama and Mistral
noisy_embedding_alpha: 5
train_on_inputs: false
group_by_length: true
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
auto_resume_from_checkpoint: true
eval_batch_size: 1
eval_steps: 1
save_steps: 20
warmup_steps: 10
eval_table_size: 5
eval_table_max_new_tokens: 128
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"
```

After switching …:

```
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 51, in <module>
fire.Fire(do_cli)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 47, in do_cli
train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
File "/home/richard/github/axolotl/src/axolotl/train.py", line 118, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
return inner_training_loop(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3066, in evaluate
output = eval_loop(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3255, in evaluation_loop
loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3474, in prediction_step
loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
File "/home/richard/github/axolotl/src/axolotl/utils/trainer.py", line 311, in compute_loss
return super().compute_loss(model, inputs, return_outputs=return_outputs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2801, in compute_loss
outputs = model(**inputs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/operations.py", line 636, in forward
return model_forward(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/operations.py", line 624, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/peft/peft_model.py", line 965, in forward
return self.base_model(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 106, in forward
return self.model.forward(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1045, in forward
outputs = self.model(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 536, in mistral_model_forward
layer_outputs = decoder_layer(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 614, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 226, in flashattn_forward
qkv_unpad, cu_seqlens_q, max_seqlen_q, _, output_pad_fn = generate_qkv(
File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 337, in generate_qkv
q_unpad, indices_q, cu_seqlens_q, max_seqlen_q = unpad_input(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/flash_attn/bert_padding.py", line 118, in unpad_input
index_first_axis(rearrange(hidden_states, "b s ... -> (b s) ..."), indices),
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/flash_attn/bert_padding.py", line 17, in forward
return torch.gather(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7325.08 GiB (GPU 0; 47.99 GiB total capacity; 15.01 GiB already allocated; 28.03 GiB free; 17.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

After switching to longer training samples (token lengths: [22574, 16895]) and eval samples ([8070, 17368]), I still get an OutOfMemoryError with …. With …:

```
[2023-10-24 09:18:05,167] [INFO] [axolotl.utils.dataloader.generate_batches:187] [PID:1059841] [RANK:0] 83b97b859aa5f81b2f0f86ba2a675efaf515ad2d5e2b8652cf2de7e1c2267350
[2023-10-24 09:18:05,167] [INFO] [axolotl.utils.dataloader._len_est:264] [PID:1059841] [RANK:0] packing_efficiency_estimate: 0.61 total_num_tokens per device: 25352
[2023-10-24 09:18:13,490] [INFO] [axolotl.monkeypatch.mistral._prepare_decoder_attention_mask:113] [PID:1059841] [RANK:0] skipping sliding window mask, not broadcastable with attention mask
[2023-10-24 09:18:22,842] [INFO] [axolotl.monkeypatch.mistral._prepare_decoder_attention_mask:113] [PID:1059841] [RANK:0] skipping sliding window mask, not broadcastable with attention mask
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65540,0,0], thread: [64,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65540,0,0], thread: [65,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65540,0,0], thread: [66,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65540,0,0], thread: [67,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
```

This goes on for a while...

```
…0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65569,0,0], thread: [28,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65569,0,0], thread: [29,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65569,0,0], thread: [30,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343967769/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [65569,0,0], thread: [31,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
0%| | 0/1 [00:18<?, ?it/s]
Traceback (most recent call last):
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 51, in <module>
fire.Fire(do_cli)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 47, in do_cli
train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
File "/home/richard/github/axolotl/src/axolotl/train.py", line 118, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
return inner_training_loop(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3094, in evaluate
self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer_callback.py", line 388, in on_evaluate
return self.call_event("on_evaluate", args, state, control, metrics=metrics)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer_callback.py", line 406, in call_event
result = getattr(callback, event)(
File "/home/richard/github/axolotl/src/axolotl/utils/callbacks.py", line 512, in on_evaluate
log_table_from_dataloader("Eval", eval_dataloader)
File "/home/richard/github/axolotl/src/axolotl/utils/callbacks.py", line 472, in log_table_from_dataloader
predictions = trainer.model.generate(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/peft/peft_model.py", line 1022, in generate
outputs = self.base_model.generate(**kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1606, in generate
return self.greedy_search(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2454, in greedy_search
outputs = self(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1045, in forward
outputs = self.model(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 536, in mistral_model_forward
layer_outputs = decoder_layer(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 614, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 270, in flashattn_forward
) = generate_qkv(
File "/home/richard/github/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 362, in generate_qkv
v_unpad, _, _, _ = unpad_input(v, key_padding_mask)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/flash_attn/bert_padding.py", line 109, in unpad_input
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
|
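Both error messages above point at standard PyTorch debugging knobs; a minimal sketch of setting them (the max_split_size_mb value is illustrative, and a 7325 GiB request likely points at a bad index rather than fragmentation):

```python
# Standard PyTorch debug/mitigation knobs quoted by the errors above.
# Both must be set before CUDA initializes, i.e. before torch touches
# the GPU in the training process.
import os

# Make kernel launches synchronous so the device-side assert surfaces
# at the real failing call site in the Python traceback.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Allocator tuning suggested by the OOM message; only helps genuine
# fragmentation, not an impossible 7325 GiB allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```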
I pulled the latest commit 20aa4b5 and tried training again. This is my config:

```yaml
base_model: ehartford/dolphin-2.1-mistral-7b
base_model_config: ehartford/dolphin-2.1-mistral-7b
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: true
strict: false
datasets:
- path: json
data_files: data/sharegpt/oa-train-32k-axolotl-sharegpt.json
type: sharegpt
conversation: chatml
dataset_prepared_path: data/sharegpt/32k-last_run_prepared
val_set_size: 0.001
output_dir: ./oa-mistral-dolphin-ckpt
torch_compile: false
adapter: lora
sequence_len: 32768
sample_packing: true
pad_to_sequence_len: true
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
wandb_project: OA-Mistral-7B-Dolphin
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:
gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00004
# Augmentation techniques
# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to add noise to embeddings
# currently only supported on Llama and Mistral
noisy_embedding_alpha: 5
train_on_inputs: false
group_by_length: true
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
flash_attn_fuse_qkv: false # Whether to fuse QKV into a single operation
flash_attn_fuse_mlp: false # Whether to fuse part of the MLP into a single operation
# Whether to use scaled-dot-product attention
# https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
sdp_attention: false
auto_resume_from_checkpoint: true
eval_batch_size: 1
eval_steps: 1
eval_sample_packing: false
save_steps: 20
warmup_steps: 10
eval_table_size: 5
eval_table_max_new_tokens: 128
debug: false
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"
```

I set …. This is my output:

```
{'loss': 1.6353, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.0}
0%| | 1/17729 [00:24<121:00:23, 24.57s/it]
Traceback (most recent call last):
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 51, in <module>
fire.Fire(do_cli)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/richard/github/axolotl/src/axolotl/cli/train.py", line 47, in do_cli
train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
File "/home/richard/github/axolotl/src/axolotl/train.py", line 116, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
return inner_training_loop(
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3062, in evaluate
eval_dataloader = self.get_eval_dataloader(eval_dataset)
File "/home/richard/github/axolotl/src/axolotl/core/trainer_builder.py", line 199, in get_eval_dataloader
MultipackDistributedDataloader(
File "/home/richard/github/axolotl/src/axolotl/utils/dataloader.py", line 156, in __init__
dataset.data.column("position_ids")
File "/home/richard/miniforge3/envs/axolotl/lib/python3.10/site-packages/datasets/table.py", line 390, in column
return self.table.column(*args, **kwargs)
File "pyarrow/table.pxi", line 1611, in pyarrow.lib._Tabular.column
File "pyarrow/table.pxi", line 1547, in pyarrow.lib._Tabular._ensure_integer_index
KeyError: 'Field "position_ids" does not exist in schema'
```
|
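For context on why that column matters: with sample packing, multiple samples are concatenated into one row and position ids restart at each boundary, so the packed dataset carries an explicit position_ids column that an unpacked eval split never gets. A toy illustration (not axolotl code):

```python
# Toy illustration, not axolotl code: packing concatenates samples and
# restarts position ids at each sample boundary, so a packed row built
# from lengths [3, 2] carries position_ids [0, 1, 2, 0, 1]. An unpacked
# eval split never gains this column, hence the KeyError above.
import torch

def packed_position_ids(lengths: list[int]) -> torch.Tensor:
    return torch.cat([torch.arange(n) for n in lengths])

print(packed_position_ids([3, 2]))  # tensor([0, 1, 2, 0, 1])
```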
can you add |
Hi! I got the same error, also with …

My config.yml:

```yaml
base_model: PY007/TinyLlama-1.1B-intermediate-step-480k-1T
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true
load_in_8bit: true
load_in_4bit: false
strict: false
datasets:
- path: json
type: alpaca
data_files: ./tiny/alpaca_soft_and_manual_dataset.jsonl
dataset_prepared_path:
val_set_size: 0.1
output_dir: ./out
sequence_len: 8192
sample_packing: false # usually true, seems to fix the evaluation part
pad_to_sequence_len: true
adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project: tiny
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model: end # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 25
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 10
eval_steps: 2 # just to fail earlier, usually 20
eval_table_size: 5
eval_table_max_new_tokens: 500
eval_sample_packing: false
save_steps: 20
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"
```
|
I am also facing the same error with the latest master branch! Can we please fix it? |
Setting val_set_size to 1 would never be expected to work, as this implies that there is no training dataset. |
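In other words, val_set_size acts as a split fraction, so 1.0 would send every example to the eval split. A toy sketch with the datasets library (illustrative only; axolotl's internals may differ):

```python
# Illustrative only: val_set_size behaves like a train/validation split
# fraction, so 1.0 would leave zero training examples.
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c", "d"]})
split = ds.train_test_split(test_size=0.5)  # like val_set_size: 0.5
print(len(split["train"]), len(split["test"]))  # 2 2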
@dachenlian also don't set |
@akjindal53244 Can you please provide your YML config for us to be able to help you? |
@winglian |
Yes, sorry, that was a typo. My val_set_size was set to .5 with a training set size of 2, so there would be one validation sample. |
They are incompatible features, and using both, I expect, might lead to unexpected behavior. |
Same issue is happening for me as well. Only this seems to fix it: … |
Does this occur with any of the example configs, so that we can reproduce it? |
Yes @NanoCode012, it happens as long as …. However, it looks like this has been fixed in main thanks to padding before calling torch.stack. I have been testing it over the past two days and training was never interrupted due to a failed evaluation step. Is there an ETA for the next release? I think many small issues have been fixed now and it would be great to have the changes delivered in a stable release. cc @winglian |
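For reference, the titular error and the padding fix are easy to reproduce in isolation (a minimal sketch, not the actual axolotl patch):

```python
# Minimal reproduction of the titular error and the fix described above
# (not the actual axolotl patch): torch.stack requires equal shapes, so
# variable-length sequences must be padded first.
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.ones(5, dtype=torch.long), torch.ones(3, dtype=torch.long)]
# torch.stack(seqs)  # RuntimeError: stack expects each tensor to be equal size
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([2, 5])
```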
Closing this as testing with the latest main gives no error. This is the config yaml I used for reference:
|
Please check that this issue hasn't been reported before.
### Expected Behavior

Training should continue after evaluation + logging to wandb.

### Current behaviour

I'm running Windows 11 in WSL2 and I get the following traceback:

### Steps to reproduce

Enable wandb logging and set val_set_size? I am unsure, but the most recent commit was supposed to fix this error?

### Config yaml

### Possible solution

No response

### Which Operating Systems are you using?

### Python Version

3.10.12

### axolotl branch-commit

main/a045db0

### Acknowledgements