[BUG]: Llama3.1-70B-instruct save model #6108
Comments
Hi, is your final saved model correct? Since you used tp = 8, the embed_tokens weight is saved in its sharded form.
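(With tensor parallelism the embedding table is typically split along the vocabulary dimension across the tp group, so until the shards are gathered each rank holds only roughly vocab_size / tp_size rows.)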
The final saved model is not correct!
Hi, could you please tell me how to solve this issue?
I couldn't reproduce your bug locally. I suggest upgrading ColossalAI to the latest main branch and then performing a source build with:
You can reproduce it by running:
tp=8, pp=1, sp=1, Llama3.1-8b-instruct:
Traceback (most recent call last):
File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_llama_rm.py", line 403, in <module>
train(args)
File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_llama_rm.py", line 326, in train
trainer.fit(
File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/base.py", line 67, in fit
self._train(epoch)
File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/rm.py", line 381, in _train
save_checkpoint(
File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/utils/ckpt_io.py", line 61, in save_checkpoint
booster.save_optimizer(optimizer, os.path.join(save_dir, "optimizer"), shard=True)
File "/usr/local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 374, in save_optimizer
self.checkpoint_io.save_optimizer(optimizer, checkpoint, shard, gather_dtensor, prefix, size_per_shard)
File "/usr/local/lib/python3.10/site-packages/colossalai/checkpoint_io/checkpoint_io_base.py", line 197, in save_optimizer
self.save_sharded_optimizer(optimizer, checkpoint, gather_dtensor, prefix, size_per_shard)
File "/usr/local/lib/python3.10/site-packages/colossalai/checkpoint_io/hybrid_parallel_checkpoint_io.py", line 447, in save_sharded_optimizer
total_size = save_state_dict_shards(
File "/usr/local/lib/python3.10/site-packages/colossalai/checkpoint_io/utils.py", line 243, in save_state_dict_shards
for idx, shard_pair in enumerate(sharded_state_dict):
File "/usr/local/lib/python3.10/site-packages/colossalai/checkpoint_io/hybrid_parallel_checkpoint_io.py", line 155, in _optimizer_sharder
state_ = HybridParallelCheckpointIO.gather_from_sharded_optimizer_state(
File "/usr/local/lib/python3.10/site-packages/colossalai/checkpoint_io/hybrid_parallel_checkpoint_io.py", line 873, in gather_from_sharded_optimizer_state
partition_dim = search_tp_partition_dim(current_shape, original_shape, tp_size)
File "/usr/local/lib/python3.10/site-packages/colossalai/checkpoint_io/utils.py", line 116, in search_tp_partition_dim
original_shape[partition_dim] == tp_size * current_shape[partition_dim]
AssertionError: The parameter isn't evenly distributed among tensor parallel group: shape before sharding torch.Size([128256, 4096]), shape after sharding torch.Size([16064, 4096])
Thank you for your reply. We are refactoring this part; you can try earlier versions such as 0.3.1, with which I have run Llama3 models before. Once the refactoring is finished, we will test it and remind you if needed.
Has there been any progress on this issue?
Hi, you can try modifying it according to this PR and retrying.
The key point is that the embedding dimension should not be sharded into 16064 rows; 128256 / 8 = 16032.
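For what it's worth, the 16064 in the error message is consistent with a Megatron-style vocab-padding convention: pad the vocabulary to a multiple of make_vocab_size_divisible_by * tp_size before splitting it across ranks. A minimal sketch of that arithmetic (the helper name is made up for illustration, and the convention itself is an assumption, though a divisor of 64 reproduces the observed 16064 exactly):

import math

def padded_vocab_and_shard(vocab_size, tp_size, make_vocab_size_divisible_by):
    # Assumed convention: pad the vocab to a multiple of divisor * tp_size, then split evenly.
    multiple = make_vocab_size_divisible_by * tp_size
    padded = math.ceil(vocab_size / multiple) * multiple
    return padded, padded // tp_size

print(padded_vocab_and_shard(128256, 8, 64))  # (128512, 16064) -> the shard size seen in the error
print(padded_vocab_and_shard(128256, 8, 32))  # (128256, 16032) -> no padding, even split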
What’s the progress like? |
Set:

plugin = HybridParallelPlugin(
    tp_size=args.tp,
    pp_size=args.pp,
    sp_size=args.sp,
    sequence_parallelism_mode=args.sp_mode,
    zero_stage=args.zero_stage,
    enable_flash_attention=args.use_flash_attn,
    enable_sequence_parallelism=args.enable_sequence_parallelism,
    cpu_offload=True if args.zero_stage >= 1 and args.zero_cpu_offload else False,
    parallel_output=False,
    max_norm=args.grad_clip,
    precision=args.mixed_precision,
    # set make_vocab_size_divisible_by=32
    make_vocab_size_divisible_by=32,
    custom_policy=get_autopolicy(model.model),
)
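With tp_size=8 this makes the padding multiple 32 × 8 = 256, and 128256 / 256 = 501 exactly, so the vocabulary should need no padding: each rank then holds 16032 embedding rows, which should gather back to the full [128256, 8192] weight on save.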
Is there an existing issue for this bug?
🐛 Describe the bug
I trained a reward model based on Llama3.1-70B-instruct on 48 H100s (3D parallelism: tp=8, pp=1).
When executing booster.save_model(model, os.path.join(save_dir, "modeling"), shard=True), the saved model.embed_tokens.weight has shape [16064, 8192] rather than [128256, 8192]. However, the shapes of the other weights are correct. Please help me!
Thank you!
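One way to see the problem directly is to inspect what was written to disk. A minimal check script (a sketch only: it assumes the sharded checkpoint was written as *.safetensors files under the "modeling" directory, and the save_dir path is a placeholder):

import glob
import os
from safetensors import safe_open

save_dir = "path/to/save_dir"  # placeholder, not the actual path used above

for path in sorted(glob.glob(os.path.join(save_dir, "modeling", "*.safetensors"))):
    with safe_open(path, framework="pt", device="cpu") as f:
        for key in f.keys():
            if "embed_tokens" in key:
                # Expected torch.Size([128256, 8192]); the bug produces [16064, 8192]
                print(path, key, f.get_tensor(key).shape)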
Environment
transformers 4.44.1
colossalai 0.4.5
flash-attn 2.6.3