MCore slower than NeMo native implementation #9524

Closed · janEbert opened this issue Jun 24, 2024 · 5 comments
Labels: bug (Something isn't working), stale
janEbert (Contributor) commented on Jun 24, 2024

Describe the bug

I've benchmarked both settings of model.mcore_gpt in an FSDP setup on the two most recent NVIDIA GPU architectures and found model.mcore_gpt=False to be consistently faster (though only slightly on the H100). Note that the A100 and H100 numbers are not meant to be comparable with each other: they were obtained on different systems with a different number of CPU workers, although both used the same software environment.

| mcore | GPU  | avg secs per iter |
|-------|------|-------------------|
| True  | H100 | 0.898143          |
| False | H100 | 0.868319          |
| True  | A100 | 2.6544            |
| False | A100 | 1.71934           |

Steps/Code to reproduce bug


mcore_val=True  # True or False

PER_SPLIT_NUM_WORKERS=5  # 5 for A100, 9 for H100

# Preprocessed Megatron-format dataset prefixes (paths without the .bin/.idx extension).
TRAIN_DATA_PREFIX=my-tiny-c4-gpt2-tok/train_text_document
EVAL_DATA_PREFIX=my-tiny-c4-gpt2-tok/val_text_document
TEST_DATA_PREFIX="$EVAL_DATA_PREFIX"

# GPT-2 BPE tokenizer files.
TOKENIZER_VOCAB_FILE=gpt2-vocab.json
TOKENIZER_MERGE_FILE=gpt2-merges.txt

python -u \
    examples/nlp/language_modeling/megatron_gpt_pretraining.py  \
    --config-path=examples/nlp/language_modeling/conf \
    --config-name=megatron_llama_config \
    trainer.devices=4 \
    trainer.num_nodes=2 \
    trainer.max_steps=100 \
    trainer.log_every_n_steps=1 \
    trainer.val_check_interval=100 \
    +trainer.num_sanity_val_steps=0 \
    trainer.precision=bf16-mixed \
    model.micro_batch_size=1 \
    model.global_batch_size=8 \
    model.mcore_gpt="$mcore_val" \
    +model.fsdp=True \
    +model.fsdp_sharding_strategy=full \
    +model.fsdp_grad_reduce_dtype=32 \
    +model.fsdp_sharded_checkpoint=True \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.sequence_parallel=False \
    +model.use_flash_attention=True \
    model.tokenizer.library=megatron \
    model.tokenizer.type=GPT2BPETokenizer \
    model.tokenizer.model=null \
    model.tokenizer.vocab_file="$TOKENIZER_VOCAB_FILE" \
    model.tokenizer.merge_file="$TOKENIZER_MERGE_FILE" \
    +model.data.data_prefix=\{train:\[1.0,"$TRAIN_DATA_PREFIX"\],validation:\[1.0,"$EVAL_DATA_PREFIX"\],test:\[1.0,"$TEST_DATA_PREFIX"\]\} \
    model.data.num_workers="$PER_SPLIT_NUM_WORKERS" \
    exp_manager.name="megatron_llama_my-tiny-c4-gpt2-tok_mcore-\${model.mcore_gpt}"
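
To compare the two settings under identical conditions, a small wrapper along these lines can be used (a sketch only: it assumes the command above is saved as run_pretraining.sh, a hypothetical name, and that the hard-coded mcore_val assignment is changed to read the value from the script's first argument):

# Hypothetical wrapper around the command above: run it once per
# model.mcore_gpt setting and keep the logs side by side.
for mcore_val in True False; do
    bash run_pretraining.sh "$mcore_val" 2>&1 \
        | tee "megatron_llama_mcore-${mcore_val}.log"
done

Both runs then share the same dataset, tokenizer files, and node allocation, so only model.mcore_gpt differs between the two logs.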

Expected behavior

Since the model.mcore_gpt=False path has been deprecated, I would expect the model.mcore_gpt=True path to be at least on par with it in performance. On the A100, however, the model.mcore_gpt=True numbers are substantially worse.

Environment overview (please complete the following information)

  • Environment location: Apptainer, using the NVIDIA PyTorch 24.05 Docker container with no modifications inside the container, plus a venv outside the container.
  • Method of NeMo install: pip install from source. git clone https://github.com/NVIDIA/NeMo.git && cd NeMo && git checkout dda92f00de2785de46983d7aa4ac77cbb1b353ec && python -m pip install .[all]
  • Method of Megatron-LM install: pip install from source. git clone https://github.com/NVIDIA/Megatron-LM.git && cd Megatron-LM && git checkout a645f89671be698612170539f2089dc15db66a80 && python -m pip install .

Additional context

  • The A100 is the 40 GB version.
  • The H100 is the 64 GB version.
  • Throughput benchmarking excluded 10 warmup steps and averaged over the remaining 91 iterations (a rough post-processing sketch follows below).
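
As a rough illustration of that averaging, assuming the per-iteration times have already been extracted from the trainer logs into a plain text file with one value per line (step_times.txt is a hypothetical name; NeMo does not write such a file by itself):

# Hypothetical post-processing: drop the first 10 warmup iterations and
# average the remaining per-iteration times (91 samples in the runs above).
awk 'NR > 10 { sum += $1; n++ }
     END { if (n) printf "avg secs per iter: %.6f (over %d samples)\n", sum / n, n }' step_times.txt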
janEbert added the bug label on Jun 24, 2024
github-actions bot commented:
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Aug 15, 2024
janEbert (Contributor, Author) commented:
Please take a look.

github-actions bot removed the stale label on Aug 17, 2024
ericharper (Collaborator) commented:
Could you try NeMo 2.0 + FSDP and compare? We're not planning to support NeMo 1.0 + FSDP.

github-actions bot commented:
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Sep 23, 2024
github-actions bot commented:
This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 30, 2024