MCore slower than NeMo native implementation #9524

Closed · janEbert opened this issue Jun 24, 2024 · 5 comments
Labels: bug (Something isn't working), stale
janEbert (Contributor) commented on Jun 24, 2024

Describe the bug

I've benchmarked both settings of model.mcore_gpt in an FSDP setup on the two most recent NVIDIA GPU architectures and found model.mcore_gpt=False to be consistently faster (though only slightly on the H100). Note that the A100 and H100 numbers are not meant to be comparable with each other: they were obtained on different systems with a different number of CPU workers, although both used the same software environment.

| mcore | GPU  | avg secs per iter |
|-------|------|-------------------|
| True  | H100 | 0.898143          |
| False | H100 | 0.868319          |
| True  | A100 | 2.6544            |
| False | A100 | 1.71934           |

Steps/Code to reproduce bug


mcore_val=True  # True or False

PER_SPLIT_NUM_WORKERS=5  # 5 for A100, 9 for H100

# Preprocessed Megatron-format dataset prefixes (paths without the .bin/.idx extension).
TRAIN_DATA_PREFIX=my-tiny-c4-gpt2-tok/train_text_document
EVAL_DATA_PREFIX=my-tiny-c4-gpt2-tok/val_text_document
TEST_DATA_PREFIX="$EVAL_DATA_PREFIX"

# GPT-2 BPE tokenizer files.
TOKENIZER_VOCAB_FILE=gpt2-vocab.json
TOKENIZER_MERGE_FILE=gpt2-merges.txt

python -u \
    examples/nlp/language_modeling/megatron_gpt_pretraining.py  \
    --config-path=examples/nlp/language_modeling/conf \
    --config-name=megatron_llama_config \
    trainer.devices=4 \
    trainer.num_nodes=2 \
    trainer.max_steps=100 \
    trainer.log_every_n_steps=1 \
    trainer.val_check_interval=100 \
    +trainer.num_sanity_val_steps=0 \
    trainer.precision=bf16-mixed \
    model.micro_batch_size=1 \
    model.global_batch_size=8 \
    model.mcore_gpt="$mcore_val" \
    +model.fsdp=True \
    +model.fsdp_sharding_strategy=full \
    +model.fsdp_grad_reduce_dtype=32 \
    +model.fsdp_sharded_checkpoint=True \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.sequence_parallel=False \
    +model.use_flash_attention=True \
    model.tokenizer.library=megatron \
    model.tokenizer.type=GPT2BPETokenizer \
    model.tokenizer.model=null \
    model.tokenizer.vocab_file="$TOKENIZER_VOCAB_FILE" \
    model.tokenizer.merge_file="$TOKENIZER_MERGE_FILE" \
    +model.data.data_prefix=\{train:\[1.0,"$TRAIN_DATA_PREFIX"\],validation:\[1.0,"$EVAL_DATA_PREFIX"\],test:\[1.0,"$TEST_DATA_PREFIX"\]\} \
    model.data.num_workers="$PER_SPLIT_NUM_WORKERS" \
    exp_manager.name="megatron_llama_my-tiny-c4-gpt2-tok_mcore-\${model.mcore_gpt}"
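
To compare the two settings under identical conditions, a small wrapper along these lines can be used (a sketch only: it assumes the command above is saved as run_pretraining.sh, a hypothetical name, and that the hard-coded mcore_val assignment is changed to read the value from the script's first argument):

# Hypothetical wrapper around the command above: run it once per
# model.mcore_gpt setting and keep the logs side by side.
for mcore_val in True False; do
    bash run_pretraining.sh "$mcore_val" 2>&1 \
        | tee "megatron_llama_mcore-${mcore_val}.log"
done

Both runs then share the same dataset, tokenizer files, and node allocation, so only model.mcore_gpt differs between the two logs.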

Expected behavior

Since the model.mcore_gpt=False path has been deprecated, I would expect the model.mcore_gpt=True path to be at least on par with it in performance. On the A100, however, the model.mcore_gpt=True numbers are substantially worse.

Environment overview (please complete the following information)

  • Environment location: Apptainer, using the NVIDIA PyTorch 24.05 Docker container with no modifications inside the container, plus a venv outside the container.
  • Method of NeMo install: pip install from source. git clone https://github.com/NVIDIA/NeMo.git && cd NeMo && git checkout dda92f00de2785de46983d7aa4ac77cbb1b353ec && python -m pip install .[all]
  • Method of Megatron-LM install: pip install from source. git clone https://github.com/NVIDIA/Megatron-LM.git && cd Megatron-LM && git checkout a645f89671be698612170539f2089dc15db66a80 && python -m pip install .

Additional context

  • The A100 is the 40 GB version.
  • The H100 is the 64 GB version.
  • Throughput benchmarking excluded 10 warmup steps and averaged over the remaining 91 iterations (a rough post-processing sketch follows below).
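
As a rough illustration of that averaging, assuming the per-iteration times have already been extracted from the trainer logs into a plain text file with one value per line (step_times.txt is a hypothetical name; NeMo does not write such a file by itself):

# Hypothetical post-processing: drop the first 10 warmup iterations and
# average the remaining per-iteration times (91 samples in the runs above).
awk 'NR > 10 { sum += $1; n++ }
     END { if (n) printf "avg secs per iter: %.6f (over %d samples)\n", sum / n, n }' step_times.txt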
janEbert added the bug label on Jun 24, 2024
github-actions bot commented:
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Aug 15, 2024
janEbert (Contributor, Author) commented:
Please take a look.

github-actions bot removed the stale label on Aug 17, 2024
ericharper (Collaborator) commented:
Could you try NeMo 2.0 + FSDP and compare? We're not planning to support NeMo 1.0 + FSDP.

github-actions bot commented:
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Sep 23, 2024
github-actions bot commented:
This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 30, 2024