serve_reward_model goes down #351

AtsunoriFujita · 2024-10-18T16:16:08Z

Describe the bug

When we start serve_reward_model.py and run annotation, the server goes down during processing. The crash occurs on specific samples, and these samples have a long context.
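Since the failing samples are the long ones, one way to shortlist candidates is to rank the lines of the input file by length. The following is a minimal sketch, not part of NeMo-Aligner: `longest_records` is a hypothetical helper, it assumes one JSON record per line, and it measures raw character count rather than token count, so it only approximates the context length the reward model actually sees.

```python
import heapq


def longest_records(path, top_k=10):
    """Return (char_length, line_number) pairs for the top_k longest lines.

    Hypothetical helper: length is the raw character count of each line
    (trailing newline excluded), not the tokenized length.
    """
    with open(path, encoding="utf-8") as f:
        lengths = (
            (len(line.rstrip("\n")), lineno)
            for lineno, line in enumerate(f, start=1)
        )
        # nlargest consumes the generator while the file is still open.
        return heapq.nlargest(top_k, lengths)
```

Calling `longest_records("data/oasst/train.jsonl")` on the file used in the reproduction steps would list the line numbers worth checking first.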

error.log

What we did

We built from source, but the issue persists.
We also tried nvidia/Llama2-13B-SteerLM-RM, but ran into the same issue.
It runs without issue on nvcr.io/nvidia/nemo:24.05.01 (the critic speedup in #219 is the main difference).
The estimated processing time has also increased from 2 hours (nvcr.io/nvidia/nemo:24.05.01) to 7 hours (nvcr.io/nvidia/nemo:24.07).

Steps/Code to reproduce bug

export HYDRA_FULL_ERROR=1
export MODEL="/workspace/models/Llama3-70B-SteerLM-RM"

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst

python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
    rm_model_file=${MODEL} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=8 \
    ++model.pipeline_model_parallel_size=1 \
    inference.inference_micro_batch_size=2 \
    inference.port=1424

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
      --input-file=data/oasst/train.jsonl \
      --output-file=data/oasst/train_labeled.jsonl \
      --port=1424

Before running attribute_annotate.py, apply #350.
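To narrow down which records bring the server down, the input file can be split into smaller chunks and annotated one chunk at a time. This is a minimal sketch under the assumption that each line of train.jsonl is an independent record; `split_jsonl` and the `chunk_XXXXX.jsonl` naming are hypothetical and not part of NeMo-Aligner.

```python
from pathlib import Path


def _write_chunk(lines, out_dir, index):
    """Write one chunk to out_dir and return its path."""
    path = out_dir / f"chunk_{index:05d}.jsonl"
    path.write_text("".join(lines), encoding="utf-8")
    return path


def split_jsonl(input_path, output_dir, lines_per_chunk=1000):
    """Split input_path into consecutive chunks; return the chunk paths in order."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    buffer = []
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            buffer.append(line)
            if len(buffer) == lines_per_chunk:
                chunk_paths.append(_write_chunk(buffer, out_dir, len(chunk_paths)))
                buffer = []
    # Flush the final, possibly shorter, chunk.
    if buffer:
        chunk_paths.append(_write_chunk(buffer, out_dir, len(chunk_paths)))
    return chunk_paths
```

Each chunk can then be passed to attribute_annotate.py through the same `--input-file`, `--output-file`, and `--port` options shown above. The chunk on which the server goes down bounds the search, and it can be split again with a smaller `lines_per_chunk`.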

Expected behavior

The process is completed without the server going down.

Environment overview

DGX-C A100 * 8
nvcr.io/nvidia/nemo:24.07

arthrod · 2024-10-29T12:18:45Z

Could you try fixing the micro batch size at 2?

AtsunoriFujita · 2024-11-01T01:57:42Z

I tried several values for micro_batch_size, but none of them solved the issue.

I attached the sample (from oasst) that causes the error:
error_sample.txt

No errors occur with nvcr.io/nvidia/nemo:24.05.01.

AtsunoriFujita · 2024-11-01T09:21:53Z

These are the samples in the oasst dataset that cause errors:
error_samples.txt

AtsunoriFujita added the "bug" label (Something isn't working) on Oct 18, 2024