serve_reward_model goes down #351

Open
AtsunoriFujita opened this issue Oct 18, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

AtsunoriFujita commented Oct 18, 2024

Describe the bug

When we start serve_reward_model.py and run annotation, the server goes down during processing. It crashes on specific samples, all of which have a long context.

error.log

What we did

  • We built from source, but the issue was not resolved.
  • We also tried nvidia/Llama2-13B-SteerLM-RM, but ran into the same issue.
  • It runs without issue on nvcr.io/nvidia/nemo:24.05.01 (critic speedup #219 is the main difference).
  • The estimated processing time has also increased from 2 hours (nvcr.io/nvidia/nemo:24.05.01) to 7 hours (nvcr.io/nvidia/nemo:24.07).

Steps/Code to reproduce bug

export HYDRA_FULL_ERROR=1
export MODEL="/workspace/models/Llama3-70B-SteerLM-RM"

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst

python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
    rm_model_file=${MODEL} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=8 \
    ++model.pipeline_model_parallel_size=1 \
    inference.inference_micro_batch_size=2 \
    inference.port=1424

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
      --input-file=data/oasst/train.jsonl \
      --output-file=data/oasst/train_labeled.jsonl \
      --port=1424

Before running attribute_annotate.py, you should apply #350.
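
One way to apply that PR inside the container (a sketch, assuming /opt/NeMo-Aligner is a git checkout whose origin remote points at the GitHub repository and that the container has network access):

cd /opt/NeMo-Aligner
# Fetch the PR head from GitHub into a local branch and merge it onto the current checkout.
git fetch origin pull/350/head:pr-350
git merge pr-350    # or cherry-pick the PR commits if the merge conflicts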

Expected behavior

The process is completed without the server going down.

Environment overview (please complete the following information)

  • DGX-C A100 * 8
  • nvcr.io/nvidia/nemo:24.07

Environment details

If an NVIDIA docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model

arthrod commented Oct 29, 2024

Could you enforce the micro batch size at 2?

AtsunoriFujita (Author) commented Nov 1, 2024

I tried several values for micro_batch_size, but none of them solved the issue.
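
An illustrative example of this kind of relaunch (the value below is a sketch, not one of the specific settings reported):

python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
    rm_model_file=${MODEL} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=8 \
    ++model.pipeline_model_parallel_size=1 \
    inference.inference_micro_batch_size=1 \
    inference.port=1424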

I attached the sample (from oasst) that causes the error:
error_sample.txt

No errors occur with nvcr.io/nvidia/nemo:24.05.01.
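
To reproduce with only the failing record, one option is to extract it from train.jsonl into a one-line file and run the annotation script against that (a sketch; "<distinctive substring>" is a placeholder to be replaced with a unique phrase from error_sample.txt):

# Pull the failing record into its own jsonl file.
grep -m1 -F "<distinctive substring>" data/oasst/train.jsonl > data/oasst/error_sample.jsonl

# Annotate only that record (with the reward model server from above still running).
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
      --input-file=data/oasst/error_sample.jsonl \
      --output-file=data/oasst/error_sample_labeled.jsonl \
      --port=1424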

AtsunoriFujita (Author) commented
These samples are causing errors in the oasst dataset.
error_samples.txt
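
One possible stopgap until this is fixed (a sketch, not a verified workaround): skip the very long records before annotation by filtering on raw line length. The 20000-character threshold below is an arbitrary illustration, not a measured cutoff.

# Keep only records whose serialized jsonl line is under the (illustrative) threshold.
awk 'length($0) <= 20000' data/oasst/train.jsonl > data/oasst/train_short.jsonl

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
      --input-file=data/oasst/train_short.jsonl \
      --output-file=data/oasst/train_short_labeled.jsonl \
      --port=1424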
