
Basic questions about using the repo for Llama and Mistral models #209

NamburiSrinath opened this issue Dec 11, 2024 · 1 comment

@NamburiSrinath

Hi team,

Great work!! Really appreciate you bringing so many details together. As a beginner, I have some basic questions, so I appreciate your patience.

If I understand it correctly, an LM can be evaluated

  • Either as a classifier, depending on how it was trained: if it is a standard reward model, we use run_rm.py, and if it was trained with DPO, we use run_dpo.py
  • Or by generating responses from the LM and having another LM (e.g., GPT-4) act as a judge over those responses: run_generative.py
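
For concreteness, I believe the classifier-style runs would look roughly like the commands below (the flag name is my assumption based on the run_generative.py commands further down, and I guess run_dpo.py also needs a reference model, so please correct me if the arguments differ):

python scripts/run_rm.py --model=<reward-model-checkpoint>
python scripts/run_dpo.py --model=<dpo-trained-checkpoint>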

Now, if my understanding is correct, Llama models (e.g., meta-llama/Llama-2-7b-chat-hf) are trained using a reward model, and Mistral models (e.g., mistralai/Mistral-7B-Instruct-v0.2) are trained using DPO.

Q: So my first question is: should run_rm.py be used for Llama models and run_dpo.py for Mistral models?

Q: If, instead of evaluating in classifier fashion, I plan to evaluate using the generative approach (which I believe more truly reflects the capability of the LM), should the commands below work without providing any chat templates?

python scripts/run_generative.py --model=meta-llama/Llama-2-7b-chat-hf --force_local
python scripts/run_generative.py --model=mistralai/Mistral-7B-Instruct-v0.2 --force_local

I'm aware that the main results in the paper (except for Table 8) are for classifier-based reward models; as such, I expect the final results from run_generative.py to be lower, since the generative approach is more challenging.

@natolambert
Collaborator

The big distinction is the type of LM it is. You can think about this with the HuggingFace Transformers abstractions. Most reward models are trained with AutoModelForSequenceClassification, which adds a value head that outputs a single logit -- these are run with run_rm (along with similar models that have slightly different architectures). A standard generative model is AutoModelForCausalLM, which is run with run_generative.
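
To make that concrete, here is a rough sketch of the two abstractions (model names are just placeholders, and this is not the repo's exact code):

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Classifier-style reward model: a value head yields one scalar logit per
# sequence, which classifier-style evaluation compares across chosen/rejected.
rm_name = "some-org/some-reward-model"  # placeholder checkpoint
rm_tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name, num_labels=1)
rm_inputs = rm_tok("prompt plus candidate response", return_tensors="pt")
with torch.no_grad():
    reward = rm(**rm_inputs).logits[0, 0]  # single scalar reward

# Standard generative model: produces text, which generative evaluation then
# scores with an LLM-as-a-judge.
gen_name = "mistralai/Mistral-7B-Instruct-v0.2"
gen_tok = AutoTokenizer.from_pretrained(gen_name)
gen = AutoModelForCausalLM.from_pretrained(gen_name)
out = gen.generate(**gen_tok("prompt", return_tensors="pt"), max_new_tokens=64)
print(gen_tok.decode(out[0], skip_special_tokens=True))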

run_dpo is largely deprecated, but it is designed to run a DPO-trained model as an implicit RM. These models can also be run with run_generative, but the usage as an RM is different (generative == LLM-as-a-judge).
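
For intuition, the implicit reward of a DPO-trained model is the beta-scaled log-probability ratio between the policy and its reference model. A minimal sketch of that idea (not the repo's exact code; the reference checkpoint and beta are assumptions, and a real implementation would score only the response tokens conditioned on the prompt):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

policy_name = "mistralai/Mistral-7B-Instruct-v0.2"  # DPO-trained policy
ref_name = "mistralai/Mistral-7B-v0.1"              # assumed reference model
tok = AutoTokenizer.from_pretrained(policy_name)    # assumes shared tokenizer
policy = AutoModelForCausalLM.from_pretrained(policy_name)
ref = AutoModelForCausalLM.from_pretrained(ref_name)

def sequence_logprob(model, text):
    # Sum of token log-probabilities of `text` under `model`.
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[:, :-1]
    logps = torch.log_softmax(logits, dim=-1)
    targets = ids[:, 1:]
    return logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1).sum()

beta = 0.1  # assumed DPO beta

def implicit_reward(prompt, response):
    # Higher when the policy prefers the response more than the reference does.
    text = prompt + response
    return beta * (sequence_logprob(policy, text) - sequence_logprob(ref, text))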

Hope that helps @NamburiSrinath
