Hi team,
Great work!! I really appreciate you bringing so many details together. As a beginner, I have some basic questions, so I appreciate your patience.
If I understand correctly, an LM can be evaluated in one of two ways:
- As a classifier, depending on how it was trained: if a standard reward model is used, we run run_rm.py, and if DPO is used, we run run_dpo.py.
- Generatively: generate responses from this LM, then use another LM (GPT-4 or similar) as a judge to evaluate those responses, with run_generative.py.
Now, if my understanding is correct, Llama models (e.g. meta-llama/Llama-2-7b-chat-hf) are trained using a reward model and Mistral models (e.g. mistralai/Mistral-7B-Instruct-v0.2) are trained using DPO.
Q: So my first question is: can run_rm.py be used for Llama models and run_dpo.py for Mistral models?
Q: If, instead of evaluating in classifier fashion, I evaluate using the generative approach, which will truly show the capability of the LM, I believe the commands below should work without providing any chat templates. Is that right?
I'm aware that the main results in the paper (except for Table 8) are for classifier-based reward models, and as such the final results when using run_generative.py will be lower, since the generative approach is more challenging.
The big distinction is what type of LM it is. You can think about this in terms of HuggingFace Transformers abstractions. Most reward models are trained with AutoModelForSequenceClassification, which adds a value head that outputs one logit; these are run with run_rm (along with similar models with slightly different architectures). A standard generative model is an AutoModelForCausalLM, which is run with run_generative.
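To make the classifier-style path a bit more concrete, here is a minimal sketch of how one scalar score per response could be turned into an accuracy over preference pairs. This is only an illustration under assumptions, not the actual run_rm code: `pairwise_accuracy` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for the reward model's forward pass (the single logit from the value head).

```python
from typing import Callable, List, Tuple

# (prompt, chosen response, rejected response)
Pair = Tuple[str, str, str]


def pairwise_accuracy(score: Callable[[str, str], float], pairs: List[Pair]) -> float:
    """Fraction of pairs where the chosen response scores above the rejected one."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    wins = sum(1 for prompt, chosen, rejected in pairs
               if score(prompt, chosen) > score(prompt, rejected))
    return wins / len(pairs)


def toy_score(prompt: str, response: str) -> float:
    # Hypothetical stand-in for a reward model: longer responses score higher.
    # A real classifier RM would return the single logit from its value head.
    return float(len(response))


pairs = [
    ("What is 2 + 2?", "2 + 2 equals 4.", "5"),
    ("Name a primary color.", "Red", "Purple is one."),
]
accuracy = pairwise_accuracy(toy_score, pairs)
print(accuracy)
```

The length-based scorer gets the second pair wrong on purpose, which shows why the score has to come from a trained reward model rather than a surface heuristic.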
run_dpo is largely deprecated, but it is designed to run a DPO-trained model as an implicit RM. These models can also be run with run_generative, but their usage as an RM is different (generative == LLM as a judge).
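For readers wondering what "implicit RM" means here: in the standard DPO formulation, the reward of a response is proportional to the log-probability ratio between the trained policy and its reference model, r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)). The sketch below assumes that formulation and uses made-up per-token log-probabilities; `implicit_reward` is a hypothetical helper, not a function from run_dpo.

```python
from typing import Sequence


def implicit_reward(policy_token_logprobs: Sequence[float],
                    ref_token_logprobs: Sequence[float],
                    beta: float = 0.1) -> float:
    """beta * (log pi(y|x) - log pi_ref(y|x)), summing log-probs over response tokens."""
    policy_logprob = sum(policy_token_logprobs)
    ref_logprob = sum(ref_token_logprobs)
    return beta * (policy_logprob - ref_logprob)


# Made-up per-token log-probs for illustration only.
# The policy raised the likelihood of the chosen response relative to the reference...
chosen_reward = implicit_reward([-0.2, -0.3], [-0.5, -0.6])
# ...and lowered the likelihood of the rejected one.
rejected_reward = implicit_reward([-1.0, -1.2], [-0.7, -0.8])

prefers_chosen = chosen_reward > rejected_reward
print(chosen_reward, rejected_reward, prefers_chosen)
```

Note that this needs the reference model as well as the DPO-trained policy, which is one way it differs from prompting the same model as a judge.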