[Proposal] Compatibility for OLMo and OLMo2? #804
Comments
I would just like to express my enthusiastic endorsement of this proposal. I poked at the implementation a little bit and thought I would share some of what that revealed. It seems to me that OLMo-1 and OLMo-2 follow Llama-2 quite closely. For example, in convert_hf_model_config() inside loading_from_pretrained.py, something along the lines of the sketch below might make sense.
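(A rough sketch only: it would slot into the existing if/elif chain on the architecture, and I haven't verified the config keys or the HF attribute names against the actual OLMo-2 configs. The post-norm placement and QK-norm would also need support elsewhere in TransformerLens.)

```python
# Hypothetical new branch in convert_hf_model_config(); values are read from the HF
# config object where possible, and the flags mirror the existing Llama-2 branch.
elif architecture == "Olmo2ForCausalLM":
    cfg_dict = {
        "d_model": hf_config.hidden_size,
        "d_head": hf_config.hidden_size // hf_config.num_attention_heads,
        "n_heads": hf_config.num_attention_heads,
        "d_mlp": hf_config.intermediate_size,
        "n_layers": hf_config.num_hidden_layers,
        "n_ctx": hf_config.max_position_embeddings,
        "eps": hf_config.rms_norm_eps,
        "d_vocab": hf_config.vocab_size,  # 100352 for OLMo-2, vs 32000 for Llama-2
        "act_fn": "silu",
        "normalization_type": "RMS",
        "positional_embedding_type": "rotary",
        "rotary_base": hf_config.rope_theta,
        "rotary_dim": hf_config.hidden_size // hf_config.num_attention_heads,
        "final_rms": True,
        "gated_mlp": True,
        # OLMo-2's post-attention/post-feedforward norm placement and its q_norm/k_norm
        # have no direct flag here yet; that is the part that needs real design work.
    }
```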
Then we would probably need a new pretrained/weight_conversions/olmo2.py file. A nuance here seems to be that, when loaded with the newest version of HuggingFace transformers (at the time of writing), an Olmo2ForCausalLM object looks like the following.
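(Abridged sketch of the module tree rather than a verbatim repr; exact class names, dimensions, and module order may differ across transformers versions.)

```
Olmo2ForCausalLM(
  (model): Olmo2Model(
    (embed_tokens): Embedding(100352, d_model)
    (layers): ModuleList(
      (0-N): N+1 x Olmo2DecoderLayer(
        (self_attn): Olmo2Attention(
          (q_proj): Linear(...)
          (k_proj): Linear(...)
          (v_proj): Linear(...)
          (o_proj): Linear(...)
          (q_norm): Olmo2RMSNorm(...)
          (k_norm): Olmo2RMSNorm(...)
          (rotary_emb): Olmo2RotaryEmbedding()
        )
        (mlp): Olmo2MLP(
          (gate_proj): Linear(...)
          (up_proj): Linear(...)
          (down_proj): Linear(...)
          (act_fn): SiLU()
        )
        (post_attention_layernorm): Olmo2RMSNorm(...)
        (post_feedforward_layernorm): Olmo2RMSNorm(...)
      )
    )
    (norm): Olmo2RMSNorm(...)
  )
  (lm_head): Linear(..., 100352)
)
```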
Whereas a LlamaForCausalLM object from 'meta-llama/Llama-2-7b-hf' looks like the following.
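(Again abridged and from memory rather than a verbatim repr; the attention class may print as LlamaSdpaAttention depending on the attention implementation.)

```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm(...)
        (post_attention_layernorm): LlamaRMSNorm(...)
      )
    )
    (norm): LlamaRMSNorm(...)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
```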
So, for example, OLMo-2 has (post_attention_layernorm) and (post_feedforward_layernorm) at every layer, as opposed to (input_layernorm) and (post_attention_layernorm) for Llama-2. It also has additional (rotary_emb), (q_norm), and (k_norm) modules in every self_attn, which Llama-2 does not, while missing the model-wide (rotary_emb) that Llama-2 has. There is also the vocabulary size of 100352 in OLMo-2 vs. 32000 in Llama-2. Finally, Olmo2RMSNorm and LlamaRMSNorm both seem to be equivalent to T5LayerNorm. I'm tempted to give a PR a shot, but I'm not sure I know enough about TransformerLens. Is there anyone who could bridge the gap?
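In case it helps, here is a very rough sketch of what the weight-conversion file mentioned above might look like, cribbed from the pattern in the existing llama.py. I have not run this; in particular, OLMo-2's post-sublayer norm placement and its q_norm/k_norm do not map cleanly onto the current ln1/ln2 pre-norm layout, and I'm ignoring any grouped-query attention handling, so treat it as a starting point only.

```python
# Hypothetical pretrained/weight_conversions/olmo2.py, modeled on the existing llama.py.
import einops
import torch

from transformer_lens.HookedTransformerConfig import HookedTransformerConfig


def convert_olmo2_weights(olmo2, cfg: HookedTransformerConfig):
    state_dict = {}
    state_dict["embed.W_E"] = olmo2.model.embed_tokens.weight

    for l in range(cfg.n_layers):
        attn = olmo2.model.layers[l].self_attn

        # Reshape HF's (n_heads * d_head, d_model) projections into TransformerLens's
        # per-head layout, as the Llama conversion does.
        state_dict[f"blocks.{l}.attn.W_Q"] = einops.rearrange(
            attn.q_proj.weight, "(n h) m -> n m h", n=cfg.n_heads
        )
        state_dict[f"blocks.{l}.attn.W_K"] = einops.rearrange(
            attn.k_proj.weight, "(n h) m -> n m h", n=cfg.n_heads
        )
        state_dict[f"blocks.{l}.attn.W_V"] = einops.rearrange(
            attn.v_proj.weight, "(n h) m -> n m h", n=cfg.n_heads
        )
        state_dict[f"blocks.{l}.attn.W_O"] = einops.rearrange(
            attn.o_proj.weight, "m (n h) -> n h m", n=cfg.n_heads
        )
        state_dict[f"blocks.{l}.attn.b_Q"] = torch.zeros(cfg.n_heads, cfg.d_head, dtype=cfg.dtype)
        state_dict[f"blocks.{l}.attn.b_K"] = torch.zeros(cfg.n_heads, cfg.d_head, dtype=cfg.dtype)
        state_dict[f"blocks.{l}.attn.b_V"] = torch.zeros(cfg.n_heads, cfg.d_head, dtype=cfg.dtype)
        state_dict[f"blocks.{l}.attn.b_O"] = torch.zeros(cfg.d_model, dtype=cfg.dtype)

        mlp = olmo2.model.layers[l].mlp
        state_dict[f"blocks.{l}.mlp.W_gate"] = mlp.gate_proj.weight.T
        state_dict[f"blocks.{l}.mlp.W_in"] = mlp.up_proj.weight.T
        state_dict[f"blocks.{l}.mlp.W_out"] = mlp.down_proj.weight.T
        state_dict[f"blocks.{l}.mlp.b_in"] = torch.zeros(cfg.d_mlp, dtype=cfg.dtype)
        state_dict[f"blocks.{l}.mlp.b_out"] = torch.zeros(cfg.d_model, dtype=cfg.dtype)

        # Placeholder mapping: these norms are applied *after* the sublayers in OLMo-2,
        # not before them as ln1/ln2 are in TransformerLens, so this is not yet correct.
        state_dict[f"blocks.{l}.ln1.w"] = olmo2.model.layers[l].post_attention_layernorm.weight
        state_dict[f"blocks.{l}.ln2.w"] = olmo2.model.layers[l].post_feedforward_layernorm.weight
        # q_norm / k_norm have no existing TransformerLens counterpart and are dropped here.

    state_dict["ln_final.w"] = olmo2.model.norm.weight
    state_dict["unembed.W_U"] = olmo2.lm_head.weight.T
    state_dict["unembed.b_U"] = torch.zeros(cfg.d_vocab, dtype=cfg.dtype)
    return state_dict
```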
Actually, it looks like #718 as well as https://github.com/jonasrohw/TransformerLens/tree/OLMo are closely related.
Proposal
It would be nice to include OLMo (1B and 7B) and their training checkpoints among the models compatible with HookedTransformer.
Motivation
OLMo-1B would be a great model for mechanistic interpretability work, especially as it is fully open: the training data, training code, and intermediate checkpoints are all released, which would let us relate training data and process, checkpoints, and model performance. Its architecture should be fairly similar to models that are already compatible. If it is already possible to get it running, I would really appreciate a link to some information; I've tried to look through the documentation myself in the meantime.
Pitch
Add OLMo-1B, -7B. Add OLMo2-7B and -13B. Add model checkpoints?
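For concreteness, once supported the usage might look something like the sketch below. The model names and the revision string are illustrative guesses on my part (they would need to be checked against the HF Hub), though from_pretrained does already accept a preloaded hf_model.

```python
# Hypothetical usage once OLMo support lands.
from transformers import AutoModelForCausalLM

from transformer_lens import HookedTransformer

# Final checkpoint
model = HookedTransformer.from_pretrained("allenai/OLMo-1B-hf")

# An intermediate training checkpoint, loaded via an HF revision and passed in explicitly
hf_model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-1B-hf",
    revision="step1000-tokens4B",  # illustrative revision name
)
model = HookedTransformer.from_pretrained("allenai/OLMo-1B-hf", hf_model=hf_model)
```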
Checklist