Add custom resize_token_embeddings method to OLMoForCausalLM (#491) #501
Conversation
This commit introduces the `resize_token_embeddings` function for the `OLMoForCausalLM` class. The function is based on the implementation from the Hugging Face Transformers library, with modifications to accommodate the specific requirements of the OLMo model.

The `resize_token_embeddings` function allows resizing the input token embeddings matrix of the model when the number of tokens differs from the model's `embedding_size` configuration. It updates the `embedding_size` attribute in both the model configuration and the model itself, ensuring consistency after resizing. The function also handles tying the weights of the input and output embeddings if the model supports weight tying.

Attribution: The implementation of this function is inspired by and adapted from the `resize_token_embeddings` function in the Hugging Face Transformers library (https://github.com/huggingface/transformers). The original code is licensed under the Apache License 2.0.

Modifications:
- Updated the function to use the `embedding_size` attribute instead of `vocab_size` to align with the OLMo model configuration.
- Adjusted the docstring and comments to match the OLMo model's terminology and requirements.
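A minimal sketch of the shape such an override takes, based on the description above (simplified; the actual code in the PR may differ in details):

```python
from typing import Optional

import torch
from transformers import PreTrainedModel


class OLMoForCausalLM(PreTrainedModel):
    # ... other methods omitted for brevity ...

    def resize_token_embeddings(
        self, new_num_tokens: Optional[int] = None, pad_to_multiple_of: Optional[int] = None
    ) -> torch.nn.Embedding:
        # Delegate the actual resizing to the PreTrainedModel machinery.
        model_embeds = self._resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
        if new_num_tokens is None and pad_to_multiple_of is None:
            return model_embeds

        # Keep `embedding_size` (rather than `vocab_size`) in sync on both configs.
        self.config.embedding_size = model_embeds.weight.shape[0]
        self.model.config.embedding_size = model_embeds.weight.shape[0]

        # Re-tie input/output embeddings if the model ties weights.
        self.tie_weights()

        return model_embeds
```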
@2015aroras @AkshitaB I'm not able to assign reviewers, so flagging for review as requested in #491. Thanks!
Vocab size captures how many different tokens a tokenizer can produce. Embedding size captures how many different integers the embedding layer can produce an embedding for. Theoretically speaking, the vocab and embedding sizes would be the same, since the embedding layer should have exactly 1 embedding per token and no extra embeddings.
In practice, not having an embedding size that is a multiple of 128 can cause slight performance degradation (broadly speaking, having factors of 2 seems to be beneficial). Thus we use a bigger embedding size than needed, resulting in there being embeddings for some integers that are not valid tokens. Hopefully that makes sense.
The PR itself looks good! Just requesting some small changes.
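For concreteness, a small sketch of the rounding being described (the vocabulary size here is just an example):

```python
def pad_to_multiple(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the next multiple of `multiple`."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

# A 50,280-token vocabulary would get a 50,304-entry embedding table;
# the extra 24 rows correspond to integers that are not valid tokens.
assert pad_to_multiple(50280) == 50304
```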
def resize_token_embeddings(
    self, new_num_tokens: Optional[int] = None, pad_to_multiple_of: Optional[int] = None
) -> torch.nn.Embedding:
    """
This comment seems to be the exact same as the one from the parent class. If it is, then it's not needed here (code editors and IDEs should be able to show the parent class's comment when one hovers over `resize_token_embeddings`). Having the comment somewhat implies that we are defining what the method does, but it is the parent doing it.
return model_embeds

# Update base model and current model config
self.config.embedding_size = model_embeds.weight.shape[0]
Could you add a warning for when the new embedding size is less than the vocab size? Something like "Resizing token embeddings to size <size>, which is less than the vocab size <vocab size> of the tokenizer."
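A sketch of roughly what that check could look like, pulled out as a standalone helper (the helper name and exact message are hypothetical; the PR's actual wording appears in the diff fragments further down):

```python
import warnings


def check_embedding_size(new_embedding_size: int, vocab_size: int) -> None:
    # Warn when the resized embedding matrix can no longer hold
    # every token id the tokenizer can produce.
    if new_embedding_size < vocab_size:
        warnings.warn(
            f"Resizing token embeddings to size {new_embedding_size}, which is less than "
            f"the vocab size {vocab_size} defined in the model configuration."
        )
```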
@2015aroras I added a warning if the new embedding size is < the vocab size. I could see an argument for making it an error for consistency with this, or omitting entirely since it will result in an error as soon as you try to do anything with it anyway. Though I think it also works as-is. I left the comment alone as it does change vocab size to embedding size (so not the same as base class), but please let me know if you'd like it amended regardless. Thanks!
I leaned towards warning instead of error since the user is actively modifying the embedding size. I don't feel strongly about this and resizing embeddings is a relatively uncommon scenario, so any of the 3 options is ok.
I didn't notice the vocab size to embedding size change! I think it's ok to leave it as is, though you could describe how the method differs from the base class method.
clarifies difference between the updated method and the base class method
@2015aroras I updated the docstring noting the difference, and I'm happy leaving it as a warning. Thanks!
Just small things, otherwise looking good!
hf_olmo/modeling_olmo.py
f"{self.config.vocab_size} defined in the model configuration. Make sure your tokenizer's vocabulary " | ||
"size is less than or equal to the new token embedding size." | ||
) | ||
warnings.warn(warning_message) |
We tend to use `log = logging.getLogger(__name__)` and `log.warning(...)` to log warnings now (e.g. the `log.warning(` call at line 45 in 417af0e).
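That convention is the standard module-level logger; a minimal illustration (the message here is just for demonstration):

```python
import logging

log = logging.getLogger(__name__)

# Module-level logger, matching the repo's convention for warnings:
log.warning("Resizing token embeddings below the tokenizer's vocab size.")
```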
hf_olmo/modeling_olmo.py
self.config.embedding_size = model_embeds.weight.shape[0]
self.model.config.embedding_size = model_embeds.weight.shape[0]

# Check if the embedding size is less than the vocab size
nit: Not enough indentation on this line
switching warning to log.warning and fixes comment indent
@2015aroras Changed the warning format and fixed the indentation. Thanks!
LGTM! Just fix the code style using Ruff and then you can merge this.
@2015aroras done!
@2015aroras Thanks for the reviews/feedback/etc! I don't have write access to merge this PR myself—from my side, both this PR and #503 are ready to go. Please let me know if there's anything else needed, otherwise feel free to merge when you have a chance. Appreciate your help!
This PR addresses #491 by adding a custom `resize_token_embeddings` method to override the corresponding method from the `PreTrainedModel` base class. The revised method resizes `self.config.embedding_size` and `self.model.config.embedding_size` rather than resizing the `vocab_size` as done by the base class's method.

Here's the result with the new method; previously this resulted in an error (see #491).
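A sketch of the kind of call this enables (the import path, checkpoint name, and new size are illustrative, not taken from the PR):

```python
from hf_olmo import OLMoForCausalLM  # illustrative import path

model = OLMoForCausalLM.from_pretrained("allenai/OLMo-1B")  # illustrative checkpoint

# Grow the input embedding matrix; the override keeps embedding_size in sync.
new_embeddings = model.resize_token_embeddings(new_num_tokens=50432)
print(new_embeddings.weight.shape[0])   # 50432
print(model.config.embedding_size)      # 50432, updated by the new method
```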
Note that this does not update the `vocab_size`. The intended purpose of having separate `vocab_size` and `embedding_size` is not entirely clear to me yet, but it seems like the intent is to allow separate `vocab_size` and `embedding_size`, as long as `embedding_size > vocab_size`. A few further options for addressing this issue more completely include: resizing `embedding_size` when `config.embedding_size` is defined, and updating `vocab_size` if it isn't. This corresponds to how the `wte` layer is defined in the first place (link).
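For reference, the `wte` sizing being referred to is roughly of this form (paraphrased; treat the exact expression and attribute names as an assumption rather than a quote of the OLMo source):

```python
import torch.nn as nn


def build_wte(config) -> nn.Embedding:
    # Size the embedding table by embedding_size when it is set,
    # otherwise fall back to vocab_size.
    return nn.Embedding(config.embedding_size or config.vocab_size, config.d_model)
```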
Please let me know if you have any questions or suggestions regarding this PR.