
Add custom resize_token_embeddings method to OLMoForCausalLM (#491) #501

Merged (18 commits, Apr 2, 2024)

Conversation

djliden (Contributor) commented Mar 13, 2024

This PR addresses #491 by adding a custom resize_token_embeddings method that overrides the corresponding method from the PreTrainedModel base class. The revised method resizes self.config.embedding_size and self.model.config.embedding_size rather than vocab_size, which is what the base class's method resizes.

Here's the result with the new method; this code previously raised an error (see #491).

# Install the local OLMo checkout plus accelerate, then restart the notebook's Python process (Databricks)
!pip install -e .
!pip install accelerate
dbutils.library.restartPython()

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import hf_olmo  # registers the OLMo architecture with the transformers Auto* classes

model_ckpt = "allenai/OLMo-1B"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

print("Before resizing:")
print(f"Vocabulary size: {model.config.vocab_size}")
print(f"Embedding size: {model.config.embedding_size}")
print(f"Model embedding dimension: {model.get_input_embeddings().embedding_dim}")

# Add new special tokens, then resize the model's token embeddings to match the enlarged tokenizer
special_tokens = ["<|im_start|>", "<|im_end|>"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
model.resize_token_embeddings(len(tokenizer))

print("\nAfter resizing:")
print(f"Vocabulary size: {model.config.vocab_size}")
print(f"Embedding size: {model.config.embedding_size}")
print(f"Model embedding dimension: {model.get_input_embeddings().embedding_dim}")

input_text = "This is a dummy input text."
input_ids = tokenizer.encode(input_text, return_tensors="pt")
labels = input_ids.clone()

input_ids = input_ids.to(model.device)
labels = labels.to(model.device)

with torch.no_grad():
    outputs = model(input_ids, labels=labels)

print("\nModel run completed without errors.")

Resulting in:

Before resizing:
Vocabulary size: 50280
Embedding size: 50304
Model embedding dimension: 2048

After resizing:
Vocabulary size: 50280
Embedding size: 50282
Model embedding dimension: 2048

Model run completed without errors.

Note that this does not update vocab_size. The intended purpose of keeping vocab_size and embedding_size separate is not entirely clear to me yet, but the intent seems to be to allow them to differ as long as embedding_size > vocab_size. A few further options for addressing this issue more completely include:

  • Update the new method to check whether config.embedding_size is defined and update vocab_size if it isn't; this corresponds to how the wte layer is defined in the first place (link). A minimal sketch of this check follows the list.
  • Add a separate function for resizing the vocabulary.
  • Treat vocab size vs. embedding size discrepancies as a separate issue and review the current change as-is.
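For the first option, a minimal sketch of the fallback check (the helper name is hypothetical and not part of this PR):

def _resolved_embedding_size(config) -> int:
    # Mirror how the wte layer chooses its size: use embedding_size when it is set,
    # otherwise fall back to vocab_size. Illustrative only.
    return getattr(config, "embedding_size", None) or config.vocab_size

The resize method could then update whichever of the two attributes the wte layer was actually built from.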

Please let me know if you have any questions or suggestions regarding this PR.

djliden and others added 4 commits March 13, 2024 14:36
This commit introduces the `resize_token_embeddings` function for the
`OLMoForCausalLM` class. The function is based on the implementation
from the Hugging Face Transformers library, with modifications to
accommodate the specific requirements of the OLMo model.

The `resize_token_embeddings` function allows resizing the input token
embeddings matrix of the model when the number of tokens differs from
the model's `embedding_size` configuration. It updates the `embedding_size`
attribute in both the model configuration and the model itself, ensuring
consistency after resizing.

The function also handles tying the weights of the input and output
embeddings if the model supports weight tying.

Attribution:
The implementation of this function is inspired by and adapted from
the `resize_token_embeddings` function in the Hugging Face Transformers
library (https://github.com/huggingface/transformers). The original code
is licensed under the Apache License 2.0.

Modifications:
- Updated the function to use the `embedding_size` attribute instead of
  `vocab_size` to align with the OLMo model configuration.
- Adjusted the docstring and comments to match the OLMo model's terminology
  and requirements.
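Based on this commit message and the excerpts quoted in the review below, the override has roughly the following shape. This is a condensed sketch, not the merged code verbatim:

from typing import Optional

import torch
from transformers import PreTrainedModel


class OLMoForCausalLM(PreTrainedModel):
    # ... rest of the model class elided ...

    def resize_token_embeddings(
        self, new_num_tokens: Optional[int] = None, pad_to_multiple_of: Optional[int] = None
    ) -> torch.nn.Embedding:
        # Let the PreTrainedModel helper perform the actual matrix resize.
        model_embeds = self._resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
        if new_num_tokens is None and pad_to_multiple_of is None:
            return model_embeds

        # Record the new size in embedding_size (the base class updates vocab_size instead).
        self.config.embedding_size = model_embeds.weight.shape[0]
        self.model.config.embedding_size = model_embeds.weight.shape[0]

        # Re-tie input and output embedding weights if the model ties them.
        self.tie_weights()

        return model_embeds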
djliden (Contributor, Author) commented Mar 13, 2024

@2015aroras @AkshitaB I'm not able to assign reviewers, so flagging for review as requested in #491. Thanks!

@djliden changed the title from "## Add custom resize_token_embeddings method to OLMoForCausalLM (#491)" to "Add custom resize_token_embeddings method to OLMoForCausalLM (#491)" on Mar 13, 2024
2015aroras (Collaborator) left a comment

Vocab size captures how many different tokens a tokenizer can produce. Embedding size captures how many different integers the embedding layer can produce an embedding for. Theoretically speaking, the vocab and embedding sizes would be the same, since the embedding layer should have exactly 1 embedding per token and no extra embeddings.

In practice, an embedding size that is not a multiple of 128 can cause a slight performance degradation (broadly speaking, having factors of 2 seems to be beneficial). Thus we use a bigger embedding size than needed, resulting in embeddings for some integers that are not valid tokens. Hopefully that makes sense.
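To make the padding concrete, here is the arithmetic with the numbers printed earlier in this PR (the exact rounding rule is an illustration; the point is only that embedding_size is deliberately rounded up past vocab_size):

vocab_size = 50280                                   # tokens the tokenizer can actually produce
embedding_size = ((vocab_size + 127) // 128) * 128   # round up to the next multiple of 128
print(embedding_size)                                # 50304, so 24 rows correspond to no real token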

The PR itself looks good! Just requesting some small changes.

    def resize_token_embeddings(
        self, new_num_tokens: Optional[int] = None, pad_to_multiple_of: Optional[int] = None
    ) -> torch.nn.Embedding:
        """
2015aroras (Collaborator) commented

This comment seems to be the exact same as the one from the parent class. If it is, then it's not needed here (code editors and IDEs should be able to show the parent class's comment when one hovers over resize_token_embeddings). Having the comment somewhat implies that we are defining what the method does, but it is the parent doing it.

            return model_embeds

        # Update base model and current model config
        self.config.embedding_size = model_embeds.weight.shape[0]
2015aroras (Collaborator) commented Mar 13, 2024

Could you add a warning for when the new embedding size is less than the vocab size? Something like "Resizing token embeddings to size <size>, which is less than the vocab size <vocab size> of the tokenizer."
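A rough sketch of the requested check as a standalone helper (the function name is hypothetical; in the PR the check lives inside resize_token_embeddings):

import warnings


def warn_if_embeddings_smaller_than_vocab(new_embedding_size: int, vocab_size: int) -> None:
    # Warn when the resized embedding matrix no longer has a row for every token id
    # the tokenizer can produce.
    if new_embedding_size < vocab_size:
        warnings.warn(
            f"Resizing token embeddings to size {new_embedding_size}, which is less than the "
            f"vocab size {vocab_size} of the tokenizer."
        )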

djliden (Contributor, Author) commented Mar 15, 2024

@2015aroras I added a warning if the new embedding size is < the vocab size. I could see an argument for making it an error for consistency with this, or omitting entirely since it will result in an error as soon as you try to do anything with it anyway. Though I think it also works as-is.

I left the comment alone as it does change vocab size to embedding size (so not the same as base class), but please let me know if you'd like it amended regardless. Thanks!

@djliden djliden requested a review from 2015aroras March 15, 2024 15:23
2015aroras (Collaborator) commented

> @2015aroras I added a warning if the new embedding size is < the vocab size. I could see an argument for making it an error for consistency with this, or omitting entirely since it will result in an error as soon as you try to do anything with it anyway. Though I think it also works as-is.

I leaned towards warning instead of error since the user is actively modifying the embedding size. I don't feel strongly about this and resizing embeddings is a relatively uncommon scenario, so any of the 3 options is ok.

> I left the comment alone as it does change vocab size to embedding size (so not the same as base class), but please let me know if you'd like it amended regardless. Thanks!

I didn't notice the vocab size to embedding size change! I think it's ok to leave it as is, though you could describe how the method differs from PreTrainedModel.resize_token_embeddings.

djliden and others added 2 commits March 20, 2024 14:13
Clarifies the difference between the updated method and the base class method
djliden (Contributor, Author) commented Mar 20, 2024

@2015aroras I updated the docstring noting the difference, and I'm happy leaving it as a warning. Thanks!

2015aroras (Collaborator) left a comment

Just small things, otherwise looking good!

f"{self.config.vocab_size} defined in the model configuration. Make sure your tokenizer's vocabulary "
"size is less than or equal to the new token embedding size."
)
warnings.warn(warning_message)
2015aroras (Collaborator) commented

We tend to use log = logging.getLogger(__name__) and log.warning(...) to log warnings now (e.g. the existing log.warning(...) calls elsewhere in the codebase).
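A minimal sketch of that convention applied to the same check (helper name still hypothetical):

import logging

log = logging.getLogger(__name__)


def warn_if_embeddings_smaller_than_vocab(new_embedding_size: int, vocab_size: int) -> None:
    # Same check as before, but routed through the module-level logger instead of warnings.warn.
    if new_embedding_size < vocab_size:
        log.warning(
            "Resizing token embeddings to size %d, which is less than the vocab size %d "
            "defined in the model configuration.",
            new_embedding_size,
            vocab_size,
        )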

self.config.embedding_size = model_embeds.weight.shape[0]
self.model.config.embedding_size = model_embeds.weight.shape[0]

# Check if the embedding size is less than the vocab size
2015aroras (Collaborator) commented

nit: Not enough indentation on this line

Switches warning to log.warning and fixes comment indent
djliden (Contributor, Author) commented Mar 21, 2024

@2015aroras Changed the warning format and fixed the indentation. Thanks!

@djliden djliden requested a review from 2015aroras March 21, 2024 22:23
2015aroras (Collaborator) left a comment

LGTM! Just fix the code style using Ruff and then you can merge this.

djliden (Contributor, Author) commented Mar 22, 2024

@2015aroras done!

djliden (Contributor, Author) commented Mar 23, 2024

@2015aroras Thanks for the reviews/feedback/etc! I don't have write access to merge this PR myself—from my side, both this PR and #503 are ready to go. Please let me know if there's anything else needed, otherwise feel free to merge when you have a chance. Appreciate your help!

2015aroras merged commit 0de5fdc into allenai:main on Apr 2, 2024
11 checks passed