
Add custom resize_token_embeddings method to OLMoForCausalLM (#491) #501

Merged (18 commits, Apr 2, 2024)

Conversation

djliden (Contributor) commented Mar 13, 2024

This PR addresses #491 by adding a custom resize_token_embeddings method that overrides the corresponding method from the PreTrainedModel base class. The revised method resizes self.config.embedding_size and self.model.config.embedding_size rather than vocab_size, which is what the base class's method resizes.

Here's the result with the new method; this code previously raised an error (see #491).

# Install the local OLMo checkout plus accelerate, then restart the notebook's Python process (Databricks)
!pip install -e .
!pip install accelerate
dbutils.library.restartPython()

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import hf_olmo  # registers the OLMo architecture with the transformers Auto* classes

model_ckpt = "allenai/OLMo-1B"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

print("Before resizing:")
print(f"Vocabulary size: {model.config.vocab_size}")
print(f"Embedding size: {model.config.embedding_size}")
print(f"Model embedding dimension: {model.get_input_embeddings().embedding_dim}")

# Add new special tokens, then resize the model's token embeddings to match the enlarged tokenizer
special_tokens = ["<|im_start|>", "<|im_end|>"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
model.resize_token_embeddings(len(tokenizer))

print("\nAfter resizing:")
print(f"Vocabulary size: {model.config.vocab_size}")
print(f"Embedding size: {model.config.embedding_size}")
print(f"Model embedding dimension: {model.get_input_embeddings().embedding_dim}")

input_text = "This is a dummy input text."
input_ids = tokenizer.encode(input_text, return_tensors="pt")
labels = input_ids.clone()

input_ids = input_ids.to(model.device)
labels = labels.to(model.device)

with torch.no_grad():
    outputs = model(input_ids, labels=labels)

print("\nModel run completed without errors.")

Resulting in:

Before resizing:
Vocabulary size: 50280
Embedding size: 50304
Model embedding dimension: 2048

After resizing:
Vocabulary size: 50280
Embedding size: 50282
Model embedding dimension: 2048

Model run completed without errors.

Note that this does not update vocab_size. The intended purpose of keeping vocab_size and embedding_size separate is not entirely clear to me yet, but the intent seems to be to allow them to differ as long as embedding_size > vocab_size. A few further options for addressing this issue more completely include:

  • Update the new method to check whether config.embedding_size is defined and update vocab_size if it isn't; this corresponds to how the wte layer is defined in the first place (link). A minimal sketch of this check follows the list.
  • Add a separate function for resizing the vocabulary.
  • Treat vocab size vs. embedding size discrepancies as a separate issue and review the current change as-is.
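For the first option, a minimal sketch of the fallback check (the helper name is hypothetical and not part of this PR):

def _resolved_embedding_size(config) -> int:
    # Mirror how the wte layer chooses its size: use embedding_size when it is set,
    # otherwise fall back to vocab_size. Illustrative only.
    return getattr(config, "embedding_size", None) or config.vocab_size

The resize method could then update whichever of the two attributes the wte layer was actually built from.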

Please let me know if you have any questions or suggestions regarding this PR.

djliden and others added 4 commits March 13, 2024 14:36
This commit introduces the `resize_token_embeddings` function for the
`OLMoForCausalLM` class. The function is based on the implementation
from the Hugging Face Transformers library, with modifications to
accommodate the specific requirements of the OLMo model.

The `resize_token_embeddings` function allows resizing the input token
embeddings matrix of the model when the number of tokens differs from
the model's `embedding_size` configuration. It updates the `embedding_size`
attribute in both the model configuration and the model itself, ensuring
consistency after resizing.

The function also handles tying the weights of the input and output
embeddings if the model supports weight tying.

Attribution:
The implementation of this function is inspired by and adapted from
the `resize_token_embeddings` function in the Hugging Face Transformers
library (https://github.com/huggingface/transformers). The original code
is licensed under the Apache License 2.0.

Modifications:
- Updated the function to use the `embedding_size` attribute instead of
  `vocab_size` to align with the OLMo model configuration.
- Adjusted the docstring and comments to match the OLMo model's terminology
  and requirements.
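Based on this commit message and the excerpts quoted in the review below, the override has roughly the following shape. This is a condensed sketch, not the merged code verbatim:

from typing import Optional

import torch
from transformers import PreTrainedModel


class OLMoForCausalLM(PreTrainedModel):
    # ... rest of the model class elided ...

    def resize_token_embeddings(
        self, new_num_tokens: Optional[int] = None, pad_to_multiple_of: Optional[int] = None
    ) -> torch.nn.Embedding:
        # Let the PreTrainedModel helper perform the actual matrix resize.
        model_embeds = self._resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
        if new_num_tokens is None and pad_to_multiple_of is None:
            return model_embeds

        # Record the new size in embedding_size (the base class updates vocab_size instead).
        self.config.embedding_size = model_embeds.weight.shape[0]
        self.model.config.embedding_size = model_embeds.weight.shape[0]

        # Re-tie input and output embedding weights if the model ties them.
        self.tie_weights()

        return model_embeds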
djliden (Contributor, Author) commented Mar 13, 2024

@2015aroras @AkshitaB I'm not able to assign reviewers, so flagging for review as requested in #491. Thanks!

@djliden changed the title from "## Add custom resize_token_embeddings method to OLMoForCausalLM (#491)" to "Add custom resize_token_embeddings method to OLMoForCausalLM (#491)" on Mar 13, 2024
2015aroras (Collaborator) left a comment

Vocab size captures how many different tokens a tokenizer can produce. Embedding size captures how many different integers the embedding layer can produce an embedding for. Theoretically speaking, the vocab and embedding sizes would be the same, since the embedding layer should have exactly 1 embedding per token and no extra embeddings.

In practice, an embedding size that is not a multiple of 128 can cause a slight performance degradation (broadly speaking, having factors of 2 seems to be beneficial). Thus we use a bigger embedding size than needed, resulting in embeddings for some integers that are not valid tokens. Hopefully that makes sense.
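To make the padding concrete, here is the arithmetic with the numbers printed earlier in this PR (the exact rounding rule is an illustration; the point is only that embedding_size is deliberately rounded up past vocab_size):

vocab_size = 50280                                   # tokens the tokenizer can actually produce
embedding_size = ((vocab_size + 127) // 128) * 128   # round up to the next multiple of 128
print(embedding_size)                                # 50304, so 24 rows correspond to no real token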

The PR itself looks good! Just requesting some small changes.

    def resize_token_embeddings(
        self, new_num_tokens: Optional[int] = None, pad_to_multiple_of: Optional[int] = None
    ) -> torch.nn.Embedding:
        """
2015aroras (Collaborator) commented

This comment seems to be the exact same as the one from the parent class. If it is, then it's not needed here (code editors and IDEs should be able to show the parent class's comment when one hovers over resize_token_embeddings). Having the comment somewhat implies that we are defining what the method does, but it is the parent doing it.

            return model_embeds

        # Update base model and current model config
        self.config.embedding_size = model_embeds.weight.shape[0]
2015aroras (Collaborator) commented Mar 13, 2024

Could you add a warning for when the new embedding size is less than the vocab size? Something like "Resizing token embeddings to size <size>, which is less than the vocab size <vocab size> of the tokenizer."
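A rough sketch of the requested check as a standalone helper (the function name is hypothetical; in the PR the check lives inside resize_token_embeddings):

import warnings


def warn_if_embeddings_smaller_than_vocab(new_embedding_size: int, vocab_size: int) -> None:
    # Warn when the resized embedding matrix no longer has a row for every token id
    # the tokenizer can produce.
    if new_embedding_size < vocab_size:
        warnings.warn(
            f"Resizing token embeddings to size {new_embedding_size}, which is less than the "
            f"vocab size {vocab_size} of the tokenizer."
        )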

djliden (Contributor, Author) commented Mar 15, 2024

@2015aroras I added a warning if the new embedding size is < the vocab size. I could see an argument for making it an error for consistency with this, or omitting entirely since it will result in an error as soon as you try to do anything with it anyway. Though I think it also works as-is.

I left the comment alone as it does change vocab size to embedding size (so not the same as base class), but please let me know if you'd like it amended regardless. Thanks!

@djliden djliden requested a review from 2015aroras March 15, 2024 15:23
2015aroras (Collaborator) commented

> @2015aroras I added a warning if the new embedding size is < the vocab size. I could see an argument for making it an error for consistency with this, or omitting entirely since it will result in an error as soon as you try to do anything with it anyway. Though I think it also works as-is.

I leaned towards warning instead of error since the user is actively modifying the embedding size. I don't feel strongly about this and resizing embeddings is a relatively uncommon scenario, so any of the 3 options is ok.

> I left the comment alone as it does change vocab size to embedding size (so not the same as base class), but please let me know if you'd like it amended regardless. Thanks!

I didn't notice the vocab size to embedding size change! I think it's ok to leave it as is, though you could describe how the method differs from PreTrainedModel.resize_token_embeddings.

djliden and others added 2 commits March 20, 2024 14:13
Clarifies the difference between the updated method and the base class method
djliden (Contributor, Author) commented Mar 20, 2024

@2015aroras I updated the docstring noting the difference, and I'm happy leaving it as a warning. Thanks!

2015aroras (Collaborator) left a comment

Just small things, otherwise looking good!

f"{self.config.vocab_size} defined in the model configuration. Make sure your tokenizer's vocabulary "
"size is less than or equal to the new token embedding size."
)
warnings.warn(warning_message)
2015aroras (Collaborator) commented

We tend to use log = logging.getLogger(__name__) and log.warning(...) to log warnings now (e.g. the existing log.warning(...) calls elsewhere in the codebase).
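A minimal sketch of that convention applied to the same check (helper name still hypothetical):

import logging

log = logging.getLogger(__name__)


def warn_if_embeddings_smaller_than_vocab(new_embedding_size: int, vocab_size: int) -> None:
    # Same check as before, but routed through the module-level logger instead of warnings.warn.
    if new_embedding_size < vocab_size:
        log.warning(
            "Resizing token embeddings to size %d, which is less than the vocab size %d "
            "defined in the model configuration.",
            new_embedding_size,
            vocab_size,
        )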

self.config.embedding_size = model_embeds.weight.shape[0]
self.model.config.embedding_size = model_embeds.weight.shape[0]

# Check if the embedding size is less than the vocab size
2015aroras (Collaborator) commented

nit: Not enough indentation on this line

Switches warning to log.warning and fixes comment indent
djliden (Contributor, Author) commented Mar 21, 2024

@2015aroras Changed the warning format and fixed the indentation. Thanks!

@djliden djliden requested a review from 2015aroras March 21, 2024 22:23
2015aroras (Collaborator) left a comment

LGTM! Just fix the code style using Ruff and then you can merge this.

djliden (Contributor, Author) commented Mar 22, 2024

@2015aroras done!

djliden (Contributor, Author) commented Mar 23, 2024

@2015aroras Thanks for the reviews/feedback/etc! I don't have write access to merge this PR myself—from my side, both this PR and #503 are ready to go. Please let me know if there's anything else needed, otherwise feel free to merge when you have a chance. Appreciate your help!

2015aroras merged commit 0de5fdc into allenai:main on Apr 2, 2024
11 checks passed