Allow specifying own Hugging Face tokenizer instance #717

shawnz · 2024-02-29T18:27:30Z

This adds the option of passing your own PreTrainedTokenizer instance instead of a model name from which outlines constructs the tokenizer for you. This is convenient when you are already using the LLM in other parts of your application and only need outlines for a specific use case.

Example of use:

from transformers import AutoTokenizer, AutoModelForCausalLM
from outlines.models.transformers import TransformerTokenizer, Transformer

model_name = "hf-internal-testing/tiny-random-GPTJForCausalLM"
hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = TransformerTokenizer(hf_tokenizer)
model = Transformer(hf_model, tokenizer)

As compared to before this change:

from transformers import AutoTokenizer, AutoModelForCausalLM
from outlines.models.transformers import TransformerTokenizer, Transformer

model_name = "hf-internal-testing/tiny-random-GPTJForCausalLM"
hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = TransformerTokenizer(model_name) # Can't make use of hf_tokenizer here, even if
                                             # you already instantiated it for other reasons!
model = Transformer(hf_model, tokenizer)
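For anyone curious how a single constructor argument can take either form, here is a minimal sketch of the dispatch. It is an illustration and not the actual diff: `FakeTokenizer`, `load_by_name` and `TokenizerWrapper` are hypothetical stand-ins (for a Hugging Face tokenizer, for `AutoTokenizer.from_pretrained`, and for `TransformerTokenizer` respectively), so the snippet runs without downloading anything.

```python
from typing import Union


class FakeTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer instance."""

    def __init__(self, name: str):
        self.name = name


def load_by_name(model_name: str) -> FakeTokenizer:
    """Hypothetical stand-in for AutoTokenizer.from_pretrained."""
    return FakeTokenizer(model_name)


class TokenizerWrapper:
    """Illustrative wrapper; not the real TransformerTokenizer."""

    def __init__(self, tokenizer_or_name: Union[FakeTokenizer, str]):
        if isinstance(tokenizer_or_name, str):
            # A model name was given: construct the tokenizer ourselves.
            self.tokenizer = load_by_name(tokenizer_or_name)
        else:
            # An instance was given: reuse what the caller already has.
            self.tokenizer = tokenizer_or_name


existing = FakeTokenizer("tiny-model")
wrapped = TokenizerWrapper(existing)        # reuses `existing`
from_name = TokenizerWrapper("tiny-model")  # builds a fresh tokenizer
```

Both call styles keep working, so existing code that passes a model name should not need to change.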

This also corrects a type annotation on the outlines.models.transformers.Transformer class, which incorrectly stated that the tokenizer argument was of type transformers.PreTrainedTokenizer when it should be of type outlines.models.transformers.TransformerTokenizer. To fix the annotation, the two classes also had to be reordered so that TransformerTokenizer is already defined when Transformer refers to it.
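The ordering constraint can be shown with a small sketch. The class names `WrapperTokenizer` and `Model` are hypothetical stand-ins for `TransformerTokenizer` and `Transformer`; the snippet assumes nothing about the rest of the module.

```python
class WrapperTokenizer:
    """Hypothetical stand-in for TransformerTokenizer."""


class Model:
    """Hypothetical stand-in for Transformer."""

    # The annotation names WrapperTokenizer directly (not as a string).
    # When annotations are evaluated eagerly at definition time, that name
    # must already exist, which is why the wrapper class is defined first.
    def __init__(self, tokenizer: WrapperTokenizer):
        self.tokenizer = tokenizer


model = Model(WrapperTokenizer())
annotation = Model.__init__.__annotations__["tokenizer"]
```

An alternative would be to quote the annotation as a string, which avoids the reordering at the cost of a less direct reference.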

rlouf · 2024-03-01T08:18:49Z

This part of the interface has been bugging me for a while, thank you for helping!

shawnz added 2 commits February 29, 2024 13:09

Allow specifying own HF tokenizer object

d99193b

Run commit hooks

89f2d52

rlouf added the enhancement and transformers (Linked to the `transformers` integration) labels Mar 1, 2024

rlouf merged commit c4de2e0 into dottxt-ai:main Mar 1, 2024
5 checks passed

rlouf mentioned this pull request Mar 1, 2024

Add ability to create an outlines object directly from Transformers model/tokenizer #709

Closed
