
Allow specifying own Hugging Face tokenizer instance #717

Merged
rlouf merged 2 commits into dottxt-ai:main on Mar 1, 2024

Conversation

shawnz (Contributor) commented Feb 29, 2024

This adds the option to pass your own PreTrainedTokenizer instance instead of a model name, so outlines no longer has to construct the tokenizer for you. This can be convenient when you are already using the LLM in other parts of your application and only need outlines for a specific use case.

Example of use:

from transformers import AutoTokenizer, AutoModelForCausalLM
from outlines.models.transformers import TransformerTokenizer, Transformer

model_name = "hf-internal-testing/tiny-random-GPTJForCausalLM"
hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)

# Reuse the already-instantiated Hugging Face tokenizer directly
tokenizer = TransformerTokenizer(hf_tokenizer)
model = Transformer(hf_model, tokenizer)

As compared to before this change:

from transformers import AutoTokenizer, AutoModelForCausalLM
from outlines.models.transformers import TransformerTokenizer, Transformer

model_name = "hf-internal-testing/tiny-random-GPTJForCausalLM"
hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = TransformerTokenizer(model_name) # Can't make use of hf_tokenizer here, even if
                                             # you already instantiated it for other reasons!
model = Transformer(hf_model, tokenizer)
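
For context, a minimal sketch of how the manually constructed model from the first example could then be used for generation. This is not part of the PR; it assumes the outlines.generate.text entry point accepts a model built this way:

import outlines

# Wrap the manually constructed model in a text generator (assumed API)
generator = outlines.generate.text(model)
result = generator("Write a short greeting:", max_tokens=5)
print(result)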

This also corrects a type annotation on the outlines.models.transformers.Transformer class, which incorrectly stated that the tokenizer argument was of type transformers.PreTrainedTokenizer when it should actually be outlines.models.transformers.TransformerTokenizer. Fixing the annotation also required reordering the two classes so that TransformerTokenizer is defined before Transformer refers to it.
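
A minimal sketch of the reordering and the corrected annotation (illustrative only; the real class bodies are omitted and the attribute names here are assumptions):

from transformers import PreTrainedModel, PreTrainedTokenizer

class TransformerTokenizer:
    # Defined first so its name is available for the annotation below;
    # accepting a PreTrainedTokenizer instance is the new option added by this PR
    def __init__(self, tokenizer: PreTrainedTokenizer):
        self.tokenizer = tokenizer

class Transformer:
    # tokenizer is the outlines wrapper type, not the raw
    # transformers.PreTrainedTokenizer it was previously annotated as
    def __init__(self, model: PreTrainedModel, tokenizer: TransformerTokenizer):
        self.model = model
        self.tokenizer = tokenizer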

@rlouf added the enhancement and transformers (Linked to the `transformers` integration) labels on Mar 1, 2024
rlouf (Member) commented Mar 1, 2024

This part of the interface has been bugging me for a while, thank you for helping!

rlouf merged commit c4de2e0 into dottxt-ai:main on Mar 1, 2024
5 checks passed