Convert between text and tokens easily #6

Open
deontologician opened this issue Dec 7, 2020 · 1 comment
@deontologician
Owner

Currently a couple of the APIs talk in tokens, which is inconvenient. It would be nice if you could translate text into tokens and vice versa easily.

The rust_tokenizers crate has a function called from_file that allows instantiating the GPT2 tokenizer from a couple of pretrained tokenizer files. These files are available from Hugging Face's website here:

There is also an example in rust_bert of constructing a GPT2 tokenizer. Ideally the tokenizer would be built lazily, so users of the library don't pay the construction cost unless they need these features.
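One way the lazy construction could look is sketched below using only the standard library's `OnceLock`. This is not the real GPT2 tokenizer: `ToyTokenizer`, its three-word vocabulary, `tokenizer()`, and `build_count()` are all hypothetical names made up for illustration, and the whitespace splitting stands in for real byte-pair encoding. The point is only the shape: the expensive build runs on first use and never again.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the tokenizer is built, so laziness can be observed.
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a real GPT2 tokenizer: one ID per known word.
pub struct ToyTokenizer {
    word_to_id: HashMap<&'static str, u32>,
    id_to_word: Vec<&'static str>,
}

impl ToyTokenizer {
    /// In a real library this is where the vocab and merges files would be loaded.
    fn build() -> Self {
        BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
        let id_to_word = vec!["hello", "world", "rust"];
        let word_to_id = id_to_word
            .iter()
            .enumerate()
            .map(|(i, w)| (*w, i as u32))
            .collect();
        ToyTokenizer { word_to_id, id_to_word }
    }

    /// Text -> token IDs. Unknown words are skipped in this sketch.
    pub fn encode(&self, text: &str) -> Vec<u32> {
        text.split_whitespace()
            .filter_map(|w| self.word_to_id.get(w).copied())
            .collect()
    }

    /// Token IDs -> text. Unknown IDs are skipped in this sketch.
    pub fn decode(&self, ids: &[u32]) -> String {
        ids.iter()
            .filter_map(|&id| self.id_to_word.get(id as usize).copied())
            .collect::<Vec<_>>()
            .join(" ")
    }
}

/// Returns the shared tokenizer, building it on the first call only.
pub fn tokenizer() -> &'static ToyTokenizer {
    static TOKENIZER: OnceLock<ToyTokenizer> = OnceLock::new();
    TOKENIZER.get_or_init(ToyTokenizer::build)
}

/// Number of times the tokenizer has been constructed so far.
pub fn build_count() -> usize {
    BUILD_COUNT.load(Ordering::SeqCst)
}
```

Callers that never touch `tokenizer()` never trigger `build()`, and repeated calls share one instance. `OnceLock` needs Rust 1.70 or later; on older toolchains a crate such as `once_cell` offers the same pattern.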

Where to use it

This looks most useful with the logit_bias feature, since the API requires you to send token numbers rather than actual strings. Since the example code is in Python, this is a bit of a barrier to users in Rust.
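To make the logit_bias use case concrete, here is a rough sketch of turning token IDs into the map the API expects, assuming the documented shape of a JSON object whose keys are token IDs (as strings) and whose values are biases between -100 and 100. The helper names `logit_bias` and `to_json` are hypothetical, and the token IDs would come from whatever tokenizer ends up being used.

```rust
use std::collections::BTreeMap;

/// Builds a `logit_bias` map: token ID (as a string key) -> bias.
/// The bias is clamped to [-100, 100], the range assumed to be accepted.
pub fn logit_bias(token_ids: &[u32], bias: i32) -> BTreeMap<String, i32> {
    let bias = bias.clamp(-100, 100);
    token_ids.iter().map(|id| (id.to_string(), bias)).collect()
}

/// Renders the map as a compact JSON object without pulling in a JSON crate.
/// Keys are digit-only strings, so no escaping is needed here.
pub fn to_json(map: &BTreeMap<String, i32>) -> String {
    let fields: Vec<String> = map
        .iter()
        .map(|(id, bias)| format!("\"{}\":{}", id, bias))
        .collect();
    format!("{{{}}}", fields.join(","))
}
```

For example, `to_json(&logit_bias(&[50256], -100))` yields `{"50256":-100}`, which would suppress that one token. A `BTreeMap` is used only so the output order is deterministic; in a real client the map would more likely be serialized with serde.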

@deontologician
Owner Author

Apparently there is now just the Hugging Face tokenizers library, which is written in Rust: https://github.com/huggingface/tokenizers
