Separate TextEncoder and Tokenizer from Transformers.jl #199
It's kinda convoluted, but before that, could you elaborate on the awkwardness? It might be an important point for the overall design. So as you might already know, BPE.jl has a lightweight interface (…
Apologies, this could be an issue on my side, but I was genuinely confused about what to use for something as simple as encoding (getting an id vector from a string) and decoding (getting a string back from an id vector).
I wanted to use the tokenizer here: https://huggingface.co/RWKV/rwkv-4-430m-pile
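Concretely, what I was hoping for is a round trip along these lines (just a sketch of the desired usage; the function names below are placeholders, not an existing API in any of these packages):

```julia
# Hypothetical minimal interface -- the names here are placeholders, not real functions.
tok = load_tokenizer("RWKV/rwkv-4-430m-pile")  # fetch and build the tokenizer from the hub

ids  = encode_ids(tok, "I am a sentence")   # String      -> Vector{Int}
text = decode_ids(tok, ids)                 # Vector{Int} -> String
@assert text == "I am a sentence"
```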
I have an idea for this: would it make sense to make all the HuggingFace tokenization stuff a part of TextEncodeBase.jl (or BPE.jl, I'll let you decide that) as extensions that depend on HuggingFaceApi.jl? While that is one suggestion, an alternative is to split the capabilities of Transformers.jl into different extensions.
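To sketch how that could look with package extensions (Julia 1.9+), the tokenizer package's Project.toml could declare HuggingFaceApi.jl as a weak dependency; the extension module name below is hypothetical:

```toml
# Sketch of a package-extension setup (Julia >= 1.9); the extension module name is made up.
[weakdeps]
HuggingFaceApi = "..."  # UUID of HuggingFaceApi.jl

[extensions]
# only loaded when HuggingFaceApi.jl is present in the environment
HuggingFaceLoaderExt = "HuggingFaceApi"
```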
@chengchingwen As a side note, the tokenizer I loaded using the above method isn't decoding properly. Encoding works well, but decoding gives me a string that still has the \u0120 character in it, when it's supposed to be replaced by a space during decoding: "\u0120am" should be decoded as " am", and I think that's how the GPT-2 decoder is supposed to work too. This happens for some other characters as well.
It's ok, this situation itself is messy. The thing is, all the implementations can get an id vector from a string and get a string back from an id vector, but it might not be the one you want. For example, do you need pre/post-processing (e.g. batching, padding, truncation), and do you want the tokenizer to behave the same as huggingface/transformers' tokenizer? Some functions and settings are hardcoded in the Python code (per model) and not serialized in the files on the hub.
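To give a concrete picture of the kind of setting I mean, here is a rough sketch of truncation and padding for a batch of id vectors; the helper, the default length, and the pad id are made up for illustration and are not an API of any of the packages discussed:

```julia
# Illustrative only: pad/truncate a batch of id vectors to a fixed length.
# `pad_id` and `max_len` are model-specific settings that huggingface/transformers
# often hardcodes in the Python tokenizer class rather than in the files on the hub.
function pad_batch(batch::Vector{Vector{Int}}; max_len::Int = 8, pad_id::Int = 0)
    map(batch) do ids
        ids = ids[1:min(end, max_len)]                  # truncation
        vcat(ids, fill(pad_id, max_len - length(ids)))  # padding
    end
end

pad_batch([[5, 17, 2], [9, 9, 9, 9, 9, 9, 9, 9, 9]])
# => [[5, 17, 2, 0, 0, 0, 0, 0], [9, 9, 9, 9, 9, 9, 9, 9]]
```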
The behavior is actually expected. It involves a few implementation details. The BPE/vocab you load operate on the transformed codepoints (e.g. with the \u0120). The code mapping/unmapping is part of the postprocessing, but …
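Roughly, GPT-2-style byte-level BPE first remaps raw bytes to printable codepoints before tokenization, and decoding has to undo that map as a post-processing step; the space byte 0x20 ends up as \u0120 ('Ġ'), which is what you are seeing. Below is a stripped-down sketch of that mapping and unmapping, just to illustrate the idea; it is not the actual implementation in BytePairEncoding.jl:

```julia
# Sketch of the GPT-2 style byte <-> codepoint mapping (simplified; the real
# table lives inside the BPE implementation, not in the vocab files on the hub).
# Printable byte ranges are kept as-is; everything else is shifted up past 255,
# which is why the space byte 0x20 shows up as '\u0120' ('Ġ') in the raw output.
const KEPT = vcat(0x21:0x7e, 0xa1:0xac, 0xae:0xff)

function byte_to_codepoint()
    m = Dict{UInt8,Char}(b => Char(b) for b in KEPT)
    n = 0
    for b in 0x00:0xff
        if !haskey(m, b)
            m[b] = Char(256 + n)  # 0x20 is the 33rd unmapped byte -> Char(288) == '\u0120'
            n += 1
        end
    end
    return m
end

const B2C = byte_to_codepoint()
const C2B = Dict(c => b for (b, c) in B2C)

# Decoding must apply this unmap as a post-processing step:
unmap(s::AbstractString) = String(UInt8[C2B[c] for c in s])

unmap("\u0120am")  # => " am"
```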
Extension is definitely needed, but we still need a package to hold the loader code. I kinda wonder if we could move the loader code to HuggingFaceApi.jl and make it the parent module of the extensions. It would no longer be a lightweight package, but it seems Transformers.jl is the only dependent of HuggingFaceApi.jl, so probably nobody would really be against it. Then we can have …
Currently I need to load a tokenizer from HuggingFace and use it simply for encoding and decoding sentences. Doing that through the Transformers.jl interface is already awkward (I had to go
```julia
tok = Transformers.HuggingFace.load_tokenizer("model") |>
    r -> BytePairEncoding.BPEEncoder(BytePairEncoding.BPETokenizer(r.tokenizer.tokenization), r.vocab)
```
but that's its own issue). Just for encoding and decoding purposes alone, loading something as big as Flux is not justified in my opinion. Is there any way that the text encoding and tokenization can be made a part of something else, for example TextEncodeBase.jl (with the HuggingFace API part going into a HuggingFaceApi.jl extension)?