Fix llamacpp caching by making `LlamaCppTokenizer` an outlines `Tokenizer` #929
Conversation
Also, the most important thing we need is a test that helps avoid issues like this in the future.
@brandonwillard should be ready for re-review when you have a chance, by the way. A minimal reproducer test has been added.
I just noticed this, but shouldn't `LlamaCppTokenizer` extend our `Tokenizer` class in the first place? If that's the case, we can use that interface and simply copy the relevant parts (e.g. serialization methods) of our `transformers` `Tokenizer` subclass and be done.
I agree that we should work towards that in general (relevant discussion: #695). vLLM also uses a separate tokenizer, which doesn't guarantee the same interface. We should address this first to ensure tests are passing.
We can start that work here. It should only involve a change to `LlamaCppTokenizer`.
I've implemented this. It cannot be loaded from disk because the underlying `llama_cpp` tokenizer cannot be pickled.
Force-pushed from fdee645 to 9333a25
outlines/models/llamacpp.py (Outdated)
```python
def __eq__(self, other):
    return hash(self) == hash(other)
```
This won't be consistent due to hash conflicts. The members/fields of this class, aside from `self.tokenizer`, should be sufficient to define this method, just as they are for `__getstate__` and `__hash__`. (If they're not, then we need to add any missing members/fields to this class.)

N.B. See `__eq__` and `__hash__` for more info.
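A member-wise comparison along those lines might look like the following sketch (the field names are borrowed from later in this thread and are assumptions about the class, not the merged code):

```python
def __eq__(self, other):
    # Compare the actual members rather than hashes, which can
    # collide for unequal objects.
    if not isinstance(other, LlamaCppTokenizer):
        return NotImplemented
    return (
        self.vocabulary == other.vocabulary
        and self.eos_token_id == other.eos_token_id
        and self.eos_token == other.eos_token
        and self.pad_token_id == other.pad_token_id
        and self.special_tokens == other.special_tokens
    )
```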
Force-pushed from 00d1597 to 280611f
Thanks for your feedback, @brandonwillard! I've incorporated your requested changes.
```python
def __hash__(self):
    # cache object hash
    if self._hash is None:
        self._hash = hash(pickle.dumps(self))
    return self._hash
```
I don't think `pickle.dumps` handles `dict` key and/or `set` value ordering, so this might not be a good approach.
I think I see what you were trying to do previously with the sorting, but it doesn't matter for serialization. It might only seem to matter because you're mixing the serialization interface with the equivalence check and hashing. That's not necessary, though; you can compare the relevant member objects directly in `__eq__` and take special steps that are good for hashing in `__hash__` and only there (e.g. `json.dumps(..., sort_keys=True)` for `dict`s seems to be favored by many).
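For instance, a `__hash__` along those lines might look like this sketch (`json.dumps(..., sort_keys=True)` canonicalizes `dict` key order; the field names are assumptions carried over from the sketch above):

```python
import json


def __hash__(self):
    # Canonicalize ordering only here, in the hashing path; __eq__ and
    # the serialization interface don't need this step.
    return hash(
        json.dumps(
            {
                "vocabulary": self.vocabulary,
                "eos_token_id": self.eos_token_id,
                "eos_token": self.eos_token,
                "pad_token_id": self.pad_token_id,
                "special_tokens": sorted(self.special_tokens),
            },
            sort_keys=True,
        )
    )
```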
> I think I see what you were trying to do previously with the sorting, but it doesn't matter for serialization. It might only seem to matter because you're mixing the serialization interface with the equivalence check and hashing. That's not necessary

It matters because the only reason `Tokenizer`s are serializable via `__getstate__` is that their serialized form is used to create a stable hash for the `@cache`d `create_states_mapping(regex_string, tokenizer)`. Tokenizers are never deserialized. They are serialized for hashing.
> (e.g. `json.dumps(..., sort_keys=True)` for `dict`s seems to be favored by many).
`json.dumps` is much slower than pickle, and pickling a `Tokenizer` already takes 0.25 seconds, which matters because every time we create an FSM index we check the cache, which has the pickled tokenizer as a key.

Dicts have had stable (insertion) order since Python 3.6, and while I successfully experimented with this, I don't know of a guarantee that pickle maintains order. How about we revert to sorting then pickling, to be safe?
But this discussion brings up an important point. Once the other index-construction bottlenecks are taken care of by #795, maybe we should address the performance issues I just described: we should only calculate the hash of the serialized tokenizer once. This is much better than serializing the tokenizer every single time `create_states_mapping()` is called.
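A sketch combining both ideas, sorting the order-sensitive containers before pickling and memoizing the result (assuming a `_hash` attribute initialized to `None` in the constructor):

```python
import pickle


def __hash__(self):
    if self._hash is None:
        # Sort the order-sensitive containers so the pickled bytes,
        # and therefore the hash, are stable across runs...
        stable_state = (
            sorted(self.vocabulary.items()),
            self.eos_token_id,
            self.eos_token,
            self.pad_token_id,
            sorted(self.special_tokens),
        )
        # ...and serialize at most once per tokenizer instance.
        self._hash = hash(pickle.dumps(stable_state))
    return self._hash
```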
> It matters because the only reason `Tokenizer`s are serializable via `__getstate__` is that their serialized form is used to create a stable hash for the `@cache`d `create_states_mapping(regex_string, tokenizer)`. Tokenizers are never deserialized. They are serialized for hashing.
You're still mixing up serialization with equivalence and hashing. The hashing we're talking about here (i.e. `__hash__`) is completely independent of any caching with `cache`.
Also, if you want to address the potential cache misses due to `dict` and `set` ordering, that can be done in `CloudpickleDisk`. That's where serialization is used for cache-indexing purposes.
> `json.dumps` is much slower than pickle, and pickling a `Tokenizer` already takes 0.25 seconds, which matters because every time we create an FSM index we check the cache, which has the pickled tokenizer as a key.
We can use whatever is sufficiently performant and accurate.
> How about we revert to sorting then pickling, to be safe?
If that's the best option, then we might as well make sure that `self.vocabulary` is sorted upon construction/creation. Sometimes these `dict`s are already sorted by token ID, in which case that canonicalization step would be rather efficient.
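A hypothetical helper illustrating that canonicalization step (dicts preserve insertion order on Python 3.7+, so sorting once at construction yields a stable mapping):

```python
def _canonicalize_vocabulary(vocabulary: dict) -> dict:
    # Sort by token ID; in the common case where the vocabulary is
    # already ordered this way, Timsort makes this pass nearly linear.
    return dict(sorted(vocabulary.items(), key=lambda item: item[1]))
```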
I created #933 to address potential concerns regarding cache misses.

In `cloudpickle`, dicts are deterministic for Python versions > 3.7, but sets are not. I pre-sorted the vocabulary in the latest push, and I sort the special-tokens set when `__getstate__` is called.

Please review the latest changes.
Force-pushed from 62e1ef6 to 397cdb5
```python
self.eos_token_id,
self.eos_token,
self.pad_token_id,
sorted(self.special_tokens),
```
I was going to ask why `special_tokens` isn't sorted in the constructor, but now I don't even see where/how it's being populated.
Oh, it might be `adapt_tokenizer`. That's problematic, since it means that these class instances aren't actually immutable, which, at the very least, invalidates the hashability requirement.

It looks like we need to change `adapt_tokenizer` so that it returns a completely new `Tokenizer` instance, or perhaps integrate the changes made by `adapt_tokenizer` into the `Tokenizer` subclasses directly.
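One way to preserve immutability, sketched under the assumption that `adapt_tokenizer` currently mutates its argument in place:

```python
import copy

from outlines.models.tokenizer import Tokenizer


def adapt_tokenizer(tokenizer: Tokenizer) -> Tokenizer:
    # Sketch: never mutate the argument, since that would invalidate
    # any hash computed before the adaptation. Apply the adaptations
    # to a copy and return the new instance instead.
    adapted = copy.copy(tokenizer)
    # ... apply the usual adapt_tokenizer changes to `adapted` here ...
    return adapted
```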
> I was going to ask why `special_tokens` isn't sorted in the constructor, but now I don't even see where/how it's being populated.

`cloudpickle` has non-deterministic behavior for sets (even `frozenset`s), so we need to convert to a sorted list when serializing to ensure a stable hash.
> Oh, it might be `adapt_tokenizer`. That's problematic, since it means that these class instances aren't actually immutable, which, at the very least, invalidates the hashability requirement.

`adapt_tokenizer` doesn't apply to `llamacpp` for now. It's only called in `integrations/transformers.py` and `integrations/vllm.py`.
> It looks like we need to change `adapt_tokenizer` so that it returns a completely new `Tokenizer` instance, or perhaps integrate the changes made by `adapt_tokenizer` into the `Tokenizer` subclasses directly.

IMHO we should create a separate issue to unify/correct the behavior and interfaces of tokenizers in general, to prevent the scope of this PR from growing too large. This PR doesn't introduce any new problems with `LlamaCppTokenizer`, but it does fix the tests which have been failing in `main` all week.
We can merge this now in order to prevent any more errors, but we need some follow-up issues for the mutability inconsistencies related to these types.
I agree 100%: #936
Fixes #922

Problem

`LlamaCppTokenizer` in `integrations/llamacpp.py` has `self.decode = model.tokenizer.decode`. This is not picklable: attempting to pass a `LlamaCppTokenizer` to a `@cache`d function results in a `cloudpickle.dumps(tokenizer)` call, which raises `ValueError: ctypes objects containing pointers cannot be pickled`.

Locally, most `test_integration_llamacpp.py` tests fail. (Full test-failure details are in #922.)
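A minimal sketch of that failure mode (the `BrokenTokenizer` class and the `model` object are hypothetical stand-ins, not code from this repo):

```python
import cloudpickle


class BrokenTokenizer:
    """Hypothetical stand-in for the pre-fix LlamaCppTokenizer."""

    def __init__(self, model):
        # Storing a bound method of the ctypes-backed llama_cpp
        # tokenizer is what makes the instance unpicklable.
        self.decode = model.tokenizer.decode


# tokenizer = BrokenTokenizer(model)  # `model` is a loaded llamacpp model
# cloudpickle.dumps(tokenizer)
# -> ValueError: ctypes objects containing pointers cannot be pickled
```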
Solution

Make `LlamaCppTokenizer` a subclass of `outlines.models.tokenizer.Tokenizer` and implement its abstract methods. Specifically, `__getstate__` is necessary to ensure `cloudpickle.dumps` works and hashing is possible.
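Pulling the threads of the review together, the shape of the fix is roughly the following sketch (reconstructed from the discussion above, with the encode/decode plumbing omitted; it is not the merged diff):

```python
import pickle

from outlines.models.tokenizer import Tokenizer


class LlamaCppTokenizer(Tokenizer):
    # Only the plain-data fields below participate in serialization;
    # the ctypes-backed llama_cpp tokenizer itself is excluded.

    def __getstate__(self):
        """Create a stable, picklable representation for caching."""
        return (
            self.vocabulary,
            self.eos_token_id,
            self.eos_token,
            self.pad_token_id,
            sorted(self.special_tokens),
        )

    def __setstate__(self, state):
        # Serialization exists only to produce stable cache keys;
        # tokenizers are never reconstructed from disk.
        raise NotImplementedError("Cannot load a pickled llamacpp tokenizer")

    def __hash__(self):
        # Memoized so the tokenizer is serialized at most once.
        if getattr(self, "_hash", None) is None:
            self._hash = hash(pickle.dumps(self.__getstate__()))
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, LlamaCppTokenizer):
            return NotImplemented
        return self.__getstate__() == other.__getstate__()
```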