plain input/output prompt strategy w/o chat templates #1346
Thanks @winglian for starting this. Some important things for discussion:
# If there is only 1 message in the thread, it needs both BOS and EOS
if len(messages) == 1:
    ... = tokenize(add_eos_token=True, strip_bos_token=False)
# If it is the first message in a thread with many messages, it needs BOS but not EOS
elif first message in the thread and len(messages) > 1:
    ... = tokenize(add_eos_token=False, strip_bos_token=False)
# Inputs don't need an EOS token
elif msg.type == 'input' and not the first message in the thread:
    ... = tokenize(add_eos_token=False, strip_bos_token=True)
# Outputs should always have an EOS token?
elif msg.type == 'output' and not the first message in the thread:
    ... = tokenize(add_eos_token=True, strip_bos_token=True)

Please check, as I sometimes get this wrong. The actual code could be made more succinct and readable; I just wrote it out long form to get the idea across.

EDIT: I talked with @winglian and we think it's better not to do any magic at all (so completely forget #3) and hand off all responsibility to the user to add EOS, BOS, EOT, etc., as well as whitespace or newlines between things. This will not only simplify the code but also give users extreme freedom with strings.
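To make the "no magic" idea from the EDIT concrete, here is a minimal sketch, not the code in this PR. It assumes each segment is a dict with a `text` string and a boolean `label` flag (names chosen for illustration), and that the user has already written BOS, EOS, and any newlines into the text. `build_example` and `toy_encode` are hypothetical helpers; `toy_encode` is a character-level stand-in so the snippet runs without a model. With a Hugging Face tokenizer, the equivalent would be encoding with `add_special_tokens=False` so nothing is inserted behind the user's back.

```python
from typing import Callable, Dict, List, Tuple

IGNORE_INDEX = -100  # label value PyTorch's cross-entropy loss ignores by default


def build_example(
    segments: List[Dict],
    encode: Callable[[str], List[int]],
) -> Tuple[List[int], List[int]]:
    """Concatenate user-written segments without adding any special tokens."""
    input_ids: List[int] = []
    labels: List[int] = []
    for segment in segments:
        ids = encode(segment["text"])  # text is tokenized exactly as written
        input_ids.extend(ids)
        # Train only on segments flagged as outputs; mask everything else.
        labels.extend(ids if segment["label"] else [IGNORE_INDEX] * len(ids))
    return input_ids, labels


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per character (its code point)."""
    return [ord(ch) for ch in text]


segments = [
    {"text": "<s>Hello\n", "label": False},    # user writes BOS and the newline
    {"text": "Hi there!</s>", "label": True},  # user writes EOS so generation stops
]
input_ids, labels = build_example(segments, toy_encode)
```

The trade-off is the one described above: the code has no per-message rules about BOS/EOS, and a missing EOS in an output segment becomes the user's mistake rather than something the strategy silently corrects.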
8c138e1 to 6279be8
* plain input/output prompt strategy w/o chat templates
* disable duplicate code check
* make sure to add an eos/eot token to the end of the output so it will stop
* multi turn segment support and test
No description provided.