Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plain input/output prompt strategy w/o chat templates #1346

Merged
merged 4 commits into from
Mar 4, 2024

Conversation

winglian
Copy link
Collaborator

No description provided.

@winglian winglian requested a review from hamelsmu February 28, 2024 19:16
@hamelsmu
Copy link
Collaborator

hamelsmu commented Feb 28, 2024

Thanks @winglian for starting this. Some important things for discussion:

  1. We need multi-turn support (you are currently using sharegpt for this, I would want to handle single-turn as just a subset of multi-turn. My suggestion is not to add separate prompt strategies for multi and single turn and just only have multi-turn.
  2. Can you wire it up to the config etc so I can test it? I can help writing docs and tests.
  3. Do you have any opinions on the parameters add_eos_token and strip_bos_token for the tokenizer? Do we need to set those (especially considering the multi-turn case)? I think it makes sense to perhaps do something like this (psuedo-code below)
# If there is only 1 message in the thread, need both EOS and BOS
if len(messages) == 1:
... = tokenize(add_eos_token=True, strip_bos_token=False)

# If first message in the thread with many messages, then we need BOS but not EOS 
elif first message in the thread and len(messages) > 1:
 .... =  tokenize(add_eos_token=False, strip_bos_token=False)

# inputs don't need an EOS token
if msg.type='input' and NOT the first message in the thread: 
 .... =  tokenize(add_eos_token=False, strip_bos_token=True)

# outputs should have an EOS token always?
elif msg.type='output' and NOT first message in the thread:
.... = tokenize(add_eos_token=True, strip_bos_token=True)

Please check as sometimes I get this wrong - the actual code could be made more succint and readable, I just wrote it out long form to get the idea accross

EDIT: I talked with @winglian and we think that its better not to do any magic at all (so completely forget #3), and hand off all responsibility to the user to add EOS,BOS,EOT, etc. as well as whitespaces or newlines in b/w things. This will not only simplify the code, but give users extreme freedom w/strings

@winglian winglian requested a review from hamelsmu February 29, 2024 22:01
@winglian winglian merged commit 4d09b42 into main Mar 4, 2024
6 checks passed
@winglian winglian deleted the io-prompt-strategy branch August 16, 2024 14:43
djsaunde pushed a commit that referenced this pull request Dec 17, 2024
* plain input/output prompt strategy w/o chat templates

* disable duplicate code check

* make sure to add an eos/eot token to the end of the output so it will stop

* multi turn segement support and test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants