[REQUEST] Don't error when max_tokens makes a request too long (causing "Job required pages too small"), just generate up to the available pages. #262

Originalimoc opened this issue Dec 15, 2024 · 4 comments

Comments

@Originalimoc

Originalimoc commented Dec 15, 2024

Problem

Combined with issue #251, this means you constantly have to go back to the console and restart the server whenever a request hits the available pages, or it just hangs if you switch models...

Solution

As in the title: instead of erroring, just generate up to the available pages.

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
@DocShotgun
Member

DocShotgun commented Dec 15, 2024

It has previously been discussed that the client should be responsible for managing the length of the request (the tokenization endpoint was designed for this purpose), rather than letting tabby decide how to truncate the prompt. The rationale is that the client better understands which portions of the prompt are the least important and should be dropped when the request becomes too long.
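A minimal sketch of that client-side approach, assuming tabbyAPI exposes a token-encode endpoint that accepts a JSON body with a `text` field - the exact path, auth header, and response schema here are assumptions, so check your server version's docs:

```python
import requests

API_URL = "http://localhost:5000"  # assumed tabbyAPI base URL
API_KEY = "your-api-key"           # assumed auth header value

def count_tokens(text: str) -> int:
    """Ask the server how many tokens a prompt occupies.

    Endpoint path and response fields are assumptions based on the
    discussion above, not a documented contract.
    """
    resp = requests.post(
        f"{API_URL}/v1/token/encode",
        headers={"x-api-key": API_KEY},
        json={"text": text},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    # Fall back to counting the token list if no explicit length field.
    return data.get("length", len(data.get("tokens", [])))
```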

For chat completions, tabby could in theory try to "smart" truncate beginning with the messages right after the system prompt, if one is present - although this can still break prompt formats with role-order restrictions such as Mistral's (in the Mistral prompt format, the first message after the system prompt can't be an assistant-role message, which is what you'd get if the first user message were dropped for length). There also isn't really a "smart" way to do this for raw completions, where the beginning of the prompt would simply have to be dropped - so this is far better left to the client (i.e. a frontend like SillyTavern that pre-formats the prompt and sends the request as a raw completion).
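For illustration only, a rough sketch of what such "smart" truncation could look like on the client side - dropping the oldest non-system messages until the conversation fits a token budget. The helper names and budget value are placeholders, and this deliberately ignores the role-order restrictions mentioned above:

```python
def truncate_chat(messages, count_tokens, budget):
    """Drop the oldest non-system messages until the conversation fits.

    `messages` is a list of {"role": ..., "content": ...} dicts,
    `count_tokens` a callable returning a string's token cost, and
    `budget` the token budget left for the prompt. This is only an
    illustration of the idea discussed above, not tabby behavior.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    # Remove from the front (oldest turns) until we fit the budget.
    while rest and total(system + rest) > budget:
        rest.pop(0)

    # Caveat: formats with strict role ordering (e.g. Mistral) may now
    # be invalid if the remaining list starts with an assistant message.
    return system + rest
```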

I was told by users of the official OpenAI API that it also simply errors if the client sends a request that is too long. If that is inaccurate or has since changed, then auto-truncation could potentially be considered as a feature, although it has the problems mentioned above.

@Originalimoc
Author

That's not what I mean. I'm only saying that if the prompt PLUS max_tokens exceeds the maximum pages, then only generate up to (max_pages - prompt length); if the prompt by itself is too long, just error. Nothing is truncated.

@DocShotgun
Member

Well, you have two options here. You could either have the frontend take max_tokens into account and subtract it from the available sequence length (the recommended method), or you could simply not pass max_tokens at all, in which case it defaults to the maximum that will fit.
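A sketch of the recommended client-side budgeting, assuming the client knows the model's max sequence length and has a token-counting helper like the one sketched earlier (both names are illustrative):

```python
def budget_max_tokens(prompt: str, requested_max_tokens: int,
                      max_seq_len: int, count_tokens) -> int:
    """Clamp max_tokens so prompt + completion fits the context window.

    `max_seq_len` is the model's context length and `count_tokens` a
    helper that returns the prompt's token count; both are assumptions
    the caller must supply, not tabby API parameters.
    """
    prompt_len = count_tokens(prompt)
    available = max_seq_len - prompt_len
    if available <= 0:
        raise ValueError("Prompt alone exceeds the context window")
    return min(requested_max_tokens, available)
```

With this, the frontend sends `max_tokens=budget_max_tokens(...)` and the "Job required pages too small" error can't be triggered by the completion length.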

@Originalimoc
Author

Isn't there a default? 250 or 150..?
