Refactor/tokenizers #840

Merged
merged 14 commits into dev from refactor/tokenizers on Jun 11, 2024

Conversation

@collindutter (Member) commented Jun 6, 2024

Added

  • BaseTokenizer.prompt_stack_to_string() to convert a Prompt Stack to a string.
  • BaseTokenizer.prompt_stack_input_to_message() to convert a Prompt Stack Input to a ChatML-style message dictionary.

Changed

  • BREAKING: Removed BasePromptDriver.count_tokens().
  • BREAKING: Removed BasePromptDriver.max_output_tokens().
  • BREAKING: Moved BasePromptDriver.prompt_stack_to_string() to BaseTokenizer.
  • BREAKING: Moved/renamed PromptStack.add_to_conversation_memory to BaseConversationMemory.add_to_prompt_stack.
  • BaseTokenizer.count_tokens() can now approximately count tokens given a Prompt Stack (see the sketch after this list).
  • Updated Prompt Drivers to use BasePromptDriver.max_tokens instead of using BasePromptDriver.max_output_tokens().
  • BREAKING: Moved griptape.constants.RESPONSE_STOP_SEQUENCE to ToolkitTask.
  • ToolkitTask.RESPONSE_STOP_SEQUENCE is now only added when using ToolkitTask.
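
To make the reshaped surface concrete, here is a rough usage sketch. The tokenizer method names come from this changelog; the Prompt Stack helper methods, import path, and constructor arguments are assumptions and may not match the codebase exactly.

```python
from griptape.tokenizers import OpenAiTokenizer
from griptape.utils import PromptStack  # import path is an assumption

tokenizer = OpenAiTokenizer(model="gpt-4")

# Build a Prompt Stack (helper method names are assumptions).
prompt_stack = PromptStack()
prompt_stack.add_system_input("You are a helpful assistant.")
prompt_stack.add_user_input("Summarize what a tokenizer does.")

# Per this changelog: count_tokens() now accepts a Prompt Stack and returns an
# approximate count, and prompt_stack_to_string() lives on the tokenizer rather
# than the Prompt Driver.
print(tokenizer.count_tokens(prompt_stack))
print(tokenizer.prompt_stack_to_string(prompt_stack))
```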

@collindutter collindutter force-pushed the refactor/tokenizers branch 10 times, most recently from 74437d1 to b00d200 on June 6, 2024 22:22
@collindutter collindutter marked this pull request as ready for review on June 6, 2024 22:22
@collindutter collindutter force-pushed the refactor/tokenizers branch 2 times, most recently from eb39a02 to 9e1533e on June 6, 2024 22:32
codecov bot commented Jun 6, 2024

@collindutter collindutter force-pushed the refactor/tokenizers branch 3 times, most recently from 11cdd03 to 4377389 on June 6, 2024 23:01
@collindutter collindutter force-pushed the refactor/tokenizers branch from 4377389 to fb52f7b on June 6, 2024 23:57
@collindutter collindutter force-pushed the refactor/tokenizers branch from fb52f7b to ac2b5da on June 7, 2024 18:46
Comment on lines 73 to 89
def prompt_stack_input_to_message(self, prompt_input: PromptStack.Input) -> dict:
    """Converts a PromptStack Input to a ChatML-style message dictionary for token counting or model input.

    Args:
        prompt_input: The PromptStack Input to convert.

    Returns:
        A dictionary with the role and content of the input.
    """
    content = prompt_input.content

    if prompt_input.is_system():
        return {"role": "system", "content": content}
    elif prompt_input.is_assistant():
        return {"role": "assistant", "content": content}
    else:
        return {"role": "user", "content": content}
Member

Should this be a method on PromptInput?

collindutter (Member Author)

I don't think so; each Tokenizer will have a slightly different implementation of this. In an upcoming PR I'm changing this to an abstract method.
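
For illustration only, a provider-specific override might look like the sketch below; the class name and message shape are invented, and other abstract members are omitted.

```python
from griptape.tokenizers import BaseTokenizer
from griptape.utils import PromptStack  # import path is an assumption


class ExampleProviderTokenizer(BaseTokenizer):
    # Hypothetical override: some providers use different role names and
    # message shapes, which is why the mapping lives on each Tokenizer.
    def prompt_stack_input_to_message(self, prompt_input: PromptStack.Input) -> dict:
        role = "model" if prompt_input.is_assistant() else "user"
        return {"role": role, "parts": [prompt_input.content]}
```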


def prompt_stack_to_string(self, prompt_stack: PromptStack) -> str:
    """Converts a Prompt Stack to a string for token counting or model input.

    This base implementation will not be very accurate, and should be overridden by subclasses with model-specific tokens.
    """
@andrewfrench (Member) commented Jun 10, 2024

Originally same question here, but I sense that the model-specific method means this should be owned by the tokenizer.

collindutter (Member Author)

Honestly we could probably get rid of this method. All of our Prompt Drivers now support message-style APIs.

collindutter (Member Author)

Ah, but now I remember why I didn't remove it in this PR: I didn't want to remove the token counts in the events, which rely on this method. My next PR changes this functionality, so I will remove this method then.


def _default_max_input_tokens(self) -> int:
    tokens = next((v for k, v in self.MODEL_PREFIXES_TO_MAX_INPUT_TOKENS.items() if self.model.startswith(k)), None)

    if tokens is None:
        return self.DEFAULT_MAX_INPUT_TOKENS

    return tokens
@andrewfrench (Member) commented Jun 10, 2024

Seems reasonable to fall back to the default here. Do you think it'd be worth throwing a warning when this happens? There's a big enough difference in expectations between the known values defined in the mapping and a default 'best guess' value, and it's a behavior we seem to have avoided until now.

collindutter (Member Author)

Yeah, a warning is a good idea.
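
For reference, a minimal sketch of what that warning could look like; whether it goes through the logging module or some other mechanism is an open implementation detail, not something decided in this thread.

```python
import logging

def _default_max_input_tokens(self) -> int:
    tokens = next(
        (v for k, v in self.MODEL_PREFIXES_TO_MAX_INPUT_TOKENS.items() if self.model.startswith(k)),
        None,
    )

    if tokens is None:
        # Fall back to the class default, but surface that we are guessing.
        logging.warning(
            "Model %s not found in MODEL_PREFIXES_TO_MAX_INPUT_TOKENS; "
            "falling back to DEFAULT_MAX_INPUT_TOKENS.",
            self.model,
        )
        return self.DEFAULT_MAX_INPUT_TOKENS

    return tokens
```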

@collindutter collindutter force-pushed the refactor/tokenizers branch from 649dcda to 3e0a24a on June 11, 2024 17:47
@dylanholmes (Contributor) left a comment

Just a couple of questions


### Simple
Not all LLM providers have a public tokenizer API. In this case, you can use the `SimpleTokenizer` to count tokens based on a simple heuristic.

```python
from griptape.tokenizers import SimpleTokenizer

tokenizer = SimpleTokenizer(model="any-model", max_input_tokens=1024, max_output_tokens=1024, characters_per_token=6)
```
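
As a quick follow-up on the heuristic: `count_tokens` should yield roughly one token per `characters_per_token` characters. The example below is a sketch, and the exact rounding behavior is an assumption rather than something the excerpt specifies.

```python
text = "Hello, world! This is a short sample sentence."
# With characters_per_token=6, expect roughly len(text) / 6 tokens.
print(tokenizer.count_tokens(text))
```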
Contributor

Just curious, why can't model be optional?

griptape/tasks/prompt_task.py (comment resolved)
@collindutter collindutter requested a review from dylanholmes June 11, 2024 18:39
dylanholmes previously approved these changes Jun 11, 2024
@dylanholmes (Contributor) left a comment

Nice work!

@collindutter collindutter merged commit 331c331 into dev Jun 11, 2024
10 checks passed
@collindutter collindutter deleted the refactor/tokenizers branch June 11, 2024 19:37