Refactor/tokenizers #840

Merged
merged 14 commits into dev from refactor/tokenizers on Jun 11, 2024

Conversation

@collindutter (Member) commented Jun 6, 2024

Added

  • BaseTokenizer.prompt_stack_to_string() to convert a Prompt Stack to a string.
  • BaseTokenizer.prompt_stack_input_to_message() to convert a Prompt Stack Input to a ChatML-style message dictionary.

Changed

  • BREAKING: Removed BasePromptDriver.count_tokens().
  • BREAKING: Removed BasePromptDriver.max_output_tokens().
  • BREAKING: Moved BasePromptDriver.prompt_stack_to_string() to BaseTokenizer.
  • BREAKING: Moved/renamed PromptStack.add_to_conversation_memory to BaseConversationMemory.add_to_prompt_stack.
  • BaseTokenizer.count_tokens() can now approximately count tokens given a Prompt Stack (see the sketch after this list).
  • Updated Prompt Drivers to use BasePromptDriver.max_tokens instead of using BasePromptDriver.max_output_tokens().
  • BREAKING: Moved griptape.constants.RESPONSE_STOP_SEQUENCE to ToolkitTask.
  • ToolkitTask.RESPONSE_STOP_SEQUENCE is now only added when using ToolkitTask.
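
To make the reshaped surface concrete, here is a rough usage sketch. The tokenizer method names come from this changelog; the Prompt Stack helper methods, import path, and constructor arguments are assumptions and may not match the codebase exactly.

```python
from griptape.tokenizers import OpenAiTokenizer
from griptape.utils import PromptStack  # import path is an assumption

tokenizer = OpenAiTokenizer(model="gpt-4")

# Build a Prompt Stack (helper method names are assumptions).
prompt_stack = PromptStack()
prompt_stack.add_system_input("You are a helpful assistant.")
prompt_stack.add_user_input("Summarize what a tokenizer does.")

# Per this changelog: count_tokens() now accepts a Prompt Stack and returns an
# approximate count, and prompt_stack_to_string() lives on the tokenizer rather
# than the Prompt Driver.
print(tokenizer.count_tokens(prompt_stack))
print(tokenizer.prompt_stack_to_string(prompt_stack))
```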

@collindutter collindutter force-pushed the refactor/tokenizers branch 10 times, most recently from 74437d1 to b00d200 on June 6, 2024 22:22
@collindutter collindutter marked this pull request as ready for review on June 6, 2024 22:22
@collindutter collindutter force-pushed the refactor/tokenizers branch 2 times, most recently from eb39a02 to 9e1533e on June 6, 2024 22:32
codecov bot commented Jun 6, 2024

@collindutter collindutter force-pushed the refactor/tokenizers branch 3 times, most recently from 11cdd03 to 4377389 on June 6, 2024 23:01
@collindutter collindutter force-pushed the refactor/tokenizers branch from 4377389 to fb52f7b on June 6, 2024 23:57
@collindutter collindutter force-pushed the refactor/tokenizers branch from fb52f7b to ac2b5da on June 7, 2024 18:46
Comment on lines 73 to 89
def prompt_stack_input_to_message(self, prompt_input: PromptStack.Input) -> dict:
    """Converts a PromptStack Input to a ChatML-style message dictionary for token counting or model input.

    Args:
        prompt_input: The PromptStack Input to convert.

    Returns:
        A dictionary with the role and content of the input.
    """
    content = prompt_input.content

    if prompt_input.is_system():
        return {"role": "system", "content": content}
    elif prompt_input.is_assistant():
        return {"role": "assistant", "content": content}
    else:
        return {"role": "user", "content": content}
Member

Should this be a method on PromptInput?

collindutter (Member Author)

I don't think so; each Tokenizer will have a slightly different implementation of this. In an upcoming PR I'm changing this to an abstract method.
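
For illustration only, a provider-specific override might look like the sketch below; the class name and message shape are invented, and other abstract members are omitted.

```python
from griptape.tokenizers import BaseTokenizer
from griptape.utils import PromptStack  # import path is an assumption


class ExampleProviderTokenizer(BaseTokenizer):
    # Hypothetical override: some providers use different role names and
    # message shapes, which is why the mapping lives on each Tokenizer.
    def prompt_stack_input_to_message(self, prompt_input: PromptStack.Input) -> dict:
        role = "model" if prompt_input.is_assistant() else "user"
        return {"role": role, "parts": [prompt_input.content]}
```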


def prompt_stack_to_string(self, prompt_stack: PromptStack) -> str:
    """Converts a Prompt Stack to a string for token counting or model input.

    This base implementation will not be very accurate, and should be overridden by subclasses with model-specific tokens.
    """
@andrewfrench (Member) commented Jun 10, 2024

Originally same question here, but I sense that the model-specific method means this should be owned by the tokenizer.

collindutter (Member Author)

Honestly we could probably get rid of this method. All of our Prompt Drivers now support message-style APIs.

collindutter (Member Author)

Ah, but now I remember why I didn't remove it in this PR: I didn't want to remove the token counts in the events, which rely on this method. My next PR changes this functionality, so I will remove this method then.


def _default_max_input_tokens(self) -> int:
    tokens = next((v for k, v in self.MODEL_PREFIXES_TO_MAX_INPUT_TOKENS.items() if self.model.startswith(k)), None)

    if tokens is None:
        return self.DEFAULT_MAX_INPUT_TOKENS

    return tokens
@andrewfrench (Member) commented Jun 10, 2024

Seems reasonable to fall back to the default here. Do you think it'd be worth throwing a warning when this happens? There's a big enough difference in expectations between the known values defined in the mapping and a default 'best guess' value, and it's a behavior we seem to have avoided until now.

collindutter (Member Author)

Yeah, a warning is a good idea.
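
For reference, a minimal sketch of what that warning could look like; whether it goes through the logging module or some other mechanism is an open implementation detail, not something decided in this thread.

```python
import logging

def _default_max_input_tokens(self) -> int:
    tokens = next(
        (v for k, v in self.MODEL_PREFIXES_TO_MAX_INPUT_TOKENS.items() if self.model.startswith(k)),
        None,
    )

    if tokens is None:
        # Fall back to the class default, but surface that we are guessing.
        logging.warning(
            "Model %s not found in MODEL_PREFIXES_TO_MAX_INPUT_TOKENS; "
            "falling back to DEFAULT_MAX_INPUT_TOKENS.",
            self.model,
        )
        return self.DEFAULT_MAX_INPUT_TOKENS

    return tokens
```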

@collindutter collindutter force-pushed the refactor/tokenizers branch from 649dcda to 3e0a24a on June 11, 2024 17:47
@dylanholmes (Contributor) left a comment

Just a couple of questions


### Simple
Not all LLM providers have a public tokenizer API. In this case, you can use the `SimpleTokenizer` to count tokens based on a simple heuristic.

```python
from griptape.tokenizers import SimpleTokenizer

tokenizer = SimpleTokenizer(model="any-model", max_input_tokens=1024, max_output_tokens=1024, characters_per_token=6)
```
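
As a quick follow-up on the heuristic: `count_tokens` should yield roughly one token per `characters_per_token` characters. The example below is a sketch, and the exact rounding behavior is an assumption rather than something the excerpt specifies.

```python
text = "Hello, world! This is a short sample sentence."
# With characters_per_token=6, expect roughly len(text) / 6 tokens.
print(tokenizer.count_tokens(text))
```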
Contributor

Just curious, why can't model be optional?

griptape/tasks/prompt_task.py (comment resolved)
@collindutter collindutter requested a review from dylanholmes June 11, 2024 18:39
dylanholmes previously approved these changes Jun 11, 2024
@dylanholmes (Contributor) left a comment

Nice work!

@collindutter collindutter merged commit 331c331 into dev Jun 11, 2024
10 checks passed
@collindutter collindutter deleted the refactor/tokenizers branch June 11, 2024 19:37