Max tokens per question and context #54

Open · wants to merge 2 commits into main

Conversation

reavessm

Closes: RHCLOUD-34879

Collaborator

@bsquizz bsquizz left a comment


Nice, you're on the right track here. I looked into the models supported by tiktoken out of the box, and it doesn't support the LLM we've been using so far (mistral-7b-instruct). However, it looks like Mistral has some Python libraries that can calculate the number of tokens: https://docs.mistral.ai/guides/tokenization/

The only catch is that it looks like we'll have to convert the langchain message history into the equivalent "mistral-common" classes. So you'd have to copy the msg_list but use a different class for each message:

  • HumanMessage becomes UserMessage
  • SystemMessage is also named SystemMessage
  • I think AIMessage becomes AssistantMessage

Or I think you could use ChatMessage for all of them and set the role field appropriately. A rough sketch of the conversion is below.
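
For illustration, here is a rough, untested sketch of that conversion based on the mistral-common tokenization guide. The import paths, the `to_mistral()` / `count_tokens()` helper names, and the tokenizer choice (`MistralTokenizer.v3()`) are assumptions rather than code from this PR; it also assumes the history starts with a system or user message and ends with a user message, which the request validation appears to expect.

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    SystemMessage as MistralSystemMessage,
    UserMessage,
)
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer


def to_mistral(msg):
    """Map a langchain message to its mistral-common equivalent."""
    if isinstance(msg, HumanMessage):
        return UserMessage(content=msg.content)
    if isinstance(msg, SystemMessage):
        return MistralSystemMessage(content=msg.content)
    if isinstance(msg, AIMessage):
        return AssistantMessage(content=msg.content)
    raise ValueError(f"unexpected message type: {type(msg)}")


def count_tokens(msg_list, tokenizer):
    """Count tokens for a whole conversation of langchain messages."""
    request = ChatCompletionRequest(messages=[to_mistral(m) for m in msg_list])
    return len(tokenizer.encode_chat_completion(request).tokens)


tokenizer = MistralTokenizer.v3()
```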


Lastly, I think that when we trim the messages we need to make sure we keep the initial SystemMessage in the list, because that is what instructs the model on how to behave.

@@ -44,15 +45,31 @@ def ask(self, system_prompt, previous_messages, question, agent_id, stream):
prompt_params = {"context": context_text, "question": question}
log.debug("search result: %s", context_text)

# If tiktoken doesn't support our model, default to gpt2
try:
text_splitter = tiktoken.encoding_for_model(cfg.LLM_MODEL_NAME)
Collaborator


The LLM_MODEL_NAME can actually be an arbitrary name; it's an identifier used on the server side. What I mean is, the model might be accessed using the name mistral-7b-instruct on requests made to the hosting server, but the actual model is Mistral-7B-Instruct-v0.3.

So I think you'll want another config variable, something like cfg.TOKENIZER_MODEL_NAME.
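
For illustration, a minimal config sketch of that split; the env-var handling and default values here are assumptions, not the project's actual config:

```python
import os

# Name used on requests to the hosting server (arbitrary server-side identifier)
LLM_MODEL_NAME = os.getenv("LLM_MODEL_NAME", "mistral-7b-instruct")

# Name of the actual model, used only to pick the right tokenizer for token counting
TOKENIZER_MODEL_NAME = os.getenv("TOKENIZER_MODEL_NAME", "Mistral-7B-Instruct-v0.3")
```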

Author

@reavessm reavessm Sep 19, 2024


> when we trim the messages we need to make sure we keep the initial SystemMessage in the list because that is instructing the model on how to behave

We're not really "trimming" as much as we're "selectively not adding". This nuance matters because the selection (token length calculation) happens after we add the default prompt here: https://github.com/RedHatInsights/tangerine-backend/pull/54/files#diff-abbf9cb2997932bbf240cd1e9f186f47e5c4c6cc15305f52aece62baa3e0fed1R56.

So if there's anything else we want to make sure we keep, we can add it before the previous_messages loop as well.
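
To illustrate that ordering, a rough, untested sketch (not the PR's actual code): `num_tokens_for()` is a hypothetical helper that counts tokens for a single message (e.g. via mistral-common, as in the diffs below), and `system_prompt`, `previous_messages`, and `cfg` are assumed to come from the `ask()` function in this PR.

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

# Anything that must always be kept is added (and counted) before the loop,
# so the budget check below only ever skips older history.
msg_list = [SystemMessage(content=system_prompt)]
total_tokens = num_tokens_for(msg_list[0])

for msg in previous_messages:
    if msg["sender"] == "human":
        candidate = HumanMessage(content=f"[INST] {msg['text']} [/INST]")
    else:
        candidate = AIMessage(content=f"{msg['text']}</s>")
    candidate_tokens = num_tokens_for(candidate)
    if total_tokens + candidate_tokens >= cfg.MAX_TOKENS_CONTEXT:
        break  # "selectively not adding" the rest of the history
    msg_list.append(candidate)
    total_tokens += candidate_tokens
```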

Stephen Reaves added 2 commits September 25, 2024 14:21
Signed-off-by: Stephen Reaves <[email protected]>
Signed-off-by: Stephen Reaves <[email protected]>
Collaborator

@bsquizz bsquizz left a comment


Getting closer here I think :) But I have several questions/comments

msg_list.append(AIMessage(content=f"{msg['text']}</s>"))
# The tokenizer requires that every request begins with a
# SystemMessage or a UserMessage, so we tokenize the AI
# response as a UserMessage, but append to the list as an
Collaborator


Would the AssistantMessage be the right one to use here?
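
For example, a possible (untested) variant; `text_splitter` is the MistralTokenizer instance from this diff, and the leading UserMessage placeholder is only there because the request has to begin with a system or user message. If the validator rejects an assistant-final conversation, `AssistantMessage(..., prefix=True)` may be needed.

```python
from mistral_common.protocol.instruct.messages import AssistantMessage, UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

num_tokens = len(
    text_splitter.encode_chat_completion(
        ChatCompletionRequest(
            messages=[
                UserMessage(content="..."),  # placeholder for the preceding turn
                AssistantMessage(content=msg["text"]),
            ]
        )
    ).tokens
)
```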

if msg["sender"] == "human":
msg_list.append(HumanMessage(content=f"[INST] {msg['text']} [/INST]"))
token_list = len(text_splitter.encode_chat_completion(ChatCompletionRequest(messages=[UserMessage(content=msg["text"])])).tokens)
Collaborator


Would it be OK if we change the name of this variable? I was expecting token_list to be a list type... but it is just an integer, right?

# SystemMessage or a UserMessage, so we tokenize the AI
# response as a UserMessage, but append to the list as an
# AIMessage.
token_list = len(text_splitter.encode_chat_completion(ChatCompletionRequest(messages=[UserMessage(content=f"{msg['text']}</s>")])).tokens)
Collaborator


Same note about the name of token_list as my prior comment.


prompt = ChatPromptTemplate.from_template(cfg.USER_PROMPT_TEMPLATE)
prompt_params = {"context": context_text, "question": question}
log.debug("search result: %s", context_text)

text_splitter = MistralTokenizer.v3(is_tekken=True)
Collaborator


Should this variable be named tokenizer, so that it doesn't get confused with other text splitters in the codebase?

# Tokenizer doesn't like including the first two tokens when
# decoding...
if len(tokens_question) > MAX_TOKENS_QUESTION+2:
log.debug("Question too big, truncating...")
Collaborator


I'm not sure we ever want to truncate the question that was asked. Or at least, hopefully we'd never need to unless the user asked a ridiculously long question. I think we should only look at the total of the current question + message history and start to drop previous_messages if the number of tokens becomes too large. I think all the token counting and truncation could happen within the ask function; let me know if this idea is off base.
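
For illustration, a rough, untested sketch of that idea (not this PR's code): `num_tokens_for()` is the same hypothetical per-message counter as in the earlier sketch, and `msg_list[0]` is assumed to be the initial SystemMessage that should never be dropped.

```python
from langchain_core.messages import HumanMessage

question_tokens = num_tokens_for(HumanMessage(content=question))
per_message_tokens = [num_tokens_for(m) for m in msg_list]

# Drop the oldest history (index 1 onward) until question + history fits;
# the question itself is never truncated.
while len(msg_list) > 1 and question_tokens + sum(per_message_tokens) >= cfg.MAX_TOKENS_CONTEXT:
    msg_list.pop(1)
    per_message_tokens.pop(1)
```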


total_tokens += token_list
if token_list + total_tokens >= cfg.MAX_TOKENS_CONTEXT:
print()
Collaborator


Was this print in there for debugging? Should it be removed?

@@ -38,28 +43,55 @@ def ask(self, system_prompt, previous_messages, question, agent_id, stream):
if "title" in metadata:
title = metadata["title"]
context_text += f", document title: '{title}'"
context_text += ">>\n\n" f"{page_content}\n\n" f"<<Search result {i+1} END>>\n"
context_text += (">>\n\n" f"{page_content}\n\n" f"<<Search result {i+1} END>>\n")

prompt = ChatPromptTemplate.from_template(cfg.USER_PROMPT_TEMPLATE)
prompt_params = {"context": context_text, "question": question}
Collaborator


Here's where we become aware of the question content. We need to somehow count the number of tokens for the question somewhere around here and add it to the running total, right?
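
For example, something along these lines (untested sketch; `tokenizer` is the MistralTokenizer instance, named text_splitter elsewhere in this diff, and `total_tokens` is the running counter):

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

question_tokens = len(
    tokenizer.encode_chat_completion(
        ChatCompletionRequest(messages=[UserMessage(content=question)])
    ).tokens
)
total_tokens += question_tokens
```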
