Releases: mistralai/mistral-common

Patch Release v1.5.1

20 Nov 18:07

What's Changed

Full Changelog: v1.5.0...v1.5.1

1.5.0 - Mistral Tokenizer v7 (new System Prompt + Fn calling)

15 Nov 19:32

Mistral's newest tokenizer has two major improvements:

System prompt

Similar to other tokenization schemes, the system prompt is now treated as a "normal" message wrapped in [SYSTEM_PROMPT] ... [/SYSTEM_PROMPT].

E.g.

from mistral_common.protocol.instruct.messages import (
    UserMessage,
    SystemMessage,
    AssistantMessage,
)
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load Mistral tokenizer
tokenizer = MistralTokenizer.v7()

# Tokenize a list of messages
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            SystemMessage(content="You are a funny AI assistant. Always make jokes."),
            UserMessage(content="What's the weather like today in Paris"),
        ],
        model="joker",
    )
)
tokens, text = tokenized.tokens, tokenized.text

print(text)
# <s>[SYSTEM_PROMPT]▁You▁are▁a▁funny▁AI▁assistant.▁Always▁make▁jokes.[/SYSTEM_PROMPT][INST]▁What's▁the▁weather▁like▁today▁in▁Paris[/INST]
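
To make the layout explicit, here is a minimal sketch that rebuilds the printed text for one system message followed by one user message. It is plain string formatting that mimics the output above, not the library's implementation; `render_v7_text` and `SPACE` are hypothetical names introduced only for this illustration.

```python
# Illustration only: mimics the debug text printed above, not mistral_common internals.
SPACE = "\u2581"  # the visible space marker that appears in the tokenized text


def render_v7_text(system_prompt: str, user_message: str) -> str:
    """Hypothetical helper: lay out one system turn and one user turn."""

    def mark(s: str) -> str:
        # Prefix the text with the marker and replace every space with it.
        return SPACE + s.replace(" ", SPACE)

    return (
        "<s>"
        f"[SYSTEM_PROMPT]{mark(system_prompt)}[/SYSTEM_PROMPT]"
        f"[INST]{mark(user_message)}[/INST]"
    )


text = render_v7_text(
    "You are a funny AI assistant. Always make jokes.",
    "What's the weather like today in Paris",
)
print(text)
```

The point of the sketch is that the system prompt gets its own delimited segment ahead of the [INST] block instead of being merged into the user turn.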

Improve function calling

A new [TOOL_CONTENT] token is added. Models trained with it correctly should call functions more accurately.

from mistral_common.protocol.instruct.messages import (
    UserMessage,
    SystemMessage,
    AssistantMessage,
    ToolMessage
)
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.protocol.instruct.tool_calls import (
    Function,
    Tool,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load Mistral tokenizer
tokenizer = MistralTokenizer.v7()

tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        tools=[
            Tool(
                function=Function(
                    name="get_current_weather",
                    description="Get the current weather",
                    parameters={
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The city and state, e.g. San Francisco, CA",
                            },
                            "format": {
                                "type": "string",
                                "enum": ["celsius", "fahrenheit"],
                                "description": "The temperature unit to use. Infer this from the users location.",
                            },
                        },
                        "required": ["location", "format"],
                    },
                )
            )
        ],
        messages=[
            UserMessage(content="What's the weather like today in Paris"),
            AssistantMessage(
                content="",
                tool_calls=[
                    {
                        "id": "bbc5b7ede",
                        "type": "function",
                        "function": {
                            "name": "weather",
                            "arguments": '{"location": "Paris", "format": "celsius"}',
                        },
                    }
                ],
            ),
            ToolMessage(content="24 degrees celsius", tool_call_id="bbc5b7ede"),
        ],
        model="joker",
    )
)
tokens, text = tokenized.tokens, tokenized.text

# Print the tokenized text
print(text)
# <s>[AVAILABLE_TOOLS]▁[{"type":▁"function",▁"function":▁{"name":▁"get_current_weather",▁"description":▁"Get▁the▁current▁weather",▁"parameters":▁{"type":▁"object",▁"properties":▁{"location":▁{"type":▁"string",▁"description":▁"The▁city▁and▁state,▁e.g.▁San▁Francisco,▁CA"},▁"format":▁{"type":▁"string",▁"enum":▁["celsius",▁"fahrenheit"],▁"description":▁"The▁temperature▁unit▁to▁use.▁Infer▁this▁from▁the▁users▁location."}},▁"required":▁["location",▁"format"]}}}][/AVAILABLE_TOOLS][INST]▁What's▁the▁weather▁like▁today▁in▁Paris[/INST][TOOL_CALLS]▁[{"name":▁"weather",▁"arguments":▁{"location":▁"Paris",▁"format":▁"celsius"},▁"id":▁"bbc5b7ede"}]</s>[TOOL_RESULTS]▁bbc5b7ede[TOOL_CONTENT]▁24▁degrees▁celsius[/TOOL_RESULTS]
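
One way to see what [TOOL_CONTENT] buys is to note that the tool call id and the tool output are now separated by an explicit marker. The sketch below splits them back out of the printed debug text with a regular expression. It works on the text form only, not on token ids, and `parse_tool_results` is a hypothetical helper rather than part of mistral_common.

```python
import re

SPACE = "\u2581"  # visible space marker used in the tokenized text

# Hypothetical pattern for the [TOOL_RESULTS] segment shown in the output above.
TOOL_RESULT_RE = re.compile(
    r"\[TOOL_RESULTS\](?P<id>.*?)\[TOOL_CONTENT\](?P<content>.*?)\[/TOOL_RESULTS\]",
    re.DOTALL,
)


def parse_tool_results(text: str) -> list[tuple[str, str]]:
    """Return (tool_call_id, content) pairs found in the debug text."""

    def clean(s: str) -> str:
        # Turn the space markers back into ordinary spaces.
        return s.replace(SPACE, " ").strip()

    return [
        (clean(m["id"]), clean(m["content"]))
        for m in TOOL_RESULT_RE.finditer(text)
    ]


sample = (
    "[TOOL_RESULTS]\u2581bbc5b7ede"
    "[TOOL_CONTENT]\u258124\u2581degrees\u2581celsius[/TOOL_RESULTS]"
)
results = parse_tool_results(sample)
print(results)
```

Because the id and the content sit on either side of a dedicated marker, neither has to be inferred from the other.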

Patch release - v1.4.4

29 Sep 13:02
21ee9f6

Make sure that broken cv2 installations in user environments (which sadly happen more often than not) don't prevent users from using text-only models.
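
A common way to get this behavior is a guarded import that only raises when image functionality is actually requested. The following is a minimal sketch of that pattern under stated assumptions: `try_import` and `require_cv2` are hypothetical names, and this is not necessarily how mistral_common implements it.

```python
import importlib


def try_import(module_name: str):
    """Return the imported module, or None if it is missing or fails to import."""
    try:
        return importlib.import_module(module_name)
    except Exception:  # a broken install can fail in more ways than a missing one
        return None


# Text-only code paths never need this to succeed.
cv2 = try_import("cv2")


def require_cv2():
    """Hypothetical guard: call this only from image-handling code paths."""
    if cv2 is None:
        raise ImportError("Image support needs a working OpenCV (cv2) installation.")
    return cv2
```

With this shape, importing the package and tokenizing text works even when cv2 is absent or broken, and the error appears only when an image is processed.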

What's Changed

Full Changelog: v1.4.3...v1.4.4

Patch release v1.4.3 - Make cv2 install optional

22 Sep 16:01
ce9ce79

As discussed in vllm-project/vllm#8650, cv2 is now an optional dependency.

What's Changed

Full Changelog: v1.4.2...v1.4.3

Patch release v1.4.2

18 Sep 12:46
992f4a0

Image downloads now send a user agent, so pictures hosted on servers that require one can be fetched. E.g.:

from mistral_common.protocol.instruct.messages import (
    UserMessage,
    TextChunk,
    ImageURLChunk,
    ImageChunk,
)
from PIL import Image
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.from_model("pixtral")

url_dog = "https://picsum.photos/id/237/200/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"
url1 = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"


# tokenize image urls and text
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            UserMessage(
                content=[
                    TextChunk(text="Can this animal"),
                    ImageURLChunk(image_url=url1),
                    TextChunk(text="live here?"),
                    ImageURLChunk(image_url=url2),
                ]
            )
        ],
        model="pixtral",
    )
)
tokens, text, images = tokenized.tokens, tokenized.text, tokenized.images

# Count the number of tokens
print("# tokens", len(tokens))
print("# images", len(images))
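
For readers unfamiliar with the underlying issue, some image hosts reject requests that carry no User-Agent header. Here is a minimal standard-library sketch of attaching one. The helper name and the user-agent string are hypothetical, and the HTTP client that mistral_common uses internally may differ.

```python
import urllib.request


def build_image_request(url: str, user_agent: str = "example-client/0.1") -> urllib.request.Request:
    """Hypothetical helper: build a request that identifies itself to the server."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_image_request("https://picsum.photos/id/237/200/300")

# Passing `req` to urllib.request.urlopen would perform the actual download;
# it is left out here so the sketch runs without network access.
print(req.get_header("User-agent"))
```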

What's Changed

Full Changelog: v1.4.1...v1.4.2

Patch release v1.4.1 - Use cv2 resize instead of PIL

17 Sep 08:44
bae45b2

cv2 resize gives significantly better results than PIL when running Pixtral inference, so this patch release switches image resizing to cv2, as shown in commit bae45b2.

v1.4.0 - Mistral common goes 🖼️

10 Sep 22:44
7b88116

Pixtral is out!

Mistral common has image support! You can now pass images and URLs alongside text into the user message.

pip install --upgrade mistral_common

Images

You can encode images as follows:

from mistral_common.protocol.instruct.messages import (
    UserMessage,
    TextChunk,
    ImageURLChunk,
    ImageChunk,
)
from PIL import Image
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.from_model("pixtral")

image = Image.new('RGB', (64, 64))

# tokenize images and text
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            UserMessage(
                content=[
                    TextChunk(text="Describe this image"),
                    ImageChunk(image=image),
                ]
            )
        ],
        model="pixtral",
    )
)
tokens, text, images = tokenized.tokens, tokenized.text, tokenized.images

# Count the number of tokens
print("# tokens", len(tokens))
print("# images", len(images))

Image URLs

You can pass image URLs, which will be downloaded automatically:

url_dog = "https://picsum.photos/id/237/200/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"

# tokenize image urls and text
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            UserMessage(
                content=[
                    TextChunk(text="Can this animal"),
                    ImageURLChunk(image_url=url_dog),
                    TextChunk(text="live here?"),
                    ImageURLChunk(image_url=url_mountain),
                ]
            )
        ],
        model="pixtral",
    )
)
tokens, text, images = tokenized.tokens, tokenized.text, tokenized.images

# Count the number of tokens
print("# tokens", len(tokens))
print("# images", len(images))

ImageData

You can also pass an image encoded as base64:

tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            UserMessage(
                content=[
                    TextChunk(text="What is this?"),
                    ImageURLChunk(image_url="...
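
The example above is cut off in the source. As a separate illustration, here is a minimal sketch of turning raw image bytes into a base64 data URL with the standard library. The exact string that ImageURLChunk expects is not visible in the truncated snippet, so the data-URL form used here is an assumption, and `to_data_url` is a hypothetical helper.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Hypothetical helper: encode raw bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The 8-byte PNG signature stands in for real image data.
data_url = to_data_url(b"\x89PNG\r\n\x1a\n")
print(data_url)
```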

Patch release 1.3.4 - Loosen pydantic requirement

15 Aug 10:11

In this patch release, the pydantic requirement is loosened to <= 3.0.0, as requested in multiple issues.

Tekkenizer

18 Jul 14:01

Tekkenizer

The new Tekkenizer class is based on OpenAI's tiktoken and supports the new Mistral-Nemo model.

Tekkenizer always uses tokenizer version 3 or higher.

Examples:

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.v3(is_tekken=True)
tokenizer = MistralTokenizer.from_model("...")

Function calling (just like before)

# Import needed packages:
from mistral_common.protocol.instruct.messages import (
    UserMessage,
)
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.protocol.instruct.tool_calls import (
    Function,
    Tool,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load Mistral tokenizer

model_name = "..."

tokenizer = MistralTokenizer.from_model(model_name)

# Tokenize a list of messages
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        tools=[
            Tool(
                function=Function(
                    name="get_current_weather",
                    description="Get the current weather",
                    parameters={
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The city and state, e.g. San Francisco, CA",
                            },
                            "format": {
                                "type": "string",
                                "enum": ["celsius", "fahrenheit"],
                                "description": "The temperature unit to use. Infer this from the users location.",
                            },
                        },
                        "required": ["location", "format"],
                    },
                )
            )
        ],
        messages=[
            UserMessage(content="What's the weather like today in Paris"),
        ],
        model=model_name,
    )
)
tokens, text = tokenized.tokens, tokenized.text

# Count the number of tokens
print(len(tokens))

What's Changed

Full Changelog: v1.3.0...v1.3.1

Patch release: Fix FIM tokenizer

30 May 10:37

As noticed here: https://huggingface.co/mistralai/Codestral-22B-v0.1/discussions/10

The wrong tokenizer was used for FIM. This patch release fixes that so that the following works correctly:

from mistral_common.tokens.tokenizers.base import FIMRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.v3()
tokenized = tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b"))
assert tokenized.text == "<s>[SUFFIX]return▁a▁+▁b[PREFIX]▁def▁f("