diff --git a/docs/capabilities/code-generation.mdx b/docs/capabilities/code-generation.mdx
index 6f4f9dd..bea198f 100644
--- a/docs/capabilities/code-generation.mdx
+++ b/docs/capabilities/code-generation.mdx
@@ -236,8 +236,8 @@ curl --location "https://api.mistral.ai/v1/chat/completions" \
-## Codestral-Mamba
-We have also released Codestral-Mamba 7B, a Mamba2 language model specilized in code generation with the instruct endpoint.
+## Codestral Mamba
+We have also released Codestral Mamba 7B, a Mamba 2 language model specialized in code generation, available via our instruct endpoint.
```python
@@ -278,9 +278,9 @@ curl --location "https://api.mistral.ai/v1/chat/completions" \
-## Open-weight Codestral and Codestral-Mamba
+## Open-weight Codestral and Codestral Mamba
Codestral is available open-weight under the [Mistral AI Non-Production (MNPL) License](https://mistral.ai/licences/MNPL-0.1.md) and
-Codestral-Mamba is available open-weight under the Apache 2.0 license.
+Codestral Mamba is available open-weight under the Apache 2.0 license.
Check out the README of [mistral-inference](https://github.com/mistralai/mistral-inference) to learn how to use `mistral-inference` to run Codestral.
diff --git a/docs/getting-started/Open-weight-models.mdx b/docs/getting-started/Open-weight-models.mdx
index edffa8a..5723123 100644
--- a/docs/getting-started/Open-weight-models.mdx
+++ b/docs/getting-started/Open-weight-models.mdx
@@ -6,17 +6,18 @@ sidebar_position: 1.4
We open-source both pre-trained models and fine-tuned models. These models are not tuned for safety as we want to empower users to test and refine moderation based on their use cases. For safer models, follow our [guardrailing tutorial](/capabilities/guardrailing).
-| Model |Open-weight|API| Description | Max Tokens| Endpoint|
+| Model | Available Open-weight|Available via API| Description | Max Tokens| API Endpoints|
|--------------------|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
-| Mistral 7B | :heavy_check_mark: <br/> Apache2 |:heavy_check_mark: |The first dense model released by Mistral AI, perfect for experimentation, customization, and quick iteration. At the time of the release, it matched the capabilities of models up to 30B parameters. Learn more on our [blog post](https://mistral.ai/news/announcing-mistral-7b/)| 32k | `open-mistral-7b` <br/> (aka `mistral-tiny-2312`)|
-| Mixtral 8x7B |:heavy_check_mark: <br/> Apache2 | :heavy_check_mark: |A sparse mixture of experts model. As such, it leverages up to 45B parameters but only uses about 12B during inference, leading to better inference throughput at the cost of more vRAM. Learn more on the dedicated [blog post](https://mistral.ai/news/mixtral-of-experts/)| 32k | `open-mixtral-8x7b` <br/> (aka `mistral-small-2312`) |
-| Mixtral 8x22B |:heavy_check_mark: <br/> Apache2 | :heavy_check_mark: |A bigger sparse mixture of experts model with larger context window. As such, it leverages up to 141B parameters but only uses about 39B during inference, leading to better inference throughput at the cost of more vRAM. Learn more on the dedicated [blog post](https://mistral.ai/news/mixtral-8x22b/)| 64k | `open-mixtral-8x22b`|
+| Mistral 7B | :heavy_check_mark: <br/> Apache2 |:heavy_check_mark: |The first dense model released by Mistral AI, perfect for experimentation, customization, and quick iteration. At the time of the release, it matched the capabilities of models up to 30B parameters. Learn more on our [blog post](https://mistral.ai/news/announcing-mistral-7b/)| 32k | `open-mistral-7b`|
+| Mixtral 8x7B |:heavy_check_mark: <br/> Apache2 | :heavy_check_mark: |A sparse mixture of experts model. As such, it leverages up to 45B parameters but only uses about 12B during inference, leading to better inference throughput at the cost of more vRAM. Learn more on the dedicated [blog post](https://mistral.ai/news/mixtral-of-experts/)| 32k | `open-mixtral-8x7b`|
+| Mixtral 8x22B |:heavy_check_mark: <br/> Apache2 | :heavy_check_mark: |A bigger sparse mixture of experts model. As such, it leverages up to 141B parameters but only uses about 39B during inference, leading to better inference throughput at the cost of more vRAM. Learn more on the dedicated [blog post](https://mistral.ai/news/mixtral-8x22b/)| 64k | `open-mixtral-8x22b`|
| Codestral |:heavy_check_mark: <br/> MNPL|:heavy_check_mark: | A cutting-edge generative model that has been specifically designed and optimized for code generation tasks, including fill-in-the-middle and code completion | 32k | `codestral-latest`|
-| Codestral-Mamba | :heavy_check_mark: | :heavy_check_mark: | A Mamba 2 language model specialized in code generation. Learn more on our [blog post](https://mistral.ai/news/codestral-mamba/) | 256k | `codestral-mamba-latest`|
-| Mathstral | :heavy_check_mark: | :heavy_check_mark: | A math-specific 7B model designed for math reasoning and scientific tasks. Learn more on our [blog post](https://mistral.ai/news/mathstral/) | 32k | NA|
+| Codestral Mamba | :heavy_check_mark: <br/> Apache2 | :heavy_check_mark: | A Mamba 2 language model specialized in code generation. Learn more on our [blog post](https://mistral.ai/news/codestral-mamba/) | 256k | `open-codestral-mamba`|
+| Mathstral | :heavy_check_mark: <br/> Apache2 | | A math-specific 7B model designed for math reasoning and scientific tasks. Learn more on our [blog post](https://mistral.ai/news/mathstral/) | 32k | NA|
+| Mistral NeMo | :heavy_check_mark: <br/> Apache2 | :heavy_check_mark: | A 12B model built in partnership with Nvidia. It is easy to use and a drop-in replacement for any system using Mistral 7B, which it supersedes. Learn more on our [blog post](https://mistral.ai/news/mistral-nemo/) | 128k | `open-mistral-nemo`|
## License
-- Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Codestral-Mamba, and Mathstral are under [Apache 2 License](https://choosealicense.com/licenses/apache-2.0/), which permits their use without any constraints.
+- Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Codestral Mamba, Mathstral, and Mistral NeMo are under [Apache 2 License](https://choosealicense.com/licenses/apache-2.0/), which permits their use without any constraints.
- Codestral is under [Mistral AI Non-Production (MNPL) License](https://mistral.ai/licences/MNPL-0.1.md).
@@ -38,6 +39,8 @@ We open-source both pre-trained models and fine-tuned models. These models are n
| Codestral-22B-v0.1 | [Hugging Face](https://huggingface.co/mistralai/Codestral-22B-v0.1) <br/> [raw_weights](https://models.mistralcdn.com/codestral-22b-v0-1/codestral-22B-v0.1.tar) (md5sum: `1ea95d474a1d374b1d1b20a8e0159de3`) | - 32768 vocabulary size <br/> - Supports v3 Tokenizer |
| Codestral-Mamba-7B-v0.1 | [Hugging Face](https://huggingface.co/mistralai/mamba-codestral-7B-v0.1) <br/> [raw_weights](https://models.mistralcdn.com/codestral-mamba-7b-v0-1/codestral-mamba-7B-v0.1.tar)(md5sum: `d3993e4024d1395910c55db0d11db163`) | - 32768 vocabulary size <br/> - Supports v3 Tokenizer |
| Mathstral-7B-v0.1 | [Hugging Face](https://huggingface.co/mistralai/mathstral-7B-v0.1) <br/> [raw_weights](https://models.mistralcdn.com/mathstral-7b-v0-1/mathstral-7B-v0.1.tar)(md5sum: `5f05443e94489c261462794b1016f10b`) | - 32768 vocabulary size <br/> - Supports v3 Tokenizer |
+| Mistral-NeMo-Base-2407 | [Hugging Face](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) <br/> [raw_weights](https://models.mistralcdn.com/mistral-nemo-2407/mistral-nemo-base-2407.tar)(md5sum: `c5d079ac4b55fc1ae35f51f0a3c0eb83`) | - 131k vocabulary size <br/> - Supports tekken.json tokenizer |
+| Mistral-NeMo-Instruct-2407 | [Hugging Face](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) <br/> [raw_weights](https://models.mistralcdn.com/mistral-nemo-2407/mistral-nemo-instruct-2407.tar)(md5sum: `296fbdf911cb88e6f0be74cd04827fe7`) | - 131k vocabulary size <br/> - Supports tekken.json tokenizer <br/> - Supports function calling |
## Sizes
@@ -50,6 +53,7 @@ We open-source both pre-trained models and fine-tuned models. These models are n
| Codestral-22B-v0.1 | 22.2B | 22.2B | 60 |
| Codestral-Mamba-7B-v0.1 | 7.3B | 7.3B | 16 |
| Mathstral-7B-v0.1 | 7.3B | 7.3B | 16 |
+| Mistral-NeMo-12B-v0.1 | 12B | 12B | 28 - bf16 <br/> 16 - fp8 |
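+
+These minimums roughly track the parameter count times the bytes per parameter (2 for bf16, 1 for fp8), plus headroom for activations and cache. The sketch below is only a back-of-the-envelope estimate with an assumed 20% headroom; the figures in the table above are the reference values.
+
+```python
+def approx_min_gpu_ram_gb(params_billion: float, bytes_per_param: float, headroom: float = 1.2) -> float:
+    # Weights only, plus ~20% headroom (an assumption) for activations and KV cache.
+    return params_billion * bytes_per_param * headroom
+
+print(round(approx_min_gpu_ram_gb(12, 2)))  # bf16: ~29 GB, close to the 28 GB listed above
+print(round(approx_min_gpu_ram_gb(12, 1)))  # fp8:  ~14 GB, close to the 16 GB listed above
+```
+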
## How to run?
Check out [mistral-inference](https://github.com/mistralai/mistral-inference/), a Python package for running our models. You can install `mistral-inference` by
diff --git a/docs/getting-started/changelog.mdx b/docs/getting-started/changelog.mdx
index d63471c..30c14ee 100644
--- a/docs/getting-started/changelog.mdx
+++ b/docs/getting-started/changelog.mdx
@@ -6,8 +6,11 @@ sidebar_position: 1.8
This is the list of changes to the Mistral API.
+July 18, 2024
+- We released Mistral NeMo (`open-mistral-nemo`).
+
July 16, 2024
-- We released Codestral-Mamba and Mathstral.
+- We released Codestral Mamba (`open-codestral-mamba`) and Mathstral.
Jun 5, 2024
- We released fine-tuning API. Check out the [capability docs](/capabilities/finetuning/) and [guides](/guides/finetuning/).
diff --git a/docs/getting-started/introduction.mdx b/docs/getting-started/introduction.mdx
index 18fd855..9404e97 100644
--- a/docs/getting-started/introduction.mdx
+++ b/docs/getting-started/introduction.mdx
@@ -17,6 +17,9 @@ We release both open source and commercial models, driving innovation and conven
- Mistral 7b, our first dense model released [September 2023](https://mistral.ai/news/announcing-mistral-7b/)
- Mixtral 8x7b, our first sparse mixture-of-experts released [December 2023](https://mistral.ai/news/mixtral-of-experts/)
- Mixtral 8x22b, our best open source model to date released [April 2024](https://mistral.ai/news/mixtral-8x22b/)
+- Mathstral 7b, our first math-focused open source model released [July 2024](https://mistral.ai/news/mathstral/)
+- Codestral Mamba 7b, our first Mamba 2 open source model released [July 2024](https://mistral.ai/news/codestral-mamba/)
+- Mistral NeMo 12b, our best multilingual open source model released [July 2024](https://mistral.ai/news/mistral-nemo/)
### Commercial
diff --git a/docs/getting-started/models.mdx b/docs/getting-started/models.mdx
index 7d3606a..f46d383 100644
--- a/docs/getting-started/models.mdx
+++ b/docs/getting-started/models.mdx
@@ -21,8 +21,9 @@ They are ideal for customization, such as fine-tuning, due to their portability,
| Mistral Large || :heavy_check_mark: |Our flagship model that's ideal for complex tasks that require large reasoning capabilities or are highly specialized (Synthetic Text Generation, Code Generation, RAG, or Agents). Learn more on our [blog post](https://mistral.ai/news/mistral-large/)| 32k | `mistral-large-latest`|
| Mistral Embeddings ||:heavy_check_mark: | A model that converts text into numerical vectors of embeddings in 1024 dimensions. Embedding models enable retrieval and retrieval-augmented generation applications. It achieves a retrieval score of 55.26 on MTEB | 8k | `mistral-embed`|
| Codestral |:heavy_check_mark: <br/> MNPL|:heavy_check_mark: | A cutting-edge generative model that has been specifically designed and optimized for code generation tasks, including fill-in-the-middle and code completion | 32k | `codestral-latest`|
-| Codestral-Mamba | :heavy_check_mark: | :heavy_check_mark: | A Mamba 2 language model specialized in code generation. Learn more on our [blog post](https://mistral.ai/news/codestral-mamba/) | 256k | `codestral-mamba-latest`|
-| Mathstral | :heavy_check_mark: | :heavy_check_mark: | A math-specific 7B model designed for math reasoning and scientific tasks. Learn more on our [blog post](https://mistral.ai/news/mathstral/) | 32k | NA|
+| Codestral Mamba | :heavy_check_mark: <br/> Apache2 | :heavy_check_mark: | A Mamba 2 language model specialized in code generation. Learn more on our [blog post](https://mistral.ai/news/codestral-mamba/) | 256k | `open-codestral-mamba`|
+| Mathstral | :heavy_check_mark: <br/> Apache2 | | A math-specific 7B model designed for math reasoning and scientific tasks. Learn more on our [blog post](https://mistral.ai/news/mathstral/) | 32k | NA|
+| Mistral NeMo | :heavy_check_mark: <br/> Apache2 | :heavy_check_mark: | A 12B model built in partnership with Nvidia. It is easy to use and a drop-in replacement for any system using Mistral 7B, which it supersedes. Learn more on our [blog post](https://mistral.ai/news/mistral-nemo/) | 128k | `open-mistral-nemo`|
## Pricing
@@ -36,18 +37,12 @@ it is recommended to use the dated versions of the Mistral AI API.
Additionally, be prepared for the deprecation of certain endpoints in the coming months.
Here are the details of the available versions:
-- `open-mistral-7b`: currently points to `mistral-tiny-2312`.
-It used to be called `mistral-tiny`, which will be deprecated shortly.
-- `open-mixtral-8x7b`: currently points to `mistral-small-2312`.
-It used to be called `mistral-small`, which will be deprecated shortly.
-- `open-mixtral-8x22b` points to `open-mixtral-8x22b-2404`.
- `mistral-small-latest`: currently points to `mistral-small-2402`.
- `mistral-medium-latest`: currently points to `mistral-medium-2312`.
The previous `mistral-medium` has been dated and tagged as `mistral-medium-2312`.
Mistral Medium will be deprecated shortly.
- `mistral-large-latest`: currently points to `mistral-large-2402`.
- `codestral-latest`: currently points to `codestral-2405`.
-- `codestral-mamba-latest`: currently points to `codestral-mamba-2407`.
## Benchmarks results
Mistral ranks second among all models generally available through an API.
@@ -64,6 +59,8 @@ It can be used for complex multilingual reasoning tasks, including text understa
- [Codestral](https://mistral.ai/news/codestral/): as a 22B model, Codestral sets a new standard on the performance/latency space for code generation compared to previous models used for coding.
- [Codestral-Mamba](https://mistral.ai/news/codestral-mamba/): we have trained this model with advanced code and reasoning capabilities, enabling the model to have a strong performance on par with SOTA transformer-based models.
- [Mathstral](https://mistral.ai/news/mathstral/): Mathstral stands on the shoulders of Mistral 7B and specialises in STEM subjects. It achieves state-of-the-art reasoning capacities in its size category across various industry-standard benchmarks.
+- [Mistral NeMo](https://mistral.ai/news/mistral-nemo/): Mistral NeMo's reasoning, world knowledge, and coding performance are state-of-the-art in its size category. As it relies on a standard architecture, Mistral NeMo is easy to use and a drop-in replacement for any system using Mistral 7B, which it supersedes.
+
## Picking a model
diff --git a/docs/guides/tokenization.mdx b/docs/guides/tokenization.mdx
index 866c9d8..d477bdd 100644
--- a/docs/guides/tokenization.mdx
+++ b/docs/guides/tokenization.mdx
@@ -33,25 +33,160 @@ We have released three versions of our tokenizers powering different sets of mod
- v1: `open-mistral-7b`, `open-mixtral-8x7b`, `mistral-embed`
- v2: `mistral-small-latest`, `mistral-large-latest`
- v3: `open-mixtral-8x22b`
+- v3 (tekken): `open-mistral-nemo`
+
+This guide will focus on our latest v3 (tekken) tokenizer as well as our v3 tokenizer.
-Below is an example of 3 turns of conversations using v3, showcasing how the control tokens are structured around the available tools, user message, tool calls and tool results. This guide will focus on our latest v3 tokenizer.
-
+## v3 (tekken) tokenizer
-## Building the vocabulary
-There are several tokenization methods used in Natural Language Processing (NLP) to convert raw text into tokens such as word-level tokenization, character-level tokenization, and subword-level tokenization including the Byte-Pair Encoding (BPE). Our tokenizers use the Byte-Pair Encoding (BPE) with SentencePiece, which is an open-source tokenization library to build our tokenization vocabulary.
+There are several tokenization methods used in Natural Language Processing (NLP) to convert raw text into tokens, such as word-level tokenization, character-level tokenization, and subword-level tokenization, including Byte-Pair Encoding (BPE).
+Our newest tokenizer, tekken, uses Byte-Pair Encoding (BPE) with [Tiktoken](https://github.com/openai/tiktoken).
+
+
+Tekken was trained on more than 100 languages and compresses natural language text and
+source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.
+In particular, it is ~30% more efficient at compressing source code and text in Chinese, Italian,
+French, German, Spanish, and Russian. It is also 2x and 3x more efficient at compressing
+Korean and Arabic, respectively. Compared to the Llama 3 tokenizer,
+Tekken proved more proficient in compressing text for approximately 85% of all languages.
+
+![Tekken compression ratio comparison with other tokenizers](/img/guides/tokenization3.png)
+
+### Our tokenization vocabulary
+Our tokenization vocabulary is released in the https://github.com/mistralai/mistral-common/tree/main/tests/data folder. Let’s take a look at the vocabulary of our v3 tekken tokenizer.
+
+#### Vocabulary size
+Our vocabulary consists of 130k vocab + 1k control tokens. We can use up to 131k tokens, and we currently use 128k tokens.
+
+#### Control tokens
+Our vocabulary starts with 14 control tokens, which are special tokens we use in the encoding process to represent specific instructions or indicators:
+
+```
+<unk>
+<s>
+</s>
+[INST]
+[/INST]
+[AVAILABLE_TOOLS]
+[/AVAILABLE_TOOLS]
+[TOOL_RESULTS]
+[/TOOL_RESULTS]
+[TOOL_CALLS]
+<pad>
+[PREFIX]
+[MIDDLE]
+[SUFFIX]
+```
+
+The tokenizer does not encode control tokens, which helps prevent a situation known as prompt injection. For example, the control token “[INST]” is used to denote a user message (see the sketch after the list below):
+- Without the control tokens, the tokenizer treats “[INST]” as a regular string and encodes the entire sequence “[INST] I love Paris [/INST]”. This could potentially allow users to include "[INST]" and "[/INST]" tags within their message, causing confusion for the model as it might interpret part of the user's message as an assistant's message.
+- With the control tokens, the tokenizer instead concatenates the control tokens with the encoded message: [INST] + encode(“I love Paris”) + [/INST]. This ensures that only the user's message gets encoded, and the encoded messages are guaranteed to have the correct [INST] and [/INST] tags.
+
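+To make the second point concrete, here is a minimal conceptual sketch. The token IDs, the `encode` stand-in, and the helper names below are hypothetical and only mirror the structure described above; this is not the mistral_common API.
+
+```py
+# Conceptual sketch only -- hypothetical IDs and helpers, not the mistral_common API.
+CONTROL_TOKEN_IDS = {"[INST]": 3, "[/INST]": 4}  # dedicated IDs, never produced by encode()
+
+def encode(text: str) -> list[int]:
+    # Stand-in for the real BPE encoder: it only emits regular vocabulary IDs,
+    # so the string "[INST]" typed inside a message is just ordinary text.
+    return [1000 + b for b in text.encode("utf-8")]
+
+def render_user_turn(message: str) -> list[int]:
+    # [INST] + encode(message) + [/INST]: the markers are concatenated as control-token IDs.
+    return [CONTROL_TOKEN_IDS["[INST]"]] + encode(message) + [CONTROL_TOKEN_IDS["[/INST]"]]
+
+print(encode("[INST] I love Paris"))         # no control IDs: "[INST]" is just ordinary characters
+print(render_user_turn("I love Paris")[:3])  # [3, ...]: the real [INST] marker is a single control ID
+```
+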
+You may have noticed that we have 1000 slots for control tokens. The remaining 1000-14=986 slots are currently empty, leaving room for us to add more control tokens in the future and ensuring our vocabulary size is 131k (2\^17). Computers like powers of 2!
+
+#### Bytes, characters, and merged characters
+
+Below are two examples from the vocabulary. `token_str` is null when the byte sequence doesn't decode into a full Unicode character, e.g., raw bytes.
+```
+{
+ "rank": 0,
+ "token_bytes": "AA==",
+ "token_str": "\u0000"
+},
+...
+{
+ "rank": 7613,
+ "token_bytes": "IO2D",
+ "token_str": null
+},
+```
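+
+To see how `token_bytes` relates to `token_str`, here is a small sketch. It only assumes standard base64 and UTF-8 decoding and does not depend on mistral-common.
+
+```py
+import base64
+
+def token_str_from_bytes(token_bytes_b64: str):
+    raw = base64.b64decode(token_bytes_b64)
+    try:
+        return raw.decode("utf-8")  # decodes into full unicode character(s)
+    except UnicodeDecodeError:
+        return None                 # raw bytes only, so token_str is null
+
+print(repr(token_str_from_bytes("AA==")))  # '\x00', matching rank 0 above
+print(token_str_from_bytes("IO2D"))        # None: b' \xed\x83' is an incomplete UTF-8 sequence
+```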
+
+### Run our tokenizer in Python
+To get started, let’s first install our tokenizer and tiktoken via `pip install mistral-common tiktoken`.
+
+Once the tokenizer is installed, in a Python environment, we can import the needed modules from `mistral_common`.
+
+```py
+from mistral_common.protocol.instruct.messages import (
+ UserMessage,
+)
+from mistral_common.protocol.instruct.request import ChatCompletionRequest
+from mistral_common.protocol.instruct.tool_calls import (
+ Function,
+ Tool,
+)
+from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
+```
+
+We can then load our tokenizer.
+```py
+# Load the v3 (tekken) tokenizer directly
+tokenizer = MistralTokenizer.v3(is_tekken=True)
+
+# Alternatively, load the tokenizer from a model name
+model_name = "nemostral"
+tokenizer = MistralTokenizer.from_model(model_name)
+```
+
+Let’s tokenize a conversation with different types of messages.
+```py
+# Tokenize a list of messages
+tokenized = tokenizer.encode_chat_completion(
+ ChatCompletionRequest(
+ tools=[
+ Tool(
+ function=Function(
+ name="get_current_weather",
+ description="Get the current weather",
+ parameters={
+ "type": "object",
+ "properties": {
+ "location": {
+ "type": "string",
+ "description": "The city and state, e.g. San Francisco, CA",
+ },
+ "format": {
+ "type": "string",
+ "enum": ["celsius", "fahrenheit"],
+ "description": "The temperature unit to use. Infer this from the users location.",
+ },
+ },
+ "required": ["location", "format"],
+ },
+ )
+ )
+ ],
+ messages=[
+ UserMessage(content="What's the weather like today in Paris"),
+ ],
+ model=model_name,
+ )
+)
+tokens, text = tokenized.tokens, tokenized.text
+```
+
+Here is the output of `text`, which is a debug representation for you to inspect.
+
+```
+[AVAILABLE_TOOLS][{"type": "function", "function": {"name": "get_current_weather", "description": "Get the current weather", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"}, "format": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The temperature unit to use. Infer this from the users location."}}, "required": ["location", "format"]}}}][/AVAILABLE_TOOLS][INST]What's the weather like today in Paris[/INST]
+```
+
+To count the number of tokens, run `len(tokens)`; in this example, we get 128 tokens.
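+
+If you want to see how much of that count comes from the tool schema, you can re-encode the same request without the tool definition. This sketch reuses the objects defined above; the exact counts will vary.
+
+```py
+# Same user message, but without the tool definition: the token count drops,
+# since no [AVAILABLE_TOOLS] section is added to the prompt.
+tokenized_no_tools = tokenizer.encode_chat_completion(
+    ChatCompletionRequest(
+        messages=[
+            UserMessage(content="What's the weather like today in Paris"),
+        ],
+        model=model_name,
+    )
+)
+print(len(tokenized_no_tools.tokens))  # fewer than the 128 tokens above
+```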
+
+## v3 tokenizer
+
+Our v3 tokenizer uses Byte-Pair Encoding (BPE) with SentencePiece, an open-source tokenization library, to build our tokenization vocabulary.
In BPE, the tokenization process starts by treating each byte in a text as a separate token.
Then, it iteratively adds new tokens to the vocabulary for the most frequent pair of tokens currently appearing in the corpus. For example, if the most frequent pair of tokens is "th" + "e", then a new token "the" will be created and occurrences of "th"+"e" will be replaced with the new token "the". This process continues until no more replacements can be made.
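+
+As a rough illustration of the mechanics described above, here is a toy sketch of a single BPE merge step over a tiny corpus. It is only illustrative and is not the actual training code behind our tokenizers.
+
+```py
+from collections import Counter
+
+# Toy corpus: each word starts as a sequence of single-character tokens.
+corpus = [list("the"), list("then"), list("there")]
+
+def most_frequent_pair(seqs):
+    pairs = Counter()
+    for seq in seqs:
+        pairs.update(zip(seq, seq[1:]))
+    return pairs.most_common(1)[0][0]
+
+def merge_pair(seqs, pair):
+    merged = pair[0] + pair[1]
+    out = []
+    for seq in seqs:
+        new_seq, i = [], 0
+        while i < len(seq):
+            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
+                new_seq.append(merged)   # replace the pair with the new merged token
+                i += 2
+            else:
+                new_seq.append(seq[i])
+                i += 1
+        out.append(new_seq)
+    return out
+
+pair = most_frequent_pair(corpus)   # ('t', 'h') is the most frequent adjacent pair here
+corpus = merge_pair(corpus, pair)   # "th" is now a single token wherever it occurs
+print(pair, corpus)
+```
+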
-## Our tokenization vocabulary
+### Our tokenization vocabulary
Our tokenization vocabulary is released in the https://github.com/mistralai/mistral-common/tree/main/tests/data folder. Let’s take a look at the vocabulary of our v3 tokenizer.
-### Vocabulary size
+#### Vocabulary size
Our vocabulary consists of 32k vocab + 768 control tokens. The 32k vocab includes 256 bytes and 31,744 characters and merged characters.
-### Control tokens
+#### Control tokens
Our vocabulary starts with 10 control tokens, which are special tokens we use in the encoding process to represent specific instructions or indicators:
```
@@ -67,14 +202,7 @@ Our vocabulary starts with 10 control tokens, which are special tokens we use in
[/TOOL_RESULTS]
```
-
-The tokenizer does not encode control tokens, which help prevent a situation known as prompt injection. For example, the control token “[INST]” is used to denote user message:
-- Without the control tokens, the tokenizer treats “[INST]” as a regular string and encodes the entire sequence “[INST] I love Paris [/INST]”. This could potentially allow users to include "[INST]" and "[/INST]" tags within their message, causing confusion for the model as it might interpret part of the user's message as an assistant's message.
-- With the control tokens, the tokenizer instead concatenates the control tokens with the encoded message: [INST] + encode(“I love Paris”) + [/INST]. This ensures that only the user's message gets encoded, and the encoded messages are guaranteed to have the correct [INST] and [/INST] tags.
-
-You may have noticed that we have 768 slots for control tokens. The remaining 761 slots for control tokens are actually empty for us to add more control tokens in the future and also ensure our vocabulary size is 32,768 (2\^15). Computers like powers of 2s!
-
-### Bytes
+#### Bytes
After the control token slots, we have 256 bytes in the vocabulary. A byte is a unit of digital information that consists of 8 bits. Each bit can represent one of two values, either 0 or 1. A byte can therefore represent 256 different values.
```
@@ -85,7 +213,7 @@ After the control token slots, we have 256 bytes in the vocabulary. A byte is a
Any character, regardless of the language or symbol, can be represented by a sequence of one or more bytes. When a word is not present in the vocabulary, it can still be represented by the bytes that correspond to its individual characters. This is important for handling unknown words and characters.
-### Characters and merged characters
+#### Characters and merged characters
And finally, we have the characters and merged characters in the vocabulary. The order of the tokens are determined by the frequency of these tokens in the data that was used to train the model, with the most frequent ones in the beginning of the vocabulary. For example, two spaces “▁”, four spaces “▁▁▁▁”, “_t”, “in”, and “er” were found to be the most common tokens we trained on. As we move further down the vocabulary list, the tokens become less frequent. Towards the end of the vocabulary file, you might find less common characters such as Chinese and Korean characters. These characters are less frequent because they were encountered less often in the training data, not because they are less used in general.
```
@@ -100,7 +228,7 @@ er
梦
```
-## Run our tokenizer in Python
+### Run our tokenizer in Python
To get started, let’s first install our tokenizer via `pip install mistral-common`.
diff --git a/static/img/guides/tokenization3.png b/static/img/guides/tokenization3.png
new file mode 100644
index 0000000..78732ad
Binary files /dev/null and b/static/img/guides/tokenization3.png differ