69 changes: 33 additions & 36 deletions docs/ai/conceptual/understanding-tokens.md
@@ -4,34 +4,31 @@ description: "Understand how large language models (LLMs) use tokens to analyze
author: haywoodsloan
ms.topic: concept-article
ms.date: 12/19/2024

#customer intent: As a .NET developer, I want to understand how large language models (LLMs) use tokens so I can add semantic analysis and text generation capabilities to my .NET projects.

---

# Understand tokens

Tokens are words, character sets, or combinations of words and punctuation that are generated by large language models (LLMs) when they decompose text. Tokenization is the first step in training. The LLM analyzes the semantic relationships between tokens, such as how commonly they're used together or whether they're used in similar contexts. After training, the LLM uses those patterns and relationships to generate a sequence of output tokens based on the input sequence.

## Turn text into tokens

The set of unique tokens that an LLM is trained on is known as its _vocabulary_.

For example, consider the following sentence:

> `I heard a dog bark loudly at a cat`

This text could be tokenized as:

- `I`
- `heard`
- `a`
- `dog`
- `bark`
- `loudly`
- `at`
- `a`
- `cat`

Given a sufficiently large set of training text, the tokenization process can build a vocabulary of many thousands of tokens.
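
Because this article is aimed at .NET developers, a minimal C# sketch can make the idea concrete. It's illustrative only, not how a production tokenizer works; the whitespace split and the sample sentence are assumptions for demonstration.

```csharp
using System;
using System.Collections.Generic;

string text = "I heard a dog bark loudly at a cat";

// Naive word-level tokenization: split the text on whitespace.
string[] tokens = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

// The vocabulary is the set of unique tokens encountered during training.
var vocabulary = new HashSet<string>(tokens);

Console.WriteLine($"Tokens: {string.Join(" | ", tokens)}"); // 9 tokens
Console.WriteLine($"Vocabulary size: {vocabulary.Count}");  // 8 unique tokens ("a" appears twice)
```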

@@ -47,37 +44,37 @@ For example, the GPT models, developed by OpenAI, use a type of subword tokeniza

There are benefits and disadvantages to each tokenization method:

| Token size | Pros | Cons |
|----------------------------------------------------|------|------|
| Smaller tokens (character or subword tokenization) | - Enables the model to handle a wider range of inputs, such as unknown words, typos, or complex syntax.<br>- Might allow the vocabulary size to be reduced, requiring fewer memory resources. | - A given text is broken into more tokens, requiring additional computational resources while processing.<br>- Given a fixed token limit, the maximum size of the model's input and output is smaller. |
| Larger tokens (word tokenization) | - A given text is broken into fewer tokens, requiring fewer computational resources while processing.<br>- Given the same token limit, the maximum size of the model's input and output is larger. | - Might cause an increased vocabulary size, requiring more memory resources.<br>- Can limit the model's ability to handle unknown words, typos, or complex syntax. |
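
To make the trade-off concrete, the following sketch compares token counts for the same sentence under character-level and word-level tokenization. It's a rough illustration; real subword tokenizers fall somewhere in between.

```csharp
using System;
using System.Linq;

string text = "I heard a dog bark loudly at a cat";

// Character-level tokenization: each non-space character is a token.
int characterTokens = text.Count(c => !char.IsWhiteSpace(c));

// Word-level tokenization: split on whitespace.
int wordTokens = text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

Console.WriteLine($"Character tokens: {characterTokens}"); // 26: more tokens, but a small vocabulary
Console.WriteLine($"Word tokens: {wordTokens}");           // 9: fewer tokens, but a larger vocabulary
```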

## How LLMs use tokens

After the LLM completes tokenization, it assigns an ID to each unique token.

Consider the example sentence:

> `I heard a dog bark loudly at a cat`

Using a word tokenization method, the model could assign token IDs as follows:

- `I` (1)
- `heard` (2)
- `a` (3)
- `dog` (4)
- `bark` (5)
- `loudly` (6)
- `at` (7)
- `a` (the "a" token is already assigned an ID of 3)
- `cat` (8)

By assigning IDs, text can be represented as a sequence of token IDs. The example sentence would be represented as [1, 2, 3, 4, 5, 6, 7, 3, 8]. The sentence "I heard a cat" would be represented as [1, 2, 3, 8].
By assigning IDs, text can be represented as a sequence of token IDs. The example sentence would be represented as [1, 2, 3, 4, 5, 6, 7, 3, 8]. The sentence "`I heard a cat`" would be represented as [1, 2, 3, 8].
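
The following C# sketch reproduces this ID assignment and encoding. The dictionary-based approach is purely illustrative; a real model's vocabulary and IDs are fixed during training.

```csharp
using System;
using System.Collections.Generic;

// Assign an ID to each unique token in the order it's first seen.
var vocabulary = new Dictionary<string, int>();

List<int> Encode(string text)
{
    var ids = new List<int>();
    foreach (string token in text.Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        // Reuse the existing ID if the token is already in the vocabulary.
        if (!vocabulary.TryGetValue(token, out int id))
        {
            id = vocabulary.Count + 1;
            vocabulary[token] = id;
        }
        ids.Add(id);
    }
    return ids;
}

Console.WriteLine(string.Join(", ", Encode("I heard a dog bark loudly at a cat"))); // 1, 2, 3, 4, 5, 6, 7, 3, 8
Console.WriteLine(string.Join(", ", Encode("I heard a cat")));                      // 1, 2, 3, 8
```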

As training continues, the model adds any new tokens in the training text to its vocabulary and assigns each one an ID. For example:

- `meow` (9)
- `run` (10)

The semantic relationships between the tokens can be analyzed by using these token ID sequences. Multi-valued numeric vectors, known as [embeddings](embeddings.md), are used to represent these relationships. An embedding is assigned to each token based on how commonly it's used together with, or in similar contexts to, the other tokens.
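
One common way to compare embeddings is cosine similarity, where values close to 1 indicate tokens that appear in similar contexts. The sketch below assumes embeddings are available as float arrays; the three-dimensional vectors are invented for illustration, since real embeddings have hundreds or thousands of dimensions.

```csharp
using System;

// Cosine similarity: dot product of the vectors divided by the product of their magnitudes.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, magnitudeA = 0, magnitudeB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magnitudeA += a[i] * a[i];
        magnitudeB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
}

// Hypothetical embeddings for three tokens.
float[] dog = { 0.9f, 0.1f, 0.3f };
float[] cat = { 0.8f, 0.2f, 0.35f };
float[] loudly = { 0.1f, 0.9f, 0.7f };

Console.WriteLine($"dog vs. cat: {CosineSimilarity(dog, cat):F2}");       // high: used in similar contexts
Console.WriteLine($"dog vs. loudly: {CosineSimilarity(dog, loudly):F2}"); // lower: used in different contexts
```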

@@ -91,9 +88,9 @@ Output generation is an iterative operation. The model appends the predicted tok

LLMs have limits on the maximum number of tokens that can be used as input or generated as output. This limit often applies to the input and output tokens combined, a span known as the model's maximum context window. Taken together, a model's token limit and tokenization method determine the maximum length of text that can be provided as input or generated as output.

For example, consider a model that has a maximum context window of 100 tokens. The model processes the example sentence as input text:

> `I heard a dog bark loudly at a cat`

With a word-based tokenization method, the input is nine tokens, which leaves 91 **word** tokens available for the output.
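
The arithmetic is simple, but a short sketch makes it explicit. The 100-token window is the hypothetical limit from the example, and the token count uses the naive word split from earlier.

```csharp
using System;

const int maxContextWindow = 100; // hypothetical limit from the example

string input = "I heard a dog bark loudly at a cat";

// Word-based tokenization: nine input tokens.
int inputTokens = input.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

// Input and output tokens share the same context window.
int availableOutputTokens = maxContextWindow - inputTokens;

Console.WriteLine($"Input tokens: {inputTokens}");                      // 9
Console.WriteLine($"Available output tokens: {availableOutputTokens}"); // 91
```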

@@ -107,6 +104,6 @@ Generative AI services might also be limited regarding the maximum number of tok

## Related content

- [How generative AI and LLMs work](how-genai-and-llms-work.md)
- [Understand embeddings](embeddings.md)
- [Work with vector databases](vector-databases.md)