.Net: Add new token counter implementations to TextChunker (#2840)
Implement MicrosoftML and DeepDev token counters in the TextChunker
example. Update the project file with new package references and modify
the RunExampleWithCustomTokenCounter method to support different token
counter types.
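
The selection pattern described above can be sketched without any tokenizer packages. This is a minimal illustration, not the code from this commit: the `TokenCounter` delegate below is a local stand-in for `TextChunker.TokenCounter`, and `SketchCounterKind`, `TokenCounterSketch`, and both counting heuristics are hypothetical placeholders for the real SharpToken, Microsoft.ML.Tokenizers, and DeepDev counters.

```csharp
using System;

// Local stand-in (assumption) for TextChunker.TokenCounter: maps a string to a token count.
public delegate int TokenCounter(string input);

// Hypothetical counter kinds used only in this sketch.
public enum SketchCounterKind
{
    Whitespace,
    CharsDividedByFour,
}

public static class TokenCounterSketch
{
    private static readonly char[] s_whitespace = { ' ', '\t', '\r', '\n' };

    // Counts whitespace-separated words; a crude placeholder for a real tokenizer.
    private static int CountWhitespaceTokens(string input)
    {
        return input.Split(s_whitespace, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Rough heuristic of about four characters per token, rounded up.
    private static int CountByLength(string input)
    {
        return (input.Length + 3) / 4;
    }

    // Returns the counter for the requested kind, mirroring a factory keyed by an enum.
    public static TokenCounter Create(SketchCounterKind kind)
    {
        switch (kind)
        {
            case SketchCounterKind.Whitespace:
                return CountWhitespaceTokens;
            case SketchCounterKind.CharsDividedByFour:
                return CountByLength;
            default:
                throw new ArgumentOutOfRangeException(nameof(kind), kind, null);
        }
    }
}
```

Calling `TokenCounterSketch.Create(SketchCounterKind.Whitespace)("The city of Venice")` returns 4 under these assumptions. A delegate obtained this way could be passed wherever a chunking method accepts a token counter, which keeps the chunking call itself unchanged when the counter is swapped.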

Inspired by #2809 

Fixes #478 

| Iteration | MicrosoftML (ms) | MicrosoftMLRoberta (ms) | SharpToken (ms) | DeepDev (ms) |
|------------|-------------------|--------------------------|-----------------|--------------|
| 1 | 38 | 10,189 | 14,305 | 16,701 |
| 2 | 36 | 5,581 | 8,381 | 14,214 |
| 3 | 13 | 5,354 | 7,955 | 13,630 |
| 4 | 27 | 5,679 | 9,156 | 16,164 |
| 5 | 16 | 5,158 | 8,657 | 17,276 |
| Average | 26.0 | 7,512.2 | 9,710.8 | 15,597 |




### SharpToken
<sup style="font-size: smaller;">(Avg. Execution Time: 9,710.8 ms)</sup>
```
The city of Venice, located in the northeastern part of Italy,
is renowned for its unique geographical features. Built on more than 100 small islands in a lagoon in the
Adriatic Sea, it has no roads, just canals including the Grand Canal thoroughfare lined with Renaissance and
Gothic palaces. The central square, Piazza San Marco, contains St. Mark's Basilica, which is tiled with Byzantine
mosaics, and the Campanile bell tower offering views of the city's red roofs.
------------------------
The Amazon Rainforest, also known as Amazonia, is a moist broadleaf tropical rainforest in the Amazon biome that
covers most of the Amazon basin of South America. This basin encompasses 7 million square kilometers, of which
5.5 million square kilometers are covered by the rainforest. This region includes territory belonging to nine nations
and 3.4 million square kilometers of uncontacted tribes. The Amazon represents over half of the planet's remaining
rainforests and comprises the largest and most biodiverse tract of tropical rainforest in the world.
------------------------
The Great Barrier Reef is the world's largest coral reef system composed of over 2,900 individual reefs and 900 islands
stretching for over 2,300 kilometers over an area of approximately 344,400 square kilometers. The reef is located in the
Coral Sea, off the coast of Queensland, Australia. The Great Barrier Reef can be seen from outer space and is the world's
biggest single structure made by living organisms. This reef structure is composed of and built by billions of tiny organisms,
known as coral polyps.
```

### MicrosoftML
<sup style="font-size: smaller;">(Avg. Execution Time: 26.0 ms)</sup>
```

The city of Venice,
located in the northeastern part of Italy,
is renowned for its unique
geographical features.
Built on more than 100 small
------------------------
islands in a lagoon in the
Adriatic Sea, it has no roads,
just canals including the Grand Canal
thoroughfare lined with Renaissance and
------------------------
Gothic palaces.
The central square,
Piazza San Marco, contains St.
Mark's Basilica, which is tiled with Byzantine
mosaics,
------------------------
and the Campanile bell tower offering
views of the city's red roofs.
The Amazon Rainforest, also known as Amazonia,
------------------------
is a moist broadleaf tropical
rainforest in the Amazon biome that
covers most of the Amazon
basin of South America.
This basin encompasses 7
------------------------
million square kilometers,
of which
5.
5 million square kilometers
are covered by the rainforest.
This region includes territory
------------------------
belonging to nine nations
and 3.
4 million square kilometers
of uncontacted tribes.
The Amazon represents over
------------------------
half of the planet's remaining
rainforests and comprises the largest and most
biodiverse tract of tropical
rainforest in the world.
------------------------
The Great Barrier Reef is the world's
largest coral reef system composed of over 2,
900 individual reefs and 900 islands
------------------------
stretching for over 2,
300 kilometers over an
area of approximately 344,
400 square kilometers.
The reef is located in the
------------------------
Coral Sea, off the coast of Queensland,
Australia.
The Great Barrier Reef can be seen
from outer space and is the world's
------------------------
biggest single structure
made by living organisms.
This reef structure is composed of and
built by billions of tiny organisms, known as coral polyps.
```

### MicrosoftMLRoberta
<sup style="font-size: smaller;">(Avg. Execution Time: 7,512.2 ms)</sup>
```
The city of Venice, located in the northeastern part of Italy,
is renowned for its unique geographical features. Built on more than 100 small islands in a lagoon in the
Adriatic Sea, it has no roads, just canals including the Grand Canal thoroughfare lined with Renaissance and
Gothic palaces. The central square, Piazza San Marco, contains St. Mark's Basilica, which is tiled with Byzantine
mosaics, and the Campanile bell tower offering views of the city's red roofs.
------------------------
The Amazon Rainforest, also known as Amazonia, is a moist broadleaf tropical rainforest in the Amazon biome that
covers most of the Amazon basin of South America. This basin encompasses 7 million square kilometers, of which
5.5 million square kilometers are covered by the rainforest. This region includes territory belonging to nine nations
and 3.4 million square kilometers of uncontacted tribes. The Amazon represents over half of the planet's remaining
rainforests and comprises the largest and most biodiverse tract of tropical rainforest in the world.
------------------------
The Great Barrier Reef is the world's largest coral reef system composed of over 2,900 individual reefs and 900 islands
stretching for over 2,300 kilometers over an area of approximately 344,400 square kilometers. The reef is located in the
Coral Sea, off the coast of Queensland, Australia. The Great Barrier Reef can be seen from outer space and is the world's
biggest single structure made by living organisms. This reef structure is composed of and built by billions of tiny organisms,
known as coral polyps.
```

### DeepDev
<sup style="font-size: smaller;">(Avg. Execution Time: 15,597 ms)</sup>
```
The city of Venice, located in the northeastern part of Italy,
is renowned for its unique geographical features. Built on more than 100 small islands in a lagoon in the
Adriatic Sea, it has no roads, just canals including the Grand Canal thoroughfare lined with Renaissance and
Gothic palaces. The central square, Piazza San Marco, contains St. Mark's Basilica, which is tiled with Byzantine
mosaics, and the Campanile bell tower offering views of the city's red roofs.
------------------------
The Amazon Rainforest, also known as Amazonia, is a moist broadleaf tropical rainforest in the Amazon biome that
covers most of the Amazon basin of South America. This basin encompasses 7 million square kilometers, of which
5.5 million square kilometers are covered by the rainforest. This region includes territory belonging to nine nations
and 3.4 million square kilometers of uncontacted tribes. The Amazon represents over half of the planet's remaining
rainforests and comprises the largest and most biodiverse tract of tropical rainforest in the world.
------------------------
The Great Barrier Reef is the world's largest coral reef system composed of over 2,900 individual reefs and 900 islands
stretching for over 2,300 kilometers over an area of approximately 344,400 square kilometers. The reef is located in the
Coral Sea, off the coast of Queensland, Australia. The Great Barrier Reef can be seen from outer space and is the world's
biggest single structure made by living organisms. This reef structure is composed of and built by billions of tiny organisms,
known as coral polyps.
```

### Contribution Checklist

<!-- Before submitting this PR, please make sure: -->

- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows the [SK Contribution
Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
and the [pre-submission formatting
script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts)
raises no violations
- [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄
lemillermicrosoft authored Sep 22, 2023
1 parent 544b6c1 commit 4a2cf70
Showing 8 changed files with 150,639 additions and 10 deletions.
2 changes: 2 additions & 0 deletions .github/_typos.toml
@@ -10,6 +10,8 @@ extend-exclude = [
"_typos.toml",
"package-lock.json",
"*.bicep",
"encoder.json",
"vocab.bpe",
"CodeTokenizerTests.cs",
"test_code_tokenizer.py",
]
5 changes: 4 additions & 1 deletion dotnet/Directory.Packages.props
@@ -14,10 +14,13 @@
<PackageVersion Include="Microsoft.Bcl.AsyncInterfaces" Version="6.0.0" />
<PackageVersion Include="Microsoft.Extensions.Http" Version="6.0.0" />
<PackageVersion Include="Polly" Version="7.2.4" />
<PackageVersion Include="SharpToken" Version="1.2.12" />
<PackageVersion Include="System.Diagnostics.DiagnosticSource" Version="6.0.1" />
<PackageVersion Include="System.Linq.Async" Version="6.0.1" />
<PackageVersion Include="System.Text.Json" Version="6.0.8" />
<!-- Tokenizers -->
<PackageVersion Include="Microsoft.ML.Tokenizers" Version="0.21.0-preview.23266.6" />
<PackageVersion Include="Microsoft.DeepDev.TokenizerLib" Version="1.3.2" />
<PackageVersion Include="SharpToken" Version="1.2.12" />
<!-- Microsoft.Extensions.Logging -->
<PackageVersion Include="Microsoft.Extensions.Logging.Abstractions" Version="6.0.4" />
<PackageVersion Include="Microsoft.Extensions.Logging.Console" Version="6.0.0" />
104 changes: 97 additions & 7 deletions dotnet/samples/KernelSyntaxExamples/Example55_TextChunker.cs
@@ -2,9 +2,15 @@

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using Microsoft.DeepDev;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel.Text;
using Resources;
using SharpToken;
using static Microsoft.SemanticKernel.Text.TextChunker;

// ReSharper disable once InconsistentNaming
public static class Example55_TextChunker
@@ -30,7 +36,10 @@ rainforests and comprises the largest and most biodiverse tract of tropical rain
public static Task RunAsync()
{
RunExample();
RunExampleWithCustomTokenCounter();
RunExampleForTokenCounterType(TokenCounterType.SharpToken);
RunExampleForTokenCounterType(TokenCounterType.MicrosoftML);
RunExampleForTokenCounterType(TokenCounterType.MicrosoftMLRoberta);
RunExampleForTokenCounterType(TokenCounterType.DeepDev);
RunExampleWithHeader();

return Task.CompletedTask;
@@ -46,13 +55,18 @@ private static void RunExample()
WriteParagraphsToConsole(paragraphs);
}

private static void RunExampleWithCustomTokenCounter()
private static void RunExampleForTokenCounterType(TokenCounterType counterType)
{
Console.WriteLine("=== Text chunking with a custom token counter ===");
Console.WriteLine($"=== Text chunking with a custom({counterType}) token counter ===");
var sw = new Stopwatch();
sw.Start();
var tokenCounter = s_tokenCounterFactory(counterType);

var lines = TextChunker.SplitPlainTextLines(Text, 40, CustomTokenCounter);
var paragraphs = TextChunker.SplitPlainTextParagraphs(lines, 120, tokenCounter: CustomTokenCounter);
var lines = TextChunker.SplitPlainTextLines(Text, 40, tokenCounter);
var paragraphs = TextChunker.SplitPlainTextParagraphs(lines, 120, tokenCounter: tokenCounter);

sw.Stop();
Console.WriteLine($"Elapsed time: {sw.ElapsedMilliseconds} ms");
WriteParagraphsToConsole(paragraphs);
}

@@ -79,11 +93,19 @@ private static void WriteParagraphsToConsole(List<string> paragraphs)
}
}

private enum TokenCounterType
{
SharpToken,
MicrosoftML,
DeepDev,
MicrosoftMLRoberta,
}

/// <summary>
/// Custom token counter implementation using SharpToken.
/// Note: SharpToken is used for demonstration purposes only, it's possible to use any available or custom tokenization logic.
/// </summary>
private static int CustomTokenCounter(string input)
private static TokenCounter SharpTokenTokenCounter => (string input) =>
{
// Initialize encoding by encoding name
var encoding = GptEncoding.GetEncoding("cl100k_base");
@@ -94,5 +116,73 @@ private static int CustomTokenCounter(string input)
var tokens = encoding.Encode(input);
return tokens.Count;
}
};

/// <summary>
/// MicrosoftML token counter implementation.
/// </summary>
private static TokenCounter MicrosoftMLTokenCounter => (string input) =>
{
Tokenizer tokenizer = new(new Bpe());
var tokens = tokenizer.Encode(input).Tokens;
return tokens.Count;
};

/// <summary>
/// MicrosoftML token counter implementation using Roberta and local vocab
/// </summary>
private static TokenCounter MicrosoftMLRobertaTokenCounter => (string input) =>
{
var encoder = EmbeddedResource.ReadStream("EnglishRoberta.encoder.json");
var vocab = EmbeddedResource.ReadStream("EnglishRoberta.vocab.bpe");
var dict = EmbeddedResource.ReadStream("EnglishRoberta.dict.txt");
if (encoder is null || vocab is null || dict is null)
{
throw new FileNotFoundException("Missing required resources");
}
EnglishRoberta model = new(encoder, vocab, dict);
model.AddMaskSymbol(); // Not sure what this does, but it's in the example
Tokenizer tokenizer = new(model, new RobertaPreTokenizer());
var tokens = tokenizer.Encode(input).Tokens;
return tokens.Count;
};

/// <summary>
/// DeepDev token counter implementation.
/// </summary>
private static TokenCounter DeepDevTokenCounter => (string input) =>
{
#pragma warning disable VSTHRD002 // Avoid problematic synchronous waits
// Initialize encoding by encoding name
var tokenizer = TokenizerBuilder.CreateByEncoderNameAsync("cl100k_base").GetAwaiter().GetResult();
#pragma warning restore VSTHRD002 // Avoid problematic synchronous waits
// Initialize encoding by model name
// var tokenizer = TokenizerBuilder.CreateByModelNameAsync("gpt-4").GetAwaiter().GetResult();
var tokens = tokenizer.Encode(input, new HashSet<string>());
return tokens.Count;
};

private static readonly Func<TokenCounterType, TokenCounter> s_tokenCounterFactory = (TokenCounterType counterType) =>
{
switch (counterType)
{
case TokenCounterType.SharpToken:
return (string input) => SharpTokenTokenCounter(input);
case TokenCounterType.MicrosoftML:
return (string input) => MicrosoftMLTokenCounter(input);
case TokenCounterType.DeepDev:
return (string input) => DeepDevTokenCounter(input);
case TokenCounterType.MicrosoftMLRoberta:
return (string input) => MicrosoftMLRobertaTokenCounter(input);
default:
throw new ArgumentOutOfRangeException(nameof(counterType), counterType, null);
}
};
}
@@ -27,6 +27,8 @@
<PackageReference Include="Newtonsoft.Json" />
<PackageReference Include="Polly" />
<PackageReference Include="SharpToken" />
<PackageReference Include="Microsoft.ML.Tokenizers" />
<PackageReference Include="Microsoft.DeepDev.TokenizerLib" />
<PackageReference Include="System.Linq.Async" />
</ItemGroup>
<ItemGroup>
@@ -55,7 +57,6 @@
<ProjectReference Include="..\..\src\Plugins\Plugins.Web\Plugins.Web.csproj" />
<ProjectReference Include="..\..\src\SemanticKernel.Core\SemanticKernel.Core.csproj" />
<ProjectReference Include="..\NCalcPlugins\NCalcPlugins.csproj" />

<!-- Because some of the referenced projects have dependencies that themselves have System.Text.Json set with a minimum of 7.0. -->
<PackageReference Include="System.Text.Json" />
<PackageVersion Update="System.Text.Json" Version="7.0.3" />
@@ -64,6 +65,8 @@
<EmbeddedResource Include="Resources\30-user-prompt.txt" />
<EmbeddedResource Include="Resources\30-system-prompt.txt" />
<EmbeddedResource Include="Resources\30-user-context.txt" />
<EmbeddedResource Include="Resources\EnglishRoberta\dict.txt" />
<EmbeddedResource Include="Resources\EnglishRoberta\encoder.json" />
<EmbeddedResource Include="Resources\EnglishRoberta\vocab.bpe" />
</ItemGroup>

</Project>
11 changes: 11 additions & 0 deletions dotnet/samples/KernelSyntaxExamples/Resources/EmbeddedResource.cs
@@ -35,4 +35,15 @@ internal static string Read(string fileName)
using var reader = new StreamReader(resource);
return reader.ReadToEnd();
}

internal static Stream? ReadStream(string fileName)
{
// Get the current assembly. Note: this class is in the same assembly where the embedded resources are stored.
Assembly? assembly = typeof(EmbeddedResource).GetTypeInfo().Assembly;
if (assembly == null) { throw new ConfigurationException($"[{s_namespace}] {fileName} assembly not found"); }

// Resources are mapped like types, using the namespace and appending "." (dot) and the file name
var resourceName = $"{s_namespace}." + fileName;
return assembly.GetManifestResourceStream(resourceName);
}
}