Leroy "Spydaz" Dyer BSC/MSC edited this page Jul 29, 2023 · 1 revision

Tokenizer (VB.NET)

The Tokenizer is a versatile text processing library written in Visual Basic (VB.NET). It tokenizes text into words, sentences, characters, and n-grams, and is designed to be flexible, customizable, and easy to integrate into your VB.NET projects.

Features

Tokenize text into words
Tokenize text into sentences
Tokenize text into character-level tokens
Generate n-grams from text
Build vocabulary from tokenized text
Normalize input text (lowercase, remove punctuation, etc.)
Remove stop words from tokenized text
Customize tokenization behavior through options such as custom stop words and n-gram size

Usage

Initialize the Tokenizer object:

Dim tokenizer As New Tokenizer()

' Tokenize text
Dim text As String = "Hello, world! This is a sample sentence."
Dim words As List(Of String) = tokenizer.TokenizeToWords(text)
Dim sentences As List(Of String) = tokenizer.TokenizeToSentence(text)
Dim characters As List(Of String) = tokenizer.TokenizeToCharacter(text)

' Generate n-grams
Dim sentence As String = "The quick brown fox jumps over the lazy dog."
Dim ngramSize As Integer = 3
Dim ngrams As List(Of Tokenizer.Token) = tokenizer.CreateNgrams(sentence, ngramSize)
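To make the n-gram idea concrete, here is a minimal sketch of word-level n-gram generation in plain VB.NET. It is not the library's CreateNgrams implementation, whose tokenization rules and Token type are not described on this page. The NGramSketch module and WordNGrams function are hypothetical names, and splitting on single spaces is an assumption, so punctuation stays attached to the neighbouring word.

```vbnet
Imports System
Imports System.Collections.Generic

Module NGramSketch
    ' Hypothetical helper, not part of the Tokenizer library.
    ' Returns every run of n consecutive space-separated words in text.
    Function WordNGrams(text As String, n As Integer) As List(Of String)
        Dim result As New List(Of String)()
        If n < 1 Then Return result

        ' Assumption: words are separated by spaces; empty entries are dropped.
        Dim words As String() = text.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)

        ' The loop body never runs when there are fewer than n words.
        For i As Integer = 0 To words.Length - n
            result.Add(String.Join(" ", words, i, n))
        Next
        Return result
    End Function

    Sub Main()
        For Each gram As String In WordNGrams("The quick brown fox", 3)
            Console.WriteLine(gram)
        Next
        ' Prints:
        ' The quick brown
        ' quick brown fox
    End Sub
End Module
```

A text of w words yields w - n + 1 n-grams, or none when w is smaller than n.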

' Build vocabulary (reuses the words list from the tokenization step above)
tokenizer.UpdateVocabulary(words)
Dim vocabulary As Dictionary(Of String, Integer) = tokenizer.VocabularyWithFrequency

' Stop words
tokenizer.StopWords = New List(Of String) From {"is", "a", "the"} ' Set custom stop words
tokenizer.StopWordRemovalEnabled = True ' Enable stop word removal
tokenizer.NGramSize = 2 ' Set n-gram size
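What stop word removal does to a token list can be sketched without the library. The example below is an illustration only: StopWordSketch and RemoveStopWords are hypothetical names, and the case-insensitive comparison is an assumption, since this page does not say how the Tokenizer compares tokens against StopWords.

```vbnet
Imports System
Imports System.Collections.Generic

Module StopWordSketch
    ' Hypothetical helper, not part of the Tokenizer library.
    ' Returns the tokens that do not appear in stopWords.
    Function RemoveStopWords(tokens As IEnumerable(Of String),
                             stopWords As IEnumerable(Of String)) As List(Of String)
        ' Assumption: stop words match regardless of letter case.
        Dim lookup As New HashSet(Of String)(stopWords, StringComparer.OrdinalIgnoreCase)
        Dim kept As New List(Of String)()
        For Each token As String In tokens
            If Not lookup.Contains(token) Then kept.Add(token)
        Next
        Return kept
    End Function

    Sub Main()
        Dim tokens As New List(Of String) From {"This", "is", "a", "sample", "sentence"}
        Dim stopWords As New List(Of String) From {"is", "a", "the"}
        Console.WriteLine(String.Join(" ", RemoveStopWords(tokens, stopWords)))
        ' Prints: This sample sentence
    End Sub
End Module
```

A HashSet keeps each lookup at roughly constant cost, which matters once the stop word list grows beyond a handful of entries.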

' Tokenize and build vocabulary
Dim tokenizer As New Tokenizer()
Dim text As String = "This is a sample text. It contains multiple sentences."
Dim words As List(Of String) = tokenizer.TokenizeToWords(text)

tokenizer.UpdateVocabulary(words)
Dim vocabulary As Dictionary(Of String, Integer) = tokenizer.VocabularyWithFrequency

For Each token As KeyValuePair(Of String, Integer) In vocabulary
    Console.WriteLine("Token: " & token.Key & ", Frequency: " & token.Value)
Next
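For readers wondering what a frequency vocabulary holds, the following sketch builds a token-to-count dictionary by hand. It is an assumption about the general technique, not the code behind UpdateVocabulary or VocabularyWithFrequency. VocabularySketch and CountFrequencies are hypothetical names.

```vbnet
Imports System
Imports System.Collections.Generic

Module VocabularySketch
    ' Hypothetical helper, not part of the Tokenizer library.
    ' Counts how many times each token occurs (case-sensitive).
    Function CountFrequencies(tokens As IEnumerable(Of String)) As Dictionary(Of String, Integer)
        Dim counts As New Dictionary(Of String, Integer)()
        For Each token As String In tokens
            Dim current As Integer = 0
            ' TryGetValue leaves current at 0 when the token is new.
            counts.TryGetValue(token, current)
            counts(token) = current + 1
        Next
        Return counts
    End Function

    Sub Main()
        Dim tokens As New List(Of String) From {"a", "b", "a", "c", "a"}
        Dim counts As Dictionary(Of String, Integer) = CountFrequencies(tokens)
        Console.WriteLine(counts("a")) ' Prints: 3
        Console.WriteLine(counts("b")) ' Prints: 1
    End Sub
End Module
```

Because the default dictionary comparer is case-sensitive, "The" and "the" are counted separately here unless the tokens are lowercased first.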

' Tokenize to characters
Dim tokenizer As New Tokenizer()
Dim text As String = "Hello, world!"
Dim characters As List(Of String) = tokenizer.TokenizeToCharacter(text)

For Each character As String In characters
    Console.WriteLine(character)
Next
