feat!: special tokens encoded by default
Special tokens are now encoded by both the Hugging Face and Tiktoken tokenizers. This is closer to the default behavior on the Python side, and ensures that any tokens a model adds at the beginning or end of a sequence are counted as well.
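As a rough illustration of what this means for chunk sizes, the sketch below uses two hypothetical helpers, `toy_size` and `content_budget`, which are not part of this crate. It assumes whitespace-separated words stand in for tokens and that the model wraps each sequence in exactly one beginning and one end token, as the updated BERT test in this commit suggests.

```rust
/// Hypothetical stand-in for a tokenizer-backed sizer.
/// Counts whitespace-separated words as "tokens" (an assumption, real
/// tokenizers split text differently).
fn toy_size(chunk: &str, add_special_tokens: bool) -> usize {
    let content_tokens = chunk.split_whitespace().count();
    // Assumption: exactly one beginning and one end token per sequence.
    let special_tokens = if add_special_tokens { 2 } else { 0 };
    content_tokens + special_tokens
}

/// Tokens of actual text that still fit in a chunk once the special
/// tokens are counted against the capacity.
fn content_budget(capacity: usize, special_tokens: usize) -> usize {
    capacity.saturating_sub(special_tokens)
}
```

Under these assumptions, `toy_size(" An apple a", false)` is 3 and `toy_size(" An apple a", true)` is 5, and a chunk capacity of 32 tokens leaves `content_budget(32, 2)`, that is 30 tokens, for the text itself. This is why existing chunk boundaries can shift after upgrading.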
benbrandt committed Jan 16, 2025
1 parent 7d72641 commit a72c0ff
Showing 13 changed files with 4,000 additions and 3,268 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
 # Changelog

+## v0.21.0
+
+### Breaking Changes
+
+- Special tokens are now also encoded by both Huggingface and Tiktoken tokenizers. This is closer to the default behavior on the Python side, and should make sure if a model adds tokens at the beginning or end of a sequence, these are accounted for as well.
+
 ## v0.20.1

 ### Fixes
4 changes: 2 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default.

3 changes: 1 addition & 2 deletions Cargo.toml
@@ -1,8 +1,7 @@
 [workspace]
 members = ["bindings/*"]

 [workspace.package]
-version = "0.20.1"
+version = "0.21.0"
 authors = ["Ben Brandt <[email protected]>"]
 edition = "2021"
 description = "Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python."
13 changes: 8 additions & 5 deletions src/chunk_size/huggingface.rs
@@ -33,7 +33,7 @@ impl ChunkSizer for &Tokenizer {
     /// encounters text it can't tokenize.
     fn size(&self, chunk: &str) -> usize {
         let encoding = self
-            .encode(chunk, false)
+            .encode(chunk, true)
             .expect("Unable to tokenize the following string {chunk}");

         let pad_id = self.get_padding().map(|params| params.pad_id);
@@ -61,7 +61,8 @@ mod tests {
     fn returns_size() {
         let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
         let size = tokenizer.size(" An apple a");
-        assert_eq!(size, 3);
+        // Bert has a beginning and end token
+        assert_eq!(size, 5);
     }

     #[test]
@@ -77,7 +78,8 @@ mod tests {
     fn handles_padding() {
         let tokenizer = Tokenizer::from_pretrained("thenlper/gte-small", None).unwrap();
         let size = tokenizer.size("An apple a");
-        assert_eq!(size, 3);
+        // Has a beginning and end token
+        assert_eq!(size, 5);
     }

     #[test]
@@ -87,8 +89,9 @@ mod tests {

         // Need to ensure chunk is large enough to cause Encoding overflows.
         assert_eq!(
-            tokenizer.size("An apple a day keeps the doctor away.".repeat(100).as_str()),
-            900
+            tokenizer.size(" An apple a day keeps the doctor away".repeat(16).as_str()),
+            // Overflows at 128, with special tokens at beginning and end of each section of tokens
+            132
         );
     }
 }
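
The expected value of 132 in the overflow test can be reproduced with some simple arithmetic. The sketch below is not code from this crate: `total_with_overflow` is a hypothetical helper, and it assumes that each repetition of the phrase yields 8 content tokens (128 in total), that the truncation limit of 128 includes the special tokens, and that every overflowing section is wrapped in its own beginning and end token.

```rust
/// Hypothetical helper: total token count when `content_tokens` are split
/// into sections of at most `max_len` tokens, where each section also
/// carries `special_per_section` special tokens inside that limit.
fn total_with_overflow(
    content_tokens: usize,
    max_len: usize,
    special_per_section: usize,
) -> usize {
    assert!(
        max_len > special_per_section,
        "each section needs room for at least one content token"
    );
    // Room left for real text in each section.
    let per_section = max_len - special_per_section;
    // Ceiling division; even an empty input produces one section.
    let sections = ((content_tokens + per_section - 1) / per_section).max(1);
    content_tokens + sections * special_per_section
}
```

With 128 content tokens, a limit of 128, and 2 special tokens per section, the first section holds 126 content tokens and the remaining 2 spill into a second section, so `total_with_overflow(128, 128, 2)` is 128 + 2 * 2 = 132.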
2 changes: 1 addition & 1 deletion src/chunk_size/tiktoken.rs
@@ -5,7 +5,7 @@ use crate::ChunkSizer;
 impl ChunkSizer for &CoreBPE {
     /// Returns the number of tokens in a given text after tokenization.
     fn size(&self, chunk: &str) -> usize {
-        self.encode_ordinary(chunk).len()
+        self.encode_with_special_tokens(chunk).len()
     }
 }

1,172 changes: 655 additions & 517 deletions tests/snapshots/snapshots__romeo_and_juliet_Tokenizers_trim_32.snap

Large diffs are not rendered by default.

1,172 changes: 655 additions & 517 deletions tests/snapshots/snapshots__romeo_and_juliet_Tokenizers_trim_false_32.snap

Large diffs are not rendered by default.

2,392 changes: 1,307 additions & 1,085 deletions tests/snapshots/snapshots__room_with_a_view_Tokenizers_trim_32.snap

Large diffs are not rendered by default.

2,392 changes: 1,307 additions & 1,085 deletions tests/snapshots/snapshots__room_with_a_view_Tokenizers_trim_false_32.snap

Large diffs are not rendered by default.
