Activity

Refactored auxiliary functions for segmenting with indices, with a bi…

GitMew pushed 1 commit to master • f4421b9…dbd1f8d • 13 hours ago

AV evaluation now uses ChainedCounters to be able to average over fix…

GitMew pushed 1 commit to master • 67c17df…f4421b9 • 5 days ago

Added ChainedCounter to the dictionary utils, for future moving-avera…

GitMew pushed 1 commit to master • b1dd593…67c17df • 9 days ago

Added LZW vocabulariser.

GitMew pushed 1 commit to master • 5319c66…b1dd593 • 11 days ago

Overhauled the accessor variety (AV) evaluation with many more metrics.

GitMew pushed 1 commit to master • 8f83323…5319c66 • 13 days ago

Refactoring, and added a method for adding a progress bar to a NamedI…

GitMew pushed 1 commit to master • 518220c…8f83323 • 15 days ago

Added multiplexer that selects based on maximal compression, and adde…

GitMew pushed 1 commit to master • cee9995…518220c • 18 days ago

Added two new metrics for segmentation diversity.

GitMew pushed 1 commit to master • 0177ad4…cee9995 • 27 days ago

Added more exceptional cases to Rényi entropy, and fixed two bugs wit…

GitMew pushed 1 commit to master • 082bdd7…0177ad4 • 29 days ago

Heavily refactored VOLT code.

GitMew pushed 2 commits to master • 8d5c72f…082bdd7 • on Feb 5

Improved and added several evaluation tools.

GitMew pushed 1 commit to master • a7867dd…8d5c72f • on Feb 4

Added iterable implementation of integerPartitions_k.

GitMew pushed 1 commit to master • 53479db…a7867dd • on Jan 21

Better timer, fixed graph samplers, added identity tokeniser.

GitMew pushed 1 commit to master • 040c3aa…53479db • on Jan 18

Added graph-based rejection sampler for uniformly random segmentations.

GitMew pushed 1 commit to master • dc8d37c…040c3aa • on Jan 17

Added function for filtering Nones from iterables.

GitMew pushed 1 commit to master • 823bef4…dc8d37c • on Jan 12

Small fix in KudoPiece32ki deserialiser.

GitMew pushed 1 commit to master • 986858e…823bef4 • on Jan 4

Added histogram-based visualisation for token length and token amount.

GitMew pushed 1 commit to master • 436f4b8…986858e • on Dec 27, 2024

Miscellaneous refactoring.

GitMew pushed 1 commit to master • 7c95b78…436f4b8 • on Dec 14, 2024

Expanded README and added a logo.

GitMew pushed 1 commit to master • 423c7a3…7c95b78 • on Dec 4, 2024

Major preprocessing overhaul, now with full support for SentencePiece.

GitMew pushed 1 commit to master • 3cae630…423c7a3 • on Dec 3, 2024

Refactored tktkt.builders and tktkt.files into one submodule tktkt.fa…

GitMew pushed 1 commit to master • 8fa33f0…3cae630 • on Nov 30, 2024

Moved vocabulary builders away from the tokeniser builders because th…

GitMew pushed 1 commit to master • f2a2eb6…8fa33f0 • on Nov 28, 2024

Bugfixes and improvements.

GitMew pushed 1 commit to master • f17b652…f2a2eb6 • on Nov 26, 2024

Bugfixes and added nlpaug support.

GitMew pushed 1 commit to master • a5eb17b…f17b652 • on Nov 23, 2024

Added forwards and backwards segmentation graph samplers for usage in…

bauwenst pushed 1 commit to master • de9f4e5…a5eb17b • on Nov 3, 2024

Miscellaneous improvements.

bauwenst pushed 1 commit to master • 8449b2a…de9f4e5 • on Oct 31, 2024

Vocabulary builders to complement tokeniser builders.

bauwenst pushed 1 commit to master • 341ae85…8449b2a • on Oct 29, 2024

Fixed DeL's text data not being installed with the package (non-edita…

bauwenst pushed 1 commit to master • a1e279e…341ae85 • on Oct 27, 2024

Added more BPE visualisation examples and fixed small bug.

bauwenst pushed 1 commit to master • 1c2f645…a1e279e • on Oct 22, 2024

Small fix in a perturber and a printer.

bauwenst pushed 1 commit to master • e4de44f…1c2f645 • on Oct 15, 2024