Major changes
N/A
New features
- [ALL] Added the SentencePieceNormalizer class in C++/Python. It supports almost the same features as spm_normalize. (Python Sample, C++ Sample)
- [ALL] Added the SentencePieceProcessor::Normalize method in C++/Python. (Python Sample, C++ Sample)
- [ALL] Added functionality to override the normalization spec before processing. (Python Sample)
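
As a rough illustration of the kind of text normalization these features expose, the sketch below applies Unicode NFKC followed by case folding using only the Python standard library. It does not call SentencePiece, and `nfkc_then_casefold` is a hypothetical helper name. SentencePiece's built-in rules are NFKC-based but add their own handling (whitespace, for example), so the library's output may differ from this sketch.

```python
import unicodedata


def nfkc_then_casefold(text: str) -> str:
    """Apply NFKC compatibility normalization, then Unicode case folding.

    Hypothetical helper for illustration only; not part of SentencePiece.
    """
    return unicodedata.normalize("NFKC", text).casefold()


# Fullwidth Latin letters are mapped to their ASCII equivalents by NFKC,
# and case folding then lowercases them.
sample = "ＫＡＤＯＫＡＷＡ"
normalized = nfkc_then_casefold(sample)
print(normalized)
```
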
Bug fixes & minor changes
- Improve support for using external abseil and protobuf #869
- Build universal binary in OSX release package #892
- Add the set_min_log_level function to Python to change the log level from the Python wrapper. #893
- Use the log-sum-exp technique when computing marginal probabilities in n-best tokenization to avoid underflow.
- Support Python 3.12 #932
- Improve thread utilization in batch encoding/decoding.
- Fix a nasty bug in BPE position encoding.
- Fix bugs in the handling of duplicated bigrams.
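
The log-sum-exp item above can be illustrated with a minimal sketch. This is a generic version of the technique, not the code used in SentencePiece. The function names and the example log scores are hypothetical, chosen so that naive exponentiation underflows.

```python
import math


def logsumexp(log_values):
    """Return log(sum(exp(v))) computed stably by factoring out the maximum."""
    m = max(log_values)
    if m == -math.inf:
        # Every term is exp(-inf) == 0, so the sum is 0 and its log is -inf.
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in log_values))


def nbest_marginals(log_scores):
    """Normalize n-best log scores into probabilities that sum to 1."""
    log_z = logsumexp(log_scores)
    return [math.exp(s - log_z) for s in log_scores]


# Hypothetical log scores for three candidate segmentations.
scores = [-1000.0, -1001.0, -1002.0]

# Naive approach: each exp() underflows to 0.0, so the total is 0.0
# and dividing by it would fail.
naive_total = sum(math.exp(s) for s in scores)

# Stable approach: only score differences are exponentiated.
probs = nbest_marginals(scores)
print(naive_total, probs)
```

Subtracting the maximum before exponentiating keeps the largest term at exp(0) = 1, so the sum never collapses to zero even when every raw score is far below the floating-point underflow threshold.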