- Type stubs for the Python bindings are now available, allowing better static code analysis, better code completion in supported IDEs and easier understanding of the library's API.
- The method `LanguageDetector.detect_multiple_languages_of` still returned character indices instead of byte indices when only a single `DetectionResult` was produced. This has been fixed.
- The method `LanguageDetector.detect_multiple_languages_of` returns byte indices. For creating string slices in Python and JavaScript, character indices are needed but were not provided. This resulted in incorrect `DetectionResult`s for Python and JavaScript. It has now been fixed by converting the byte indices to character indices.
- Some minor bugs in the WASM module have been fixed to prepare for the first release of Lingua for JavaScript.
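As a minimal illustration of the kind of conversion involved (a sketch, not Lingua's actual implementation), a byte index into a UTF-8 string can be mapped to a character index by counting the characters that precede it:

```rust
// Convert a byte index into a UTF-8 string to the corresponding
// character index, as needed for string slicing in Python/JavaScript.
// Illustrative sketch only; `byte_index` must lie on a char boundary.
fn byte_index_to_char_index(s: &str, byte_index: usize) -> usize {
    s[..byte_index].chars().count()
}

fn main() {
    let text = "Grüße, 世界"; // 'ü', 'ß' and the CJK chars are multi-byte
    let byte_idx = text.find('世').unwrap();
    let char_idx = byte_index_to_char_index(text, byte_idx);
    // The byte offset of '世' is larger than its character offset.
    assert_eq!(byte_idx, 9);
    assert_eq!(char_idx, 7);
    println!("byte index: {byte_idx}, char index: {char_idx}");
}
```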
- Python bindings are now available for the library. These bindings replace the pure Python implementation of Lingua so that any Python software can benefit from Rust's performance. (#262)
- Parallel equivalents for all methods in `LanguageDetector` have been added to give the user the choice of using the library single-threaded or multi-threaded. (#271)
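The idea behind the parallel variants, processing many texts concurrently instead of one after another, can be sketched with scoped threads from the standard library (illustrative only; this is not Lingua's actual implementation, and a stand-in character count replaces real detection):

```rust
use std::thread;

// Run a stand-in "detect" step (here: character counting) on several
// texts concurrently using scoped threads. Sketch of the single- vs
// multi-threaded choice, not Lingua's actual code.
fn char_counts_parallel(texts: &[String]) -> Vec<usize> {
    thread::scope(|s| {
        // Spawn all threads first so they actually run in parallel,
        // then join them in order to keep results aligned with inputs.
        let handles: Vec<_> = texts
            .iter()
            .map(|t| s.spawn(move || t.chars().count()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let texts = vec!["languages".to_string(), "are".to_string()];
    assert_eq!(char_counts_parallel(&texts), vec![9, 3]);
}
```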
- Several bugs in multiple-language detection that caused incomplete results in some cases have been fixed.
- A significant amount of Kazakh text was incorrectly classified as Mongolian. This has been fixed.
- The new method `LanguageDetector.detect_multiple_languages_of()` has been introduced. It allows detecting multiple languages in mixed-language text. (#1)
- The new method `LanguageDetectorBuilder.with_low_accuracy_mode()` has been introduced. Activating it reduces detection accuracy for short text in favor of a smaller memory footprint and faster detection performance. (#119)
- The new method `LanguageDetector.compute_language_confidence()` has been introduced. It allows retrieving the confidence value for one specific language only, given the input text. (#102)
- The computation of the confidence values has been revised: the softmax function is now applied to the values, making them easier to compare because they behave more like real probabilities. (#120)
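How the softmax turns raw scores into probability-like values can be sketched as follows (a simplified illustration with made-up scores, not Lingua's actual code):

```rust
// Apply the softmax function to raw scores so the results are positive,
// sum to 1, and can be compared like probabilities.
fn softmax(scores: &[f64]) -> Vec<f64> {
    // Subtract the maximum score for numerical stability.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Hypothetical raw log-likelihood-style scores for three languages.
    let probs = softmax(&[-2.3, -4.1, -5.7]);
    // The values now sum to 1 while preserving the original ranking.
    let total: f64 = probs.iter().sum();
    assert!((total - 1.0).abs() < 1e-9);
    assert!(probs[0] > probs[1] && probs[1] > probs[2]);
    println!("{probs:?}");
}
```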
- The WASM API has been revised. It now uses the same builder pattern as the Rust API. (#122)
- The language model files are now compressed with the Brotli algorithm, which reduces the file size by 15 % on average. (#189)
- The language model ngrams are now stored in a `CompactString` type, which reduces the amount of consumed memory by 20 %. (#198)
- Several performance optimizations have been applied, making the library nearly twice as fast as the previous version. Big thanks go out to @serega and @koute for their help. (#82, #148, #177)
- The enums `IsoCode639_1` and `IsoCode639_3` now implement some new traits such as `Copy`, `Hash` and Serde's `Serialize` and `Deserialize`. The enum `Language` now implements `Copy` as well. (#175)
- The library can now be compiled to WebAssembly and be used in any JavaScript project. Big thanks to @martindisch for bringing this forward. (#14)
- Some minor performance tweaks have been applied to the rule engine.
- This release updates outdated dependencies and fixes an incompatibility between different versions of the `include_dir` crate used in the main `lingua` crate and the language model crates.
- Another compilation error that occurred when the Latin language was left out as a Cargo feature has been fixed.
- When Chinese, Japanese or Korean were left out as Cargo features, there were compilation errors. This has been fixed.
- The language model dependencies are separate Cargo features now. Users can decide which languages shall be downloaded and used in the library. (#12)
- The code that does the lazy-loading of the language models has been refactored significantly, making the code more stable and less error-prone.
- In very rare cases, the language returned by the detector was non-deterministic. This has been fixed. Big thanks to @asg0451 for identifying this problem. (#17)
- The enums `Language`, `IsoCode639_1` and `IsoCode639_3` now implement `std::str::FromStr` so that enum variants can be instantiated from string values. This comes in handy for JavaScript bindings and the like. (#15)
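The `FromStr` pattern these enums follow can be sketched with a toy enum (the variants and error type below are illustrative, not Lingua's actual ones):

```rust
use std::str::FromStr;

// A toy enum demonstrating the FromStr pattern; the real Language
// enum in Lingua has many more variants.
#[derive(Debug, PartialEq)]
enum Language {
    English,
    German,
}

impl FromStr for Language {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "English" => Ok(Language::English),
            "German" => Ok(Language::German),
            other => Err(format!("unknown language: {other}")),
        }
    }
}

fn main() {
    // `str::parse` dispatches to FromStr, which is convenient for
    // bindings that receive language names as plain strings.
    let lang: Language = "German".parse().unwrap();
    assert_eq!(lang, Language::German);
    assert!("Klingon".parse::<Language>().is_err());
}
```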
- The performance of preloading the language models has been improved.
- Language detection for sentences with more than 120 characters was supposed to be done by iterating through trigrams only, but this was never the case. This has been corrected.
- Language detection for sentences with more than 120 characters now performs more quickly by iterating through trigrams only, which is enough to achieve high detection accuracy.
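Iterating through the character trigrams of a text can be sketched like this (a simplified illustration, not the library's internal model lookup):

```rust
// Collect all character trigrams of a text. Working on chars rather
// than bytes keeps multi-byte UTF-8 characters intact.
fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // `windows(3)` yields every run of three consecutive characters;
    // texts shorter than three characters produce no trigrams.
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

fn main() {
    assert_eq!(trigrams("hello"), vec!["hel", "ell", "llo"]);
    assert!(trigrams("ab").is_empty());
}
```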
- Textual input that includes logograms from Chinese, Japanese or Korean is now split at each logogram and not only at whitespace. This provides for more reliable language detection for sentences that include multi-language content.
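The splitting behaviour can be illustrated with a simplified sketch that treats each character in the basic CJK Unified Ideographs block as its own token (the real library covers more scripts and code point ranges):

```rust
// Split text at whitespace and additionally at each CJK ideograph,
// so logograms become individual tokens. Simplified sketch: only the
// basic CJK Unified Ideographs block (U+4E00..=U+9FFF) is handled.
fn is_logogram(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

fn split_with_logograms(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if c.is_whitespace() {
            // Whitespace ends the current token.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
        } else if is_logogram(c) {
            // A logogram ends the current token and stands alone.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(c.to_string());
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

fn main() {
    let tokens = split_with_logograms("hello 你好 world");
    assert_eq!(tokens, vec!["hello", "你", "好", "world"]);
}
```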
- Errors in the rule engine for the Latvian language have been resolved.
- Corrupted characters in the Latvian test data have been corrected.
- A `LanguageDetector` can now be built with lazy-loading of required language models on demand (default) or with preloading all language models at once by calling `LanguageDetectorBuilder.with_preloaded_language_models()`. (#10)
- The Maori language is now supported. Thanks to @eekkaiia for the contribution. (#5)
- Loading and searching the language models has been quite slow so far. Using parallel iterators from the Rayon library, this process is now at least 50% faster, depending on how many CPU cores are available. (#8)
- Accuracy reports are now also generated for the CLD2 library and included in the language detector comparison plots. (#6)
- Lingua could not be used within other projects because a private serde module was accidentally exposed. Thanks to @luananama for reporting this bug. (#9)
- Accidentally, bug #3 was only partially fixed. This has been corrected.
- When trying to create new language models, the `LanguageModelFilesWriter` panicked when it encountered characters in a text corpus that consist of multiple bytes. Thanks to @eekkaiia for reporting this bug. (#3)
This is the very first release of Lingua for Rust. It took me 5 months of hard work in my free time. I hope you find it useful. :)