Use `Intl.Segmenter` instead of ssplit for segmentation in WASM builds (#945)

* Fix inference in CODEOWNERS file

  This fixes an oversight from a previous PR in which `inference-engine` was renamed to `inference` but the path was not updated in `CODEOWNERS`.

* Improve eslint string-formatting configuration

  This is a miscellaneous change to the eslint config that now allows different string types depending on which kinds of quotes would need to be escaped within the string.

* Add a --force-rebuild flag to WASM build commands

  This commit adds a --force-rebuild flag to the WASM build commands that triggers a rebuild without having to fully clobber and start over.

* Fix misc. formatting in build-bergamot.py

  This commit fixes miscellaneous formatting that looked misaligned in the terminal. For some reason, some emojis need two spaces after them, while other emojis need only one space to achieve the same alignment.

* Rename `appendEndingWhitespace`

  This commit renames `appendEndingWhitespace` to `handleEndingWhitespace`, because this PR makes the whitespace logic more complex and whitespace is no longer guaranteed to be appended.

* Add capability to register languages

  This commit adds the capability for several of the C++ classes to register either a source language tag or a target language tag (depending on their needs). I had experimented with changing the constructors themselves, but maintaining backward compatibility got messy very quickly, with native builds continuing to use `ssplit` and WASM builds now using `Intl.Segmenter`. The least-invasive and cleanest-to-implement compromise that I came up with was to add WASM-specific functionality to register the language tags for classes after construction.

* Implement WASM segmentation with `Intl.Segmenter`

  This is the largest commit of the stack, and likely the one to pay the most attention to.
  In addition to using `Intl.Segmenter` instead of `ssplit` when segmenting text in WASM builds, this patch also necessarily modifies how whitespace is handled during translations. We now have to consider whether the source language and/or target language use whitespace between sentences or omit it. For example:

  * When translating from Chinese to English, whitespace must be added between sentences.
  * When translating from English to Chinese, whitespace must be removed between sentences.
  * When translating from Chinese to Japanese, whitespace must be inserted between sentences for the English pivot, and then removed for the final output.

* Remove WASM dependency on ssplit

  This commit entirely removes the build dependency on `ssplit` when building the WASM target. This ultimately reduces the size of the compiled WASM binary from 5.01 MB to 4.73 MB.

* Bump Bergamot Version 0.4.5 => 0.5.0

* Update WASM Bindings

  Part 1 of 2

  This commit updates the WASM bindings to take the source language and target language tags in order to construct the TranslationModels that now use the locale-specific `Intl.Segmenter`. This effectively takes the `LanguageTranslationModelFiles` object and makes it a sub-object of `TranslationModelPayload`, which includes the language tags as well as the files. This hierarchical separation is ideal, because the `LanguageTranslationModelFiles` object is designed to be iterated over and chunked into aligned memory, whereas the language tags are plain strings that are handled in a distinctly separate way.

* Rework TranslationsEngine to utilize new bindings

  Part 2 of 2

  This commit reworks the TranslationsEngine worker code to use the new bindings implemented in the previous commit.
* Insert whitespace between full-width punctuation and opening quotes

  This commit introduces extra logic to the text cleaning that purposely inserts whitespace into CJK text to trick the segmenter into doing the right thing. See the in-code comment for more context.

* Add `zhen` test model files

  This commit adds our work-in-progress `zhen` model to the repository for use in testing.

* Add test cases for testing `zhen` models.

  This commit adds several test cases for translating from Chinese into other languages, which will both guard against regressions and demonstrate correct segmentation behavior.

* Add temporary `enzh` models for testing

  Part 1 of 2

  The final two commits of this stack may be slightly controversial. We do not currently have a viable `enzh` model, even for testing purposes. However, I need to test the functionality of removing whitespace between sentences for target languages that require it. This patch adds our `enes` models under the `enzh` directory, which will trick the implementation into translating into "Chinese" with a Spanish output. The key difference is that the Spanish output should not include spaces between sentences, which is, in my opinion, good enough for testing in the interim.

* Add makeshift `enzh` tests

  Part 2 of 2

  This patch adds test cases for translating into "Chinese", which at present is actually a Spanish translation that omits spaces between sentences.
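The whitespace rules described in the commit message can be illustrated with a minimal JavaScript sketch. This is not the code from this commit: the helper names and the `NO_SPACE_LANGUAGES` set are hypothetical and exist only for illustration. The only real API used is the standard `Intl.Segmenter` with `granularity: "sentence"`.

```javascript
// Hypothetical set of primary language tags that are written without
// whitespace between sentences. The real implementation may decide this
// differently; this list is an assumption for the sketch.
const NO_SPACE_LANGUAGES = new Set(["zh", "ja"]);

// Returns true when the language is assumed to separate sentences with
// whitespace (hypothetical helper).
function usesSentenceWhitespace(languageTag) {
  const primary = languageTag.split("-")[0].toLowerCase();
  return !NO_SPACE_LANGUAGES.has(primary);
}

// Splits text into sentences with the locale-aware Intl.Segmenter.
// Each returned string is an exact slice of the input, so trailing
// whitespace (if any) stays attached to its sentence.
function segmentSentences(text, languageTag) {
  const segmenter = new Intl.Segmenter(languageTag, {
    granularity: "sentence",
  });
  return Array.from(segmenter.segment(text), (entry) => entry.segment);
}

// Joins translated sentences for the target language: a single space
// between sentences when the target uses whitespace, nothing otherwise.
function joinSentences(sentences, targetLanguageTag) {
  const separator = usesSentenceWhitespace(targetLanguageTag) ? " " : "";
  return sentences
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0)
    .join(separator);
}

// Usage: segment source text, then rejoin as if it had been translated
// sentence by sentence (the translation step itself is omitted here).
const sourceSentences = segmentSentences("Hello there. How are you?", "en");
const rejoinedForChinese = joinSentences(sourceSentences, "zh");
console.log(sourceSentences.length, rejoinedForChinese);
```

In this sketch, a Chinese-to-Japanese translation that pivots through English would call `joinSentences` twice: once with `"en"` for the pivot text, which inserts spaces, and once with `"ja"` for the final output, which removes them.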
Showing 41 changed files with 1,132 additions and 95 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -1 +1 @@` |
| 1 | | `-v0.4.5` |
| | 1 | `+v0.5.0` |