Use Intl.Segmenter instead of ssplit for segmentation in WASM bui…

…lds (#945) * Fix inference in CODEOWNERS file This fixes an oversight from a previous PR when `inference-engine` was renamed to `inference`, however the path was not updated in `CODEOWNERS`. * Improve eslint string-formatting configuration This is a miscellaneous change to the eslint config that now allows different string types based on whether certain types of quotes need to be escaped within the string. * Add a --force-rebuild flag to WASM build commands This commit adds a --force-rebuild flag to the WASM build commands that will trigger a rebuild without having to fully clobber and start over. * Fix misc. formatting in build-bergamot.py This commit fixes miscellaneous formatting that I noticed looked misaligned in the terminal. For some reason, some emojis need two spaces after them, when other emojis only need one space to achieve the same alignment. * Rename `appendEndingWhitespace` This commit renames `appendEndingWhitespace` to `handleEndingWhitespace`, because the whitespace logic will be made more complex by this PR, and whitspace is no longer guaranteed to be appended. * Add capability to register languages This commit adds the capability for several of the C++ classes to register either a source language tag or a target language tag (depending on their needs). I had experimented with changing the constructors themselves, but mtaintaining backward compatibility got messy very quickly with native builds continuing to use `ssplit` and WASM builds now using `Intl.Segmenter`. The least-invasive and cleanest-to-implement compromise that I came up with was to add WASM-specific functionality to register the language tags for classes after construction. * Implement WASM segmentation with `Intl.Segmenter` This is the largest commit of the stack, and likely the one to pay the most attention to. In addition to utilizing `Intl.Segmenter` instead of `ssplit` when segmenting text in WASM builds, this patch also necessarily modifies the logic of how whitespce is handled during translations. We now have to concern ourselves with whether the source language and/or target language utilize whitespace between sentences or omit whitespace between sentences. For example: * When translating from Chinese to English, then whitespace must be added between sentences. * When translating from English to Chinese, then whitespace must be removed between sentences. * When translating form Chinese to Japanese, then whitesapce must be inserted between sentences for the English pivot, and then removed for the final output. * Remove WASM dependency on ssplit This commit entirely removes the build dependency on `ssplit` when building the WASM target. This actually ultimately reduces the size of the compiled WASM binary from 5.01 MB to 4.73 MB. * Bump Bergamot Version 0.4.5 => 0.5.0 * Update WASM Bindings Part 1 of 2 This commit updates the WASM bindings to take the source language and target language tags in order to construct the TranslationModels that now utilize the locale-specific `Intl.Segmenter`. This effectively takes the `LanguageTranslationModelFiles` object and makes that a sub-object of `TranslationModelPayload`, which includes the language tags as well as the files. This hierarchical separation is ideal, because the `LanguageTranslationModelFiles` object is designed to be iterated over and chunked into aligned memory, where as the language tags are plain strings that are distinctly separate in the way that they are handled. * Rework TranslationsEngine to utilize new bindings Part 2 of 2 This commit reworks the TranslationsEngine worker code to utilize the new bindings implemented in the previous commit. * Insert whitespace between full-width punctuation and opening quotes This commit introduces extra logic to the text cleaning that purposely inserts whitespace into CJK text to trick the segmenter into doing the right thing. See the in-code comment for more context. * Add `zhen` test model files This commit adds our work-in-progress `zhen` model to the repository for use in testing. * Add test cases for testing `zhen` models. This commit adds several test cases for translating from Chinese into other languages, which will both guard against regressions and demonstrate correct segmentation behavior. * Add temporary `enzh` models for testing Part 1 of 2 The final two commits of this stack may be slightly controversial. We do not currently have a viable `enzh` model, even for testing purposes, however, I need to test the functionality of removing whitespace between sentences for target languages that require it. This patch adds our `enes` models under the `enzh` directory, which will trick the implementation into translating into "Chinese" with a Spanish output. The key difference is that the Spanish output should not include spaces between sentences, which is, in my opinion, good enough for testing in the interim. * Add makeshift `enzh` tests Part 2 of 2 This patch adds test cases for translating into "Chinese", which at present, is actually a Spanish translation that omits spaces between sentences.
mozilla · Dec 3, 2024 · f901047 · f901047
1 parent 9e8641b
commit f901047
Show file tree

Hide file tree

Showing 41 changed files with 1,132 additions and 95 deletions.
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -21,7 +21,7 @@ tests @mozilla/translations-training
 tracking @mozilla/translations-training
 
 # Translations Inference review group
-inference-engine @mozilla/translations-inference
+inference @mozilla/translations-inference
 
 # Taskcluster pipeline related files. Changes to these ought to be reviewed by
 # RelEng to watch for security issues and best practices. These should also

diff --git a/inference/3rd_party/CMakeLists.txt b/inference/3rd_party/CMakeLists.txt
@@ -8,16 +8,18 @@ if(COMPILE_WASM)
   add_link_options(${WASM_LINK_FLAGS})
 endif(COMPILE_WASM)
 
-add_subdirectory(ssplit-cpp EXCLUDE_FROM_ALL)
 
 # Add include directories for 3rd party targets to be able to use it anywhere in the
 # project without explicitly specifying their include directories. Once they
 # fixe this problem, it can be removed.
 get_property(INCDIRS DIRECTORY browsermt-marian-dev/src PROPERTY INCLUDE_DIRECTORIES)
 target_include_directories(marian PUBLIC ${INCDIRS})
 
-get_property(INCLUDE_DIRECTORIES DIRECTORY ssplit-cpp/src PROPERTY INCLUDE_DIRECTORIES)
-target_include_directories(ssplit PUBLIC ${INCLUDE_DIRECTORIES})
+if(NOT COMPILE_WASM)
+  add_subdirectory(ssplit-cpp EXCLUDE_FROM_ALL)
+  get_property(INCLUDE_DIRECTORIES DIRECTORY ssplit-cpp/src PROPERTY INCLUDE_DIRECTORIES)
+  target_include_directories(ssplit PUBLIC ${INCLUDE_DIRECTORIES})
+endif(NOT COMPILE_WASM)
 
 get_property(COMPILE_DEFINITIONS DIRECTORY browsermt-marian-dev PROPERTY COMPILE_DEFINITIONS) 
 target_compile_definitions(marian PUBLIC ${COMPILE_DEFINITIONS})

diff --git a/inference/BERGAMOT_VERSION b/inference/BERGAMOT_VERSION
@@ -1 +1 @@
-v0.4.5
+v0.5.0
diff --git a/inference/CMakeLists.txt b/inference/CMakeLists.txt
@@ -88,12 +88,14 @@ cmake_dependent_option(ENABLE_CACHE_STATS "Enable stats on cache" ON "COMPILE_TE
 SET(COMPILE_CUDA OFF CACHE BOOL "Compile GPU version")
 SET(USE_SENTENCEPIECE ON CACHE BOOL "Download and compile SentencePiece")
 SET(USE_STATIC_LIBS ON CACHE BOOL "Link statically against non-system libs")
-SET(SSPLIT_COMPILE_LIBRARY_ONLY ON CACHE BOOL "Do not compile ssplit tests")
+
+if (NOT COMPILE_WASM)
+  SET(SSPLIT_COMPILE_LIBRARY_ONLY ON CACHE BOOL "Do not compile ssplit tests")
+endif(NOT COMPILE_WASM)
+
 if (USE_WASM_COMPATIBLE_SOURCE)
   SET(COMPILE_LIBRARY_ONLY ON CACHE BOOL "Build only the Marian library and exclude all executables.")
   SET(USE_MKL OFF CACHE BOOL "Compile with MKL support")
-  # # Setting the ssplit-cpp submodule specific cmake options for wasm
-  SET(SSPLIT_USE_INTERNAL_PCRE2 ON CACHE BOOL "Use internal PCRE2 instead of system PCRE2")
 endif()
 
 # Documentation: https://cliutils.gitlab.io/modern-cmake/chapters/projects/submodule.html

diff --git a/inference/scripts/build-wasm.py b/inference/scripts/build-wasm.py
@@ -37,6 +37,7 @@
     formatter_class=argparse.RawTextHelpFormatter,
 )
 parser.add_argument("--clobber", action="store_true", help="Clobber the build artifacts")
+parser.add_argument("--force-rebuild", action="store_true", help="Force rebuilding the artifacts")
 parser.add_argument(
     "--debug",
     action="store_true",
@@ -153,7 +154,7 @@ def build_bergamot(args: Optional[list[str]]):
     if not os.path.exists(BUILD_PATH):
         os.mkdir(BUILD_PATH)
 
-    print("\n🖌️ Applying source code patches\n")
+    print("\n🖌️  Applying source code patches\n")
     for repo_path, patch_path in patches:
         apply_git_patch(repo_path, patch_path)
 
@@ -200,7 +201,7 @@ def run_shell(command):
             print("Please try running again with -j 1.")
             raise
 
-        print("\n🪚 Patching Bergamot for gemm support\n")
+        print("\n🪚  Patching Bergamot for gemm support\n")
         subprocess.check_call(["bash", GEMM_SCRIPT, BUILD_PATH])
 
         print("\n✅ Build complete\n")
@@ -223,7 +224,7 @@ def run_shell(command):
         prepare_js_artifact()
 
     finally:
-        print("\n🖌️ Reverting the source code patches\n")
+        print("\n🖌️  Reverting the source code patches\n")
         for repo_path, patch_path in patches[::-1]:
             revert_git_patch(repo_path, patch_path)
 
@@ -236,6 +237,7 @@ def main():
         and os.path.isdir(BUILD_PATH)
         and os.listdir(BUILD_PATH)
         and not args.clobber
+        and not args.force_rebuild
     ):
         print(f"\n🏗️  Build directory {BUILD_PATH} already exists and is non-empty.\n")
         print("   Pass the --clobber flag to rebuild if desired.")

diff --git a/inference/scripts/test-wasm.py b/inference/scripts/test-wasm.py
@@ -33,6 +33,9 @@ def main():
     )
 
     parser.add_argument("--clobber", action="store_true", help="Clobber the build artifacts")
+    parser.add_argument(
+        "--force-rebuild", action="store_true", help="Force rebuilding the artifacts"
+    )
     parser.add_argument(
         "--debug",
         action="store_true",
@@ -49,6 +52,8 @@ def main():
     build_command = [sys.executable, build_wasm_script]
     if args.clobber:
         build_command.append("--clobber")
+    if args.force_rebuild:
+        build_command.append("--force-rebuild")
     if args.debug:
         build_command.append("--debug")
     if args.j:

diff --git a/inference/src/tests/units/html_tests.cpp b/inference/src/tests/units/html_tests.cpp
@@ -762,7 +762,7 @@ TEST_CASE("End-to-end translation", "[!mayfail]") {
         string_view(sentence_str.data() + 42, 0),  // 0.7 ""
     };
     source.appendSentence("", sentence.begin(), sentence.end());
-    source.appendEndingWhitespace("</p>");
+    source.handleEndingWhitespace("</p>", /* isBetweenSentences */ false);
 
     CHECK(asTokens(response.source) == asTokens(source));
   }
@@ -783,7 +783,7 @@ TEST_CASE("End-to-end translation", "[!mayfail]") {
         string_view(sentence_str.data() + 52, 0),   // 0.8 ""
     };
     target.appendSentence("", sentence.begin(), sentence.end());
-    target.appendEndingWhitespace("</p>");
+    target.handleEndingWhitespace("</p>", /* isBetweenSentences */ false);
 
     CHECK(asTokens(response.target) == asTokens(target));
   }
@@ -853,7 +853,7 @@ TEST_CASE("End-to-end translation when no words with markup align", "[!mayfail]"
         string_view(sentence_str.data() + 42, 0),  // 0.7 ""
     };
     source.appendSentence("", sentence.begin(), sentence.end());
-    source.appendEndingWhitespace("</p>");
+    source.handleEndingWhitespace("</p>", /* isBetweenSentences */ false);
 
     CHECK(asTokens(response.source) == asTokens(source));
   }
@@ -871,7 +871,7 @@ TEST_CASE("End-to-end translation when no words with markup align", "[!mayfail]"
         string_view(sentence_str.data() + 39, 0),   // 0.6 [EOS]
     };
     target.appendSentence("", sentence.begin(), sentence.end());
-    target.appendEndingWhitespace("</p>");
+    target.handleEndingWhitespace("</p>", /* isBetweenSentences */ false);
 
     CHECK(asTokens(response.target) == asTokens(target));
   }

diff --git a/inference/src/translator/CMakeLists.txt b/inference/src/translator/CMakeLists.txt
@@ -38,7 +38,12 @@ if(ENABLE_CACHE_STATS)
     target_compile_definitions(bergamot-translator-source PUBLIC ENABLE_CACHE_STATS)
 endif(ENABLE_CACHE_STATS)
 
-target_link_libraries(bergamot-translator-source marian ssplit)
+if(NOT COMPILE_WASM)
+  target_link_libraries(bergamot-translator-source marian ssplit)
+elseif(COMPILE_WASM)
+  target_link_libraries(bergamot-translator-source marian)
+endif(COMPILE_WASM)
+
 
 target_include_directories(bergamot-translator-source
     PUBLIC ${PROJECT_SOURCE_DIR}

diff --git a/inference/src/translator/annotation.cpp b/inference/src/translator/annotation.cpp
@@ -15,7 +15,7 @@ void AnnotatedText::appendSentence(string_view prefix, std::vector<string_view>:
   assert(annotation.token_begin_.back() == text.size());
 
   // prefix is just end of the previous one.
-  appendEndingWhitespace(prefix);
+  handleEndingWhitespace(prefix, /* isBetweenSentences */ true);
 
   // Appending sentence text.
   std::size_t offset = text.size();
@@ -29,13 +29,95 @@ void AnnotatedText::appendSentence(string_view prefix, std::vector<string_view>:
   }
 
   // Add the gap after the sentence.  This is empty for now, but will be
-  // extended with appendEndingWhitespace or another appendSentence.
+  // extended with handleEndingWhitespace or another appendSentence.
   annotation.gap_.push_back(annotation.token_begin_.size() - 1);
   annotation.token_begin_.push_back(offset);
 }
 
-void AnnotatedText::appendEndingWhitespace(string_view whitespace) {
-  text.append(whitespace.data(), whitespace.size());
+bool AnnotatedText::shouldOmitSpaceBetweenSentences() const {
+  if (targetLanguage_.empty()) {
+    // The target language is not specified, so we should not make assumptions about
+    // whether or not the language's script should omit whitespace.
+    return false;
+  }
+
+  // TODO(https://github.com/mozilla/translations/issues/950)
+  // More robustly handle which language tags should omit whitespace between sentences.
+  return (
+    // Japanese does not use space between sentences.
+    targetLanguage_ == "ja" ||
+    // Korean does not use space between sentences.
+    targetLanguage_ == "ko" ||
+    // Chinese does not use space between sentences.
+    targetLanguage_ == "zh"
+  );
+}
+
+bool AnnotatedText::shouldEnsureSpaceBetweenSentences() const {
+  if (targetLanguage_.empty()) {
+    // The target language is not specified, so we should not make assumptions about
+    // whether or not the language's script should omit whitespace.
+    return false;
+  }
+
+  return !shouldOmitSpaceBetweenSentences();
+}
+
+void AnnotatedText::maybeAppendHTMLTagsFromGap(string_view gap) {
+  // We can be sure that the gap between sentences is one of the following:
+  //  - Empty
+  //  - Whitespace
+  //  - One or more well-formed HTML tags, e.g. "</b></em>".
+  //  - A mixture of whitespace and HTML tags, e.g. "</b></em>  ".
+  size_t currentIndex = 0;
+  while (currentIndex < gap.size()) {
+    // Find the next open bracket '<' for an HTML tag.
+    size_t tagStart = gap.find('<', currentIndex);
+    if (tagStart == string_view::npos) {
+      // No more HTML tags were found.
+      return;
+    }
+
+    // Find the matching closing bracket '>' for this HTML tag.
+    size_t tagEnd = gap.find('>', tagStart + 1);
+    if (tagEnd == string_view::npos) {
+      // The tag is missing its closing angle bracket.
+      // This should never happen, since the DOM parser should ensure the tags are well formed.
+      // But if we do encounter this case, the best thing we can do at this point ignore the tag.
+      return;
+    }
+
+    size_t tagLength = 1 + tagEnd - tagStart;
+    string_view tag = gap.substr(tagStart, tagLength);
+    text.append(tag.data(), tag.size());
+
+    currentIndex = tagEnd + 1;
+  }
+}
+
+void AnnotatedText::handleEndingWhitespace(string_view gap, bool isBetweenSentences) {
+  if (gap.find("\n") != string_view::npos) {
+    // The gap contains a line break, so we should preserve it regardless.
+    text.append(gap.data(), gap.size());
+  } else if (shouldOmitSpaceBetweenSentences()) {
+    // Even if we are supposed to omit gap between sentences, the gap between
+    // the sentences may contain HTML tags that we still need to preserve.
+    maybeAppendHTMLTagsFromGap(gap);
+  } else if (!gap.empty()) {
+    // We are not explicitly omitting gap and there is gap to preserve.
+    text.append(gap.data(), gap.size());
+  } else if (
+    // This is a gap between sentences (i.e. not at the end of the whole text).
+    isBetweenSentences &&
+    // This the current language/script should have space between sentences.
+    shouldEnsureSpaceBetweenSentences() &&
+    // The previous sentence is not empty.
+    !text.empty()
+  ) {
+    // The given gap was empty, but but target language requires a space between sentences.
+    text += ' ';
+  }
+
   annotation.token_begin_.back() = text.size();
 }
 

diff --git a/inference/src/translator/annotation.h b/inference/src/translator/annotation.h
@@ -135,14 +135,15 @@ struct AnnotatedText {
 
   /// Appends a sentence to the existing text and transparently rebases
   /// string_views.  Since this tracks only prefix, remember
-  /// appendEndingWhitespace.
+  /// handleEndingWhitespace.
   /// The string_views must not already be in text.
   void appendSentence(string_view prefix, std::vector<string_view>::iterator tokens_begin,
                       std::vector<string_view>::iterator tokens_end);
 
   /// Append the whitespace at the end of input. string_view must not be in
-  /// text.
-  void appendEndingWhitespace(string_view whitespace);
+  /// text. In WASM builds, whitespace may be omitted or inserted depending
+  /// on the script of the target language.
+  void handleEndingWhitespace(string_view whitespace, bool isBetweenSentences);
 
   /// Record the existence of a sentence that is already in text.  The
   /// iterators are over string_views for each token that must be in text
@@ -185,13 +186,21 @@ struct AnnotatedText {
   /// Returns a ByteRange representing sentence corresponding to sentenceIdx.
   ByteRange sentenceAsByteRange(size_t sentenceIdx) const { return annotation.sentence(sentenceIdx); }
 
+  /// Registers the target language that the annotated text will be translated into.
+  /// For WASM builds that support CJK languages, this has an effect on the whitespace
+  /// that may or may not be inserted into the resulting text.
+  void registerTargetLanguage(const std::string& language) {
+    targetLanguage_ = language;
+  }
+
   /// Utility function to call `fun` on each word (subword token effectively) in
   /// an `AnnotatedText`. `fun` is called with the `ByteRange`, the `string_view`
   /// with the word, and a `bool` to indicate whether it is the last word in the
   /// `AnnotatedText`, which is also the ending whitespace slot of AnnotatedText.
   template <typename Fun>
   AnnotatedText apply(Fun fun) const {
     AnnotatedText out;
+    out.registerTargetLanguage(targetLanguage_);
 
     for (size_t sentenceIdx = 0; sentenceIdx < numSentences(); ++sentenceIdx) {
       std::string sentence;
@@ -215,12 +224,54 @@ struct AnnotatedText {
       out.appendSentence(prefix, views.begin(), views.end());
     }
 
-    out.appendEndingWhitespace(fun(annotation.gap(numSentences()), gap(numSentences()), true));
+    out.handleEndingWhitespace(fun(annotation.gap(numSentences()), gap(numSentences()), true),
+                               /* isBetweenSentences */ false);
 
     return out;
   }
 
  private:
+  /// The target language that this annotated text will be translated into.
+  /// This remains empty for non-WASM builds that use ssplit for segmentation.
+  /// This value is populated in WASM builds that utilize a locale-specific segmenter.
+  ///
+  /// Note: This is not excluded from non-WASM builds using preprocessor directives because
+  ///       simply including it in the program logic greatly reduces the number of locations
+  ///       where preprocessor directives would be required.
+  std::string targetLanguage_ = "";
+
+  /// Some languages do not utilize space between sentences. If we are translating from a language
+  /// that does utilize spaces into a language that does not utilize spaces, then we need to know whether
+  /// the space between sentences should be removed in the final text.
+  ///
+  /// Returns true if the translation should ensure that whitespace is omitted between sentences
+  /// in the target language. This is dependent on the script of the target language.
+  bool shouldOmitSpaceBetweenSentences() const;
+
+  /// Some languages utilize space between sentences. If we are translating from a language that
+  /// does not utilize spaces into a language that does utilize spaces, then we need to know whether
+  /// the space between sentences should be ensured in the final text.
+  ///
+  /// Returns true if the translation should ensure that whitespace is inserted between sentences
+  /// in the target language. This is dependent on the script of the target language.
+  bool shouldEnsureSpaceBetweenSentences() const;
+
+  /// The current algorithm annotates text into sentences separated by gaps. The gaps between sentences
+  /// may be empty, they may contain whitespace, they may contain parsed HTML tags that existed between
+  /// the sentences, or a combination of whitespace and tags.
+  ///
+  /// For example, the following annotated text will contain the following three gaps:
+  ///
+  ///      "<b>This is my bold sentence.</b> This is my regular sentence."
+  ///    │ │                           │</b> │                           │ │
+  ///    └┬┘                           └──┬──┘                           └┬┘
+  /// Empty gap                Gap with tag and space                 Empty gap
+  ///
+  /// In cases where we are translating from a language that utilizes whitespace between sentences into
+  /// a language that does not utilize whitespace between sentences, we need to extract the tags from
+  /// the gap and omit the whitespace in the gap.
+  void maybeAppendHTMLTagsFromGap(string_view gap);
+
   string_view asStringView(const ByteRange &byteRange) const {
     return string_view(text.data() + byteRange.begin, byteRange.size());
   }

diff --git a/inference/src/translator/response_builder.cpp b/inference/src/translator/response_builder.cpp
@@ -51,7 +51,8 @@ void ResponseBuilder::buildTranslatedText(Histories &histories, Response &respon
         // constructed, append the text till the end, which could be spaces or
         // empty.
         if (sentenceIdx + 1 == histories.size()) {
-          response.target.appendEndingWhitespace(response.source.gap(sentenceIdx + 1));
+          string_view endingWhitespace = response.source.gap(sentenceIdx + 1);
+          response.target.handleEndingWhitespace(endingWhitespace, /* isBetweenSentences */ false);
         }
         break;
       }