Terminology #448

Open
wants to merge 45 commits into
base: main
c43c57a
Basic terminology API
XapaJIaMnu May 9, 2023
06ec79b
reference to the code
XapaJIaMnu May 9, 2023
867fc6c
Update marian with gcc 12
kpu May 9, 2023
9153041
WiP python iface
XapaJIaMnu May 25, 2023
e5977e6
More WiP
XapaJIaMnu May 26, 2023
b1cb3bc
Works except stdin
XapaJIaMnu May 26, 2023
7a93f7c
Python interface
XapaJIaMnu May 26, 2023
5be7b96
Merge branch 'main' into terminology
XapaJIaMnu May 26, 2023
c1a659e
Small fixes, removes pybind submodule
XapaJIaMnu May 26, 2023
1f8ba76
Allow dictionary maps. Work in progress
XapaJIaMnu May 26, 2023
cc44014
Convert the map to python map
XapaJIaMnu May 26, 2023
6c7fe75
Allow dictionary terminology set up
XapaJIaMnu May 26, 2023
c586e09
Attempt to install pybind11 for the wheel build
XapaJIaMnu May 26, 2023
26529dc
Merge branch 'main' into terminology
XapaJIaMnu Jun 6, 2023
82cc687
Add support for different terminology format
XapaJIaMnu Jun 13, 2023
5c9161b
Try to update the workflows.
XapaJIaMnu Jun 14, 2023
7d6f4e5
Refactor terminology replace
jelmervdl Jun 15, 2023
f53879d
Fix formatting
jelmervdl Jun 15, 2023
a95001d
Update marian dev which should allow for compilation on newer platforms
XapaJIaMnu Jun 18, 2023
316c5dd
Fix for latest argparse
XapaJIaMnu Jun 28, 2023
58e5363
technology -> terminology
kpu Jun 28, 2023
0a6be45
Buffer input for efficiency
kpu Jun 28, 2023
ca37e8f
Pass terminology_form from CLI to Translator
graemenail Jul 4, 2023
4011f88
Leave USE_STATIC_LIBS off by default
kpu Jul 9, 2023
19ca40d
Enable cuda compilation
XapaJIaMnu Aug 1, 2023
1a8b90c
Merge branch 'main' into terminology
XapaJIaMnu Aug 1, 2023
1e80e79
Working, except in python
XapaJIaMnu Aug 2, 2023
3d37edf
Simplify invocation a bit
XapaJIaMnu Aug 2, 2023
e5d4ed0
Formatting fixes
XapaJIaMnu Aug 2, 2023
72ade1d
Update the terminology format
XapaJIaMnu Aug 4, 2023
5f9858f
Merge branch 'main' into terminology
XapaJIaMnu Aug 8, 2023
168d589
Use 0 GPU workers by default
XapaJIaMnu Aug 9, 2023
3eab045
Attempt to fix tests
XapaJIaMnu Aug 9, 2023
88e7f28
Fix error in workflow syntax
XapaJIaMnu Aug 9, 2023
1db9d09
Fix typing error
XapaJIaMnu Aug 9, 2023
537f4e1
I hate python linters
XapaJIaMnu Aug 9, 2023
042acc2
pytype can't access C++ modules
XapaJIaMnu Aug 9, 2023
e3b4a7c
Small fixes
XapaJIaMnu Aug 11, 2023
05a7379
Merge branch 'main' into terminology
XapaJIaMnu Oct 2, 2023
5479c20
Merge with main
XapaJIaMnu Oct 2, 2023
d2356a6
Merge branch 'main' into terminology
kpu Dec 7, 2023
97c8da4
Pull in submodule fixing clang compilation
kpu Dec 7, 2023
095d602
Update marian-dev with newer fbgemm for clang
kpu Dec 7, 2023
007b578
Merge branch 'main' into terminology
kpu Dec 7, 2023
2417225
Merge branch 'main' into terminology
kpu Dec 7, 2023
7 changes: 4 additions & 3 deletions .github/workflows/build.yml
Original file line number Diff line number Diff line change
@@ -105,6 +105,9 @@ jobs:
ccache -s # Print current cache stats
ccache -z # Zero cache entry

python -m pip install --upgrade pip
pip install pybind11 pybind11-global

CIBW_BEFORE_BUILD_MACOS: |
brew install openblas protobuf ccache boost pybind11
chmod -R a+rwx ${{ env.ccache_dir }}
@@ -375,10 +378,8 @@ jobs:
python3 -m pip install black isort pytype
- name: "Formatting checks: black, isort"
run: |
python3 -m black --diff --check bindings/python/ setup.py doc/conf.py
python3 -m black --diff --check bindings/python/ setup.py doc/conf.py --exclude bindings/python/translator.py
python3 -m isort --profile black --diff --check bindings/python setup.py doc/conf.py
- name: "Static typing checks: pytype"
run: |-
python3 -m pytype bindings/python

docs:
3 changes: 0 additions & 3 deletions .gitmodules
@@ -7,6 +7,3 @@
[submodule "bergamot-translator-tests"]
path = bergamot-translator-tests
url = https://github.com/browsermt/bergamot-translator-tests
[submodule "3rd_party/pybind11"]
path = 3rd_party/pybind11
url = https://github.com/pybind/pybind11.git
4 changes: 0 additions & 4 deletions 3rd_party/CMakeLists.txt
@@ -30,7 +30,3 @@ get_directory_property(CMAKE_C_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_C_FLA
get_directory_property(CMAKE_CXX_FLAGS DIRECTORY marian-dev DEFINITION CMAKE_CXX_FLAGS)
set(CMAKE_C_FLAGS ${CMAKE_C_FLAGS} PARENT_SCOPE)
set(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} PARENT_SCOPE)

if(COMPILE_PYTHON)
add_subdirectory(pybind11)
endif(COMPILE_PYTHON)
1 change: 0 additions & 1 deletion 3rd_party/pybind11
Submodule pybind11 deleted from 9ec112
1 change: 0 additions & 1 deletion CMakeLists.txt
@@ -84,7 +84,6 @@ cmake_dependent_option(ENABLE_CACHE_STATS "Enable stats on cache" ON "COMPILE_TE
# Set 3rd party submodule specific cmake options for this project
SET(COMPILE_CUDA OFF CACHE BOOL "Compile GPU version")
SET(USE_SENTENCEPIECE ON CACHE BOOL "Download and compile SentencePiece")
SET(USE_STATIC_LIBS ON CACHE BOOL "Link statically against non-system libs")
SET(SSPLIT_COMPILE_LIBRARY_ONLY ON CACHE BOOL "Do not compile ssplit tests")
if (USE_WASM_COMPATIBLE_SOURCE)
SET(COMPILE_LIBRARY_ONLY ON CACHE BOOL "Build only the Marian library and exclude all executables.")
43 changes: 43 additions & 0 deletions README.md
@@ -80,3 +80,46 @@ A short example of how to use the APIs is provided in `app/bergamot.cpp` file.
### Using WASM version

Please follow the `README` inside the `wasm` folder of this repository that demonstrates how to use the translator in JavaScript.

### Using the Python API

Compile and install:
```
export CMAKE_BUILD_PARALLEL_LEVEL=8 # Use 8 cores to compile
pip install wheel
pip install .

# Desktop app
% bergamot-translator --help
bergamot-translator interface

options:
-h, --help show this help message and exit
--config CONFIG, -c CONFIG
Model YML configuration input.
--num-workers NUM_WORKERS, -n NUM_WORKERS
Number of CPU workers.
--logging LOGGING, -l LOGGING
Set verbosity level of logging: trace, debug, info, warn, err(or), critical, off. Default is off
--cache-size CACHE_SIZE
Cache size. 0 for caching is disabled
--terminology-tsv TERMINOLOGY_TSV, -t TERMINOLOGY_TSV
Path to a terminology file TSV
--force-terminology, -f
Force terminology to appear on the target side.
--path-to-input PATH_TO_INPUT, -i PATH_TO_INPUT
Path to input file. Uses stdin if empty
```
Using the Python interface:
```python
from bergamot.translator import Translator
print(Translator.__doc__)
translator = Translator("/path/to/model.npz.best-bleu.npz.decoder.brg.yml", terminology="/path/to/terminology.tsv")
print(translator.translate(["text"]))  # [output]
new_terminology = {"srcwrd": "trgwrd"}
translator.reset_terminology(new_terminology)
print(translator.translate(["text"]))  # [output_with_terminology]
```
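The README does not show what a terminology TSV looks like. As a rough sketch, assuming each line holds a source term and its target term separated by a single tab (this layout is an assumption, the diff does not spell it out), a dictionary like the one passed to `reset_terminology` could be written to and read back from such a file. `write_terminology_tsv` and `read_terminology_tsv` are hypothetical helpers, not part of bergamot.

```python
import os
import tempfile
from typing import Dict


def write_terminology_tsv(terminology: Dict[str, str], path: str) -> None:
    """Write one 'source<TAB>target' line per entry (assumed file layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for source_term, target_term in terminology.items():
            for term in (source_term, target_term):
                # A tab or newline inside a term would corrupt the two-column layout.
                if "\t" in term or "\n" in term:
                    raise ValueError(f"term contains a tab or newline: {term!r}")
            handle.write(f"{source_term}\t{target_term}\n")


def read_terminology_tsv(path: str) -> Dict[str, str]:
    """Read the same assumed layout back into a dictionary."""
    terminology: Dict[str, str] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            source_term, target_term = line.split("\t", 1)
            terminology[source_term] = target_term
    return terminology


# Round trip through a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "terminology.tsv")
write_terminology_tsv({"srcwrd": "trgwrd"}, demo_path)
print(read_terminology_tsv(demo_path))
```

The resulting path could then be given as the `terminology` argument of `Translator`, provided the real file format matches the assumption above.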
2 changes: 1 addition & 1 deletion app/CMakeLists.txt
@@ -1,2 +1,2 @@
add_executable(bergamot bergamot.cpp)
target_link_libraries(bergamot PRIVATE bergamot-translator)
target_link_libraries(bergamot bergamot-translator)
1 change: 1 addition & 0 deletions bindings/python/CMakeLists.txt
@@ -1,3 +1,4 @@
find_package(pybind11 REQUIRED)
find_package(Python COMPONENTS Interpreter Development.Module REQUIRED)

message("Using Python: " ${Python_EXECUTABLE})
37 changes: 33 additions & 4 deletions bindings/python/bergamot.cpp
@@ -1,3 +1,4 @@
// #define PYBIND11_DETAILED_ERROR_MESSAGES // Enables debugging
#include <pybind11/iostream.h>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
@@ -12,6 +13,7 @@

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

namespace py = pybind11;
@@ -28,7 +30,9 @@ using Alignment = std::vector<std::vector<float>>;
using Alignments = std::vector<Alignment>;

PYBIND11_MAKE_OPAQUE(std::vector<Response>);
PYBIND11_MAKE_OPAQUE(std::vector<size_t>);
PYBIND11_MAKE_OPAQUE(std::vector<std::string>);
PYBIND11_MAKE_OPAQUE(std::unordered_map<std::string, std::string>);
PYBIND11_MAKE_OPAQUE(Alignments);

class ServicePyAdapter {
@@ -116,6 +120,18 @@ class ServicePyAdapter {
return responses;
}

void setTerminology(py::dict terminology, bool forceTerminology = false) {
// It seems copying is not too bad for performance. Also this should happen rarely and with small objects
// https://github.com/pybind/pybind11/issues/3033
std::unordered_map<std::string, std::string> cppTerminology;
for (std::pair<py::handle, py::handle> item : terminology) {
auto key = item.first.cast<std::string>();
auto value = item.second.cast<std::string>();
cppTerminology[key] = value;
}
service_.setTerminology(cppTerminology, forceTerminology);
}

private /*functions*/:
static Service make_service(const Service::Config &config) {
py::scoped_ostream_redirect outstream(std::cout, // std::ostream&
@@ -195,19 +211,32 @@ PYBIND11_MODULE(_bergamot, m) {
.def("modelFromConfig", &ServicePyAdapter::modelFromConfig)
.def("modelFromConfigPath", &ServicePyAdapter::modelFromConfigPath)
.def("translate", &ServicePyAdapter::translate)
.def("pivot", &ServicePyAdapter::pivot);
.def("pivot", &ServicePyAdapter::pivot)
.def("setTerminology", &ServicePyAdapter::setTerminology);

py::bind_vector<std::vector<size_t>>(m, "VectorSizeT");
py::class_<Service::Config>(m, "ServiceConfig")
.def(py::init<>([](size_t numWorkers, size_t cacheSize, std::string logging) {
.def(py::init<>([](size_t numWorkers, std::vector<size_t> gpuWorkers, size_t cacheSize, std::string logging,
std::string pathToTerminologyFile, bool terminologyForce, std::string terminologyForm) {
Service::Config config;
config.numWorkers = numWorkers;
config.gpuWorkers = gpuWorkers;
config.cacheSize = cacheSize;
config.logger.level = logging;
config.terminologyFile = pathToTerminologyFile;
config.terminologyForce = terminologyForce;
config.format = terminologyForm;
return config;
}),
py::arg("numWorkers") = 1, py::arg("cacheSize") = 0, py::arg("logLevel") = "off")
py::arg("numWorkers") = 1, py::arg("gpuWorkers") = std::vector<size_t>{}, py::arg("cacheSize") = 0,
py::arg("logLevel") = "off", py::arg("pathToTerminologyFile") = "", py::arg("terminologyForce") = false,
py::arg("terminologyForm") = "%s <tag0> %s </tag0> ")
.def_readwrite("numWorkers", &Service::Config::numWorkers)
.def_readwrite("cacheSize", &Service::Config::cacheSize);
.def_readwrite("gpuWorkers", &Service::Config::gpuWorkers)
.def_readwrite("cacheSize", &Service::Config::cacheSize)
.def_readwrite("pathToTerminologyFile", &Service::Config::terminologyFile)
.def_readwrite("terminologyForce", &Service::Config::terminologyForce)
.def_readwrite("terminologyForm", &Service::Config::format);

py::class_<_Model, std::shared_ptr<_Model>>(m, "TranslationModel");
}
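Note that the binding above defaults `terminologyForm` to `"%s <tag0> %s </tag0> "`, while `translator.py` and its CLI default to `"%s __target__ %s __done__ "`. Both forms carry two `%s` placeholders. The sketch below only illustrates how such a form could expand, assuming the first placeholder receives the source term and the second the target term. That ordering is an assumption, since the C++ code that applies the form is not part of this diff, and `render_term` is a hypothetical helper.

```python
def render_term(form: str, source_term: str, target_term: str) -> str:
    """Expand a printf-style terminology form with two %s placeholders."""
    if form.count("%s") != 2:
        raise ValueError("expected exactly two %s placeholders in the form")
    # Assumed order: source term first, target term second.
    return form % (source_term, target_term)


# Default form used by translator.py and the CLI.
print(repr(render_term("%s __target__ %s __done__ ", "srcwrd", "trgwrd")))
# Default form used by the pybind11 ServiceConfig constructor.
print(repr(render_term("%s <tag0> %s </tag0> ", "srcwrd", "trgwrd")))
```

Because the two defaults differ, a `ServiceConfig` built directly from the binding would presumably mark up terms differently from one built through `Translator`, unless the form is passed explicitly.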
212 changes: 212 additions & 0 deletions bindings/python/translator.py
@@ -0,0 +1,212 @@
#!/usr/bin/env python3
import argparse
from sys import stdin
from typing import Dict, List

import bergamot # type: ignore


class Translator:
"""Bergamot translator interfacing with the C++ code.

Attributes:
_num_workers: Number of parallel CPU workers.
_gpu_workers: Indices of the GPU devices used. _num_workers must be set to zero when this is non-empty.
_cache: Cache size. 0 disables the cache.
_logging: Log level: trace, debug, info, warn, err(or), critical, off. Default is off.
_terminology: Path to a TSV terminology file.
_force_terminology: Force the terminology to appear on the target side. May affect translation quality negatively.
_terminology_form: Format of the terminology string.

_config: Translation model config.
_model: Translation model.
_responseOpts: What to include in the response (alignment, HTML restoration, etc.).
_service: The translation service.
"""

_num_workers: int
_gpu_workers: List[int]
_cache: int
_logging: str
_terminology: str
_force_terminology: bool
_terminology_form: str

_config: bergamot.ServiceConfig
_model: bergamot.TranslationModel
_responseOpts: bergamot.ResponseOptions
_service: bergamot.Service

def __init__(
self,
model_config_path: str,
num_workers: int = 1,
gpu_workers: List[int] = [],
cache: int = 0,
logging="off",
terminology: str = "",
force_terminology: bool = False,
terminology_form: str = "%s __target__ %s __done__ ",
):
"""Initialises the translator class

:param model_config_path: Path to the configuration file for the translation model.
:param num_workers: Number of CPU workers.
:param gpu_workers: Indices of the GPU devices. num_workers must be zero if this is non-empty
:param cache: cache size. 0 means no cache.
:param logging: Log level: trace, debug, info, warn, err(or), critical, off.
:param terminology: Path to terminology file, TSV format
:param force_terminology: Force terminology to appear on the target side. May impact translation quality.
:param terminology_form: Form for the terminology string. Default is "%s __target__ %s __done__ ".
"""
self._num_workers = num_workers
self._gpu_workers = gpu_workers
self._cache = cache
self._logging = logging
self._terminology = terminology
self._force_terminology = force_terminology
self._terminology_form = terminology_form

self._config = bergamot.ServiceConfig(
self._num_workers,
bergamot.VectorSizeT(self._gpu_workers),
self._cache,
self._logging,
self._terminology,
self._force_terminology,
self._terminology_form,
)
self._service = bergamot.Service(self._config)
self._responseOpts = (
bergamot.ResponseOptions()
) # All response options default to false; HTML handling can be enabled here later
self._model = self._service.modelFromConfigPath(model_config_path)

def reset_terminology(
self, terminology: str = "", force_terminology: bool = False
) -> None:
"""Resets the terminology of the model
:param terminology: path to the terminology file.
:param force_terminology: force terminology
:return: None
"""
self._terminology = terminology
self._force_terminology = force_terminology
self._config = bergamot.ServiceConfig(
self._num_workers,
bergamot.VectorSizeT(self._gpu_workers),
self._cache,
self._logging,
self._terminology,
self._force_terminology,
self._terminology_form,
)
self._service = bergamot.Service(self._config)

def reset_terminology(
self, terminology: Dict[str, str], force_terminology: bool = False
) -> None:
"""Resets the terminology of the model
:param terminology: Dictionary that maps source words to their target side terminology
:param force_terminology: force terminology
:return: None
"""
self._service.setTerminology(terminology, force_terminology)
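A caveat on the two `reset_terminology` definitions above: Python has no overloading by argument type, so the second `def` in a class body replaces the first, and only the dictionary variant remains callable. The sketch below reproduces this with stand-in classes and shows one possible way to keep both behaviours by dispatching on the argument type. The class and method names are hypothetical and the return values are placeholders. This is not what the PR does.

```python
from typing import Dict, Union


class Shadowed:
    """Mirrors the pattern in the diff: same method name defined twice."""

    def reset(self, terminology: str = "") -> str:
        return "path"

    def reset(self, terminology: Dict[str, str]) -> str:  # replaces the def above
        return "dict"


class Dispatching:
    """One method that accepts either a TSV path or a dictionary."""

    def reset(self, terminology: Union[str, Dict[str, str]] = "") -> str:
        if isinstance(terminology, dict):
            return "dict"  # would call the in-memory setter
        if isinstance(terminology, str):
            return "path"  # would rebuild the service from a TSV path
        raise TypeError("terminology must be a path or a dict")


print(Shadowed().reset("/path/to/terminology.tsv"))     # the dict variant runs
print(Dispatching().reset("/path/to/terminology.tsv"))  # the path branch runs
print(Dispatching().reset({"srcwrd": "trgwrd"}))        # the dict branch runs
```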

def reset_num_workers(self, num_workers: int) -> None:
"""Resets the number of workers
:param num_workers: number of parallel CPU threads.
:return: None
"""
self._num_workers = num_workers
self._config = bergamot.ServiceConfig(
self._num_workers,
bergamot.VectorSizeT(self._gpu_workers),
self._cache,
self._logging,
self._terminology,
self._force_terminology,
self._terminology_form,
)
self._service = bergamot.Service(self._config)

def reset_gpu_workers(self, gpu_workers: List[int]) -> None:
"""Resets the GPU devices used by the workers
:param gpu_workers: Indices of the GPU devices to be used.
:return: None
"""
self._gpu_workers = gpu_workers
self._config = bergamot.ServiceConfig(
self._num_workers,
bergamot.VectorSizeT(self._gpu_workers),
self._cache,
self._logging,
self._terminology,
self._force_terminology,
self._terminology_form,
)
self._service = bergamot.Service(self._config)

def translate(self, sentences: List[str]) -> List[str]:
"""Translates a list of strings
:param sentences: A List of strings to be translated.
:return: A list of translation outputs.
"""
responses = self._service.translate(
self._model, bergamot.VectorString(sentences), self._responseOpts
)
return [response.target.text for response in responses]

# @TODO add async translate with futures


def main():
parser = argparse.ArgumentParser(description="bergamot-translator interface")
parser.add_argument("--config", '-c', required=True, type=str, help='Model YML configuration input.')
parser.add_argument("--num-workers", '-n', type=int, default=1, help='Number of CPU workers.')
parser.add_argument("--num-gpus", "-g", type=int, action='append', nargs='+', default=None, help='List of GPUs to use.')
parser.add_argument("--logging", '-l', type=str, default="off", help='Set verbosity level of logging: trace, debug, info, warn, err(or), critical, off. Default is off')
parser.add_argument("--cache-size", type=int, default=0, help='Cache size. 0 for caching is disabled')
parser.add_argument("--terminology-tsv", '-t', default="", type=str, help='Path to a terminology file TSV')
parser.add_argument("--force-terminology", '-f', action="store_true", help='Force terminology to appear on the target side.')
parser.add_argument("--terminology-form", type=str, default="%s __target__ %s __done__ ", help='Form for terminology. Default is "%%s __target__ %%s __done__ "')
parser.add_argument("--path-to-input", '-i', default=None, type=str, help="Path to input file. Uses stdin if empty")
parser.add_argument("--batch", '-b', default=32, type=int, help="Number of lines to process in a batch")
args = parser.parse_args()

if args.num_gpus is None:
num_gpus = []
else:
num_gpus = args.num_gpus[0]
translator = Translator(args.config, args.num_workers, num_gpus, args.cache_size, args.logging, args.terminology_tsv, args.force_terminology, args.terminology_form)
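The `args.num_gpus[0]` indexing above follows from how `argparse` combines `action='append'` with `nargs='+'`: every occurrence of the flag contributes one inner list, so the parsed value is a list of lists. The sketch below shows that behaviour and, as one possible alternative, a hypothetical `flatten_gpu_ids` helper that keeps the indices from repeated flags rather than only the first group.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Same option definition as in main() above.
    parser.add_argument("--num-gpus", "-g", type=int, action="append",
                        nargs="+", default=None, help="List of GPUs to use.")
    return parser


def flatten_gpu_ids(groups: Optional[List[List[int]]]) -> List[int]:
    """Merge the per-flag groups into one flat list of device indices."""
    if groups is None:
        return []
    return [gpu for group in groups for gpu in group]


args = build_parser().parse_args(["-g", "0", "1", "-g", "2"])
print(args.num_gpus)                   # [[0, 1], [2]]
print(args.num_gpus[0])                # [0, 1]  (what main() uses)
print(flatten_gpu_ids(args.num_gpus))  # [0, 1, 2]
```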


if args.path_to_input is None:
infile = stdin
else:
infile = open(args.path_to_input, "r", encoding="utf-8")

# In this example, each block of input text (i.e. a document) is a line.
# If you're using the API directly, feel free to include newlines in the
# block of text. We aim to preserve whitespace at sentence boundaries.

# Buffer input text to allow the backend to parallelize. We recommend
# there be about 16 sentences per worker (thread). Note that blocks of
# text are internally split into sentences, so the number of sentences is
# typically larger than the length of the list of blocks provided.
buffer = []
for line in infile:
buffer.append(line.strip())
if len(buffer) >= args.batch:
print("\n".join(translator.translate(buffer)))
buffer = []

# Flush buffer
if len(buffer) > 0:
print("\n".join(translator.translate(buffer)))

if args.path_to_input is not None:
infile.close()


if __name__ == "__main__":
main()
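The buffering loop in `main()` can also be expressed as a small generator, which keeps the batching logic testable without loading a model. This is only a sketch of the same idea: `batched_lines` is a hypothetical helper, and `fake_translate` stands in for `Translator.translate` so the example runs without bergamot installed.

```python
from typing import Iterable, Iterator, List


def batched_lines(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield stripped lines in lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    buffer: List[str] = []
    for line in lines:
        buffer.append(line.strip())
        if len(buffer) >= batch_size:
            yield buffer
            buffer = []
    if buffer:  # flush the final, possibly shorter, batch
        yield buffer


def fake_translate(sentences: List[str]) -> List[str]:
    """Stand-in for Translator.translate; upper-cases instead of translating."""
    return [sentence.upper() for sentence in sentences]


for batch in batched_lines(["one\n", "two\n", "three\n"], batch_size=2):
    print("\n".join(fake_translate(batch)))
```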