Terminology #448

XapaJIaMnu · 2023-05-26T11:45:21Z

Basic Python interface + installation instructions

jelmervdl · 2023-05-26T12:20:04Z

Wouldn't it be much simpler to make the terminology map a member of the ResponseOptions class, and map it to a dict? Then you can change it per request, and you need less plumbing and no file IO for passing on that info.

I don't know if ResponseOptions is an immutable struct at this point. I don't think we have an async API on the python side so there's no risk of Python changing the terminology during a translation. But if you do want to implement that, its something to take into account.

README.md

jelmervdl · 2023-05-26T12:26:20Z

src/translator/service.cpp

+        std::string srcword;
+        std::string replacementword;
+        getline(ss, srcword, '\t');
+        getline(ss, replacementword, '\n');


Make sure people don't use windows line endings, hehe. Incidentally, if we could do this terminology map loading in Python you'd get their line ending strippping code for free.

src/translator/service.cpp

jelmervdl · 2023-05-26T12:43:23Z

src/translator/service.cpp

+        getline(ss, replacementword, '\n');
+        // @TODO it seems like removing the tags forces the model to copy which is
+        // I guess just as good and more reliable. In that case we just don't tell the model
+        // what the original source is and it just has no choice BUT to generate the target.


Edit: ah you say it here explicitly. Copy is a copy with the assumption it won't try to translate because it doesn't know the translation. For Chinese <-> English I can imagine this working, but no way that English <-> French would accept something like that… right?

Typically the model just learns to copy when there's accidentally target text on the source side. I expected it it would work just fine for English <-> french. Would be a problem with multilingual models, potentially.

bindings/python/translator.py

3rd_party/CMakeLists.txt

- Move it to its own file - Prefer replacing longer sequences above shorter ones - Don't replace terminology found in the just replaced bits

jelmervdl · 2023-06-15T10:37:51Z

Still on my wish list, but not a priority for a first release:

Move terminology config to the ResponseOptions struct. This allows it to be configured per request, and matches other options like alignment and HTML. I still want to do the same with the Cache as well some day. I briefly looked into how hard this would be to do, but the way the C++ demo app parses command line options doesn't help.
Do terminology replacement after HTML parsing. Right now it happens before, which could invalidate the HTML and will mess up things if HTML-like tags are used for terminology. I looked at moving it to after HTML parsing, but that needs a bit more effort. The HTML parsing tracks byte offsets of the text it outputs, and expects the text that is translated (and for which we get alignment info after translation) is exactly the same as the text it produced. If any terminology replacements are added after HTML, but before translation, this doesn't hold. There is a data structure, AnnotatedText, that might be extended to do this task of tracking updated mappings. Using something like this was always the plan for HTML, and having a more expendable more onion-like structure, but we never got around to refactoring it.
Add pre/post processing that makes sure the terminology formatting cannot occur in the input string by replacing it with a placeholder, or doing something like the HTML transfer to copy it to the output.

XapaJIaMnu · 2023-06-15T11:08:51Z

Before we push to main i'd like to fix some of the CI (as building the wheel is broken atm, onnxgemm requires some updating etc).
@kpu how does the interface look to you?

jelmervdl · 2023-07-10T17:18:58Z

Todo: instead of a --terminology-form cli option, it should go in the model yaml.

graemenail · 2023-07-10T17:25:19Z

Todo: instead of a --terminology-form cli option, it should go in the model yaml.

Should we nest application specific augments e.g.

bergamot-translator-app:
  terminology: terms.tsv
  terminology-form: "__source__ {src} __target {trg} __done__"

to avoid any future name conflicts with upstream Marian

jelmervdl · 2023-07-10T17:28:04Z

Don't know if Marian's yaml parser can handle that, but if we can do that, it might be a good future-proofing.

Ideally there would be no collisions because this would be the marian implementation but maybe that's slightly unrealistic.

graemenail · 2023-07-10T18:11:21Z

I would guess the parser library supports it, but from a quick look it's not wired up on the Marian side. But ideally Marian wouldn't need to parse it, just ignore it since terminology (for now?) is just implemented in bergamot-translator.

But on the ignoring unknown args:

./marian -c tmp.yml 
[2023-07-10 17:59:32] Error: There are option(s) in a config file that are not expected: bergamot-translator-app

(and similarly directly on the cli)

XapaJIaMnu · 2023-08-02T00:46:50Z

Now GPU support is working. Reliably from the C++ test app, not so much from python.

The python interface randomly deadlocks, and I have no idea why. I didn't have this problem before so I blame the ensemble change, but i haven't bisected, or anything. FML.

The deadlocks happen even when compiling without CUDA compiled in. I guess I should be attaching GDB...

XapaJIaMnu · 2023-08-02T01:04:53Z

This is where we are at:

0x00007f2a3dd192ed in syscall () from /usr/lib/libc.so.6
(gdb) up
#1  0x00007f2a1a8de76a in std::__atomic_futex_unsigned_base::_M_futex_wait_until (this=<optimized out>, __addr=0x55d7e9141070, __val=2147483648, __has_timeout=<optimized out>, __s=..., __ns=...)
    at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/futex.cc:122
Downloading source file /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/futex.cc
122     /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/futex.cc: Directory not empty.                                                                                                                                    
(gdb) up
#2  0x00007f2a384e4b71 in std::__atomic_futex_unsigned<2147483648u>::_M_load_and_test_until (this=0x55d7e9141070, __assumed=0, __operand=1, __equal=true, __mo=std::memory_order_acquire, __has_timeout=false, 
    __s=std::chrono::duration = { 0s }, __ns=std::chrono::duration = { 0ns }) at /usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/bits/atomic_futex.h:109
109               bool __ret = _M_futex_wait_until((unsigned*)(void*)&_M_data,
(gdb) up
#3  0x00007f2a384cb2ba in std::__atomic_futex_unsigned<2147483648u>::_M_load_and_test (this=0x55d7e9141070, __assumed=0, __operand=1, __equal=true, __mo=std::memory_order_acquire)
    at /usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/bits/atomic_futex.h:158
158           return _M_load_and_test_until(__assumed, __operand, __equal, __mo,
(gdb) up
#4  0x00007f2a384a1ef8 in std::__atomic_futex_unsigned<2147483648u>::_M_load_when_equal (__mo=std::memory_order_acquire, __val=1, this=0x55d7e9141070)
    at /usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/bits/atomic_futex.h:212
212           _M_load_and_test(__i, __val, true, __mo);
(gdb) up
#5  std::__future_base::_State_baseV2::wait (this=0x55d7e9141060) at /usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/future:351
351             _M_status._M_load_when_equal(_Status::__ready, memory_order_acquire);
(gdb) up
#6  0x00007f2a384b4204 in std::__basic_future<marian::bergamot::Response>::wait (this=0x55d7e8790800) at /usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/future:714
714             _M_state->wait();
(gdb) up
#7  0x00007f2a384a31b6 in ServicePyAdapter::translate (this=0x55d7e877a1e0, model=std::shared_ptr<marian::bergamot::TranslationModel> (use count 5, weak count 0) = {...}, 
    inputs=std::vector of length 10, capacity 10 = {...}, options=...) at /home/dheart/uni_stuff/postdoc/bergamot-translator/bindings/python/bergamot.cpp:82
82            futures[i].wait();

The code in question is here:

bergamot-translator/bindings/python/bergamot.cpp

Line 116 in 1e80e79

futures[i].wait();

What's happening? For some reason a future just takes forever while no computation is happening...

XapaJIaMnu · 2023-08-02T09:05:20Z

Reverting: git revert 4b0da8d434e5a688139255873afd177f647ef777 fixes the issue. @graemenail not sure why.
Blaming: 4b0da8d
On the bright side, python gpu works on a 3090 with a modern CUDA.

XapaJIaMnu · 2023-08-02T10:56:44Z

How to compile the python interface with CUDA support:

export COMPILE_CUDA=ON 
export CMAKE_BUILD_PARALLEL_LEVEL=16
pip install .

Invoke as:

bergamot-translator -c model.npz.best-bleu.npz.decoder.brg.yml -i dataset -n 0 -g 0 -l info

bindings/python/translator.py

jelmervdl · 2023-08-09T21:26:37Z

src/translator/service.cpp

+      std::string srcword;
+      std::string replacementword;
+      getline(ss, srcword, '\t');
+      getline(ss, replacementword, '\n');  // BEWARE of windows file ndings


This is why I would have liked it if we could have kept the "read the tsv file" part out of the bergamot C++ codebase, and just implement it in Python/Qt/JavaScript which are much better at integrating with the local environment.

… But treat this complaint so I'll stop complaining about this from now on!

Well, functionally, we can bypass the redundant terminology implementation, by applying terminology constraints directly. Maybe we can add this to the readme. I would like to keep a C++ implementation at the very least as documentation.

src/translator/service.cpp

jelmervdl · 2023-08-09T21:45:20Z

src/translator/service.cpp

@@ -202,12 +283,14 @@ void AsyncService::pivot(std::shared_ptr<TranslationModel> first, std::shared_pt
 void AsyncService::translate(std::shared_ptr<TranslationModel> translationModel, std::string &&source,
                             CallbackType callback, const ResponseOptions &responseOptions) {
  // Producer thread, a call to this function adds new work items. If batches are available, notifies workers waiting.
+  // Tagging
+  if (!terminologyMap_.empty()) source = ReplaceTerminology(std::move(source), terminologyMap_);


std::string ReplaceTerminology(std::string const &str, TerminologyMap const &terminology);. I think I made the std::move pointless when I changed the ReplaceTerminilogy implementation.

jelmervdl · 2023-08-09T21:46:27Z

src/translator/service.h

+      app.add_option("--terminology-form", config.format,
+                     "Form for technology. Default is \"%s __target__ %s __done__ \". Change depending on the model.");


Do we want to figure out how to store this in the model yaml instead before we merge?

Pedantically, we probably should.

jelmervdl · 2023-08-09T21:50:49Z

src/translator/terminology.cpp

+
+}  // namespace
+
+std::string ReplaceTerminology(std::string const &str, TerminologyMap const &terminology) {


Once TerminologyMap becomes very large, this method is going to be stupidly slow. At that point we should change it into a state machine type of search thingy.

If terminology map is very large, i feel something wrong is happening.
State machine would always be better but also I feel this would be a time sink... ;D

Update marian-dev

XapaJIaMnu and others added 7 commits May 9, 2023 13:29

Basic terminology API

c43c57a

reference to the code

06ec79b

Update marian with gcc 12

867fc6c

WiP python iface

9153041

More WiP

e5977e6

Works except stdin

b1cb3bc

Python interface

7a93f7c

XapaJIaMnu requested review from jelmervdl and kpu May 26, 2023 11:45

jelmervdl reviewed May 26, 2023

View reviewed changes

XapaJIaMnu and others added 11 commits May 26, 2023 15:00

Merge branch 'main' into terminology

5be7b96

Small fixes, removes pybind submodule

c1a659e

Allow dictionary maps. Work in progress

1f8ba76

Convert the map to python map

cc44014

Allow dictionary terminology set up

6c7fe75

Attempt to install pybind11 for the wheel build

c586e09

Merge branch 'main' into terminology

26529dc

Add support for different terminology format

82cc687

Try to update the workflows.

5c9161b

Refactor terminology replace

7d6f4e5

- Move it to its own file - Prefer replacing longer sequences above shorter ones - Don't replace terminology found in the just replaced bits

Fix formatting

f53879d

jelmervdl force-pushed the terminology branch from 40e4c1c to f53879d Compare June 15, 2023 10:20

XapaJIaMnu and others added 5 commits June 18, 2023 15:32

Update marian dev which should allow for compilation on newer platforms

a95001d

Fix for latest argparse

316c5dd

technology -> terminology

58e5363

Buffer input for efficiency

0a6be45

Pass terminology_form from CLI to Translator

ca37e8f

XapaJIaMnu added 3 commits August 1, 2023 19:55

Enable cuda compilation

19ca40d

Merge branch 'main' into terminology

1a8b90c

Working, except in python

1e80e79

XapaJIaMnu added 10 commits August 2, 2023 12:09

Simplify invocation a bit

3d37edf

Formatting fixes

e5d4ed0

Update the terminology format

72ade1d

Merge branch 'main' into terminology

5f9858f

Use 0 GPU workers by default

168d589

Attempt to fix tests

3eab045

Fix error in workflow syntax

88e7f28

Fix typing error

1db9d09

I hate python linters

537f4e1

pytype can't access C++ modules

042acc2

jelmervdl reviewed Aug 9, 2023

View reviewed changes

XapaJIaMnu and others added 8 commits August 11, 2023 15:18

Small fixes

e3b4a7c

Merge branch 'main' into terminology

05a7379

Merge with main

5479c20

Merge branch 'main' into terminology

d2356a6

Update marian-dev

Pull in submodule fixing clang compilation

97c8da4

Update marian-dev with newer fbgemm for clang

095d602

Merge branch 'main' into terminology

007b578

Merge branch 'main' into terminology

2417225

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Terminology #448

Terminology #448

XapaJIaMnu commented May 26, 2023

jelmervdl commented May 26, 2023

jelmervdl May 26, 2023

jelmervdl May 26, 2023

XapaJIaMnu May 26, 2023

jelmervdl commented Jun 15, 2023 •

edited

Loading

XapaJIaMnu commented Jun 15, 2023

jelmervdl commented Jul 10, 2023

graemenail commented Jul 10, 2023

jelmervdl commented Jul 10, 2023

graemenail commented Jul 10, 2023

XapaJIaMnu commented Aug 2, 2023 •

edited

Loading

XapaJIaMnu commented Aug 2, 2023

XapaJIaMnu commented Aug 2, 2023 •

edited

Loading

XapaJIaMnu commented Aug 2, 2023 •

edited

Loading

jelmervdl Aug 9, 2023

XapaJIaMnu Aug 11, 2023

jelmervdl Aug 9, 2023

jelmervdl Aug 9, 2023

XapaJIaMnu Aug 11, 2023

jelmervdl Aug 9, 2023

XapaJIaMnu Aug 11, 2023 •

edited

Loading

		app.add_option("--terminology-form", config.format,
		"Form for technology. Default is \"%s __target__ %s __done__ \". Change depending on the model.");


		} // namespace

		std::string ReplaceTerminology(std::string const &str, TerminologyMap const &terminology) {

Terminology #448

Are you sure you want to change the base?

Terminology #448

Conversation

XapaJIaMnu commented May 26, 2023

jelmervdl commented May 26, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jelmervdl commented Jun 15, 2023 • edited Loading

XapaJIaMnu commented Jun 15, 2023

jelmervdl commented Jul 10, 2023

graemenail commented Jul 10, 2023

jelmervdl commented Jul 10, 2023

graemenail commented Jul 10, 2023

XapaJIaMnu commented Aug 2, 2023 • edited Loading

XapaJIaMnu commented Aug 2, 2023

XapaJIaMnu commented Aug 2, 2023 • edited Loading

XapaJIaMnu commented Aug 2, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

XapaJIaMnu Aug 11, 2023 • edited Loading

Choose a reason for hiding this comment

jelmervdl commented Jun 15, 2023 •

edited

Loading

XapaJIaMnu commented Aug 2, 2023 •

edited

Loading

XapaJIaMnu commented Aug 2, 2023 •

edited

Loading

XapaJIaMnu commented Aug 2, 2023 •

edited

Loading

XapaJIaMnu Aug 11, 2023 •

edited

Loading