Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terminology #448

Open
wants to merge 45 commits into
base: main
Choose a base branch
from
Open

Terminology #448

wants to merge 45 commits into from

Conversation

XapaJIaMnu
Copy link
Collaborator

Basic Python interface + installation instructions

@XapaJIaMnu XapaJIaMnu requested review from jelmervdl and kpu May 26, 2023 11:45
@jelmervdl
Copy link
Member

Wouldn't it be much simpler to make the terminology map a member of the ResponseOptions class, and map it to a dict? Then you can change it per request, and you need less plumbing and no file IO for passing on that info.

I don't know if ResponseOptions is an immutable struct at this point. I don't think we have an async API on the python side so there's no risk of Python changing the terminology during a translation. But if you do want to implement that, its something to take into account.

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
std::string srcword;
std::string replacementword;
getline(ss, srcword, '\t');
getline(ss, replacementword, '\n');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure people don't use windows line endings, hehe. Incidentally, if we could do this terminology map loading in Python you'd get their line ending strippping code for free.

src/translator/service.cpp Outdated Show resolved Hide resolved
src/translator/service.cpp Outdated Show resolved Hide resolved
getline(ss, replacementword, '\n');
// @TODO it seems like removing the tags forces the model to copy which is
// I guess just as good and more reliable. In that case we just don't tell the model
// what the original source is and it just has no choice BUT to generate the target.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: ah you say it here explicitly. Copy is a copy with the assumption it won't try to translate because it doesn't know the translation. For Chinese <-> English I can imagine this working, but no way that English <-> French would accept something like that… right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically the model just learns to copy when there's accidentally target text on the source side. I expected it it would work just fine for English <-> french. Would be a problem with multilingual models, potentially.

bindings/python/translator.py Outdated Show resolved Hide resolved
3rd_party/CMakeLists.txt Outdated Show resolved Hide resolved
@jelmervdl
Copy link
Member

jelmervdl commented Jun 15, 2023

Still on my wish list, but not a priority for a first release:

  1. Move terminology config to the ResponseOptions struct. This allows it to be configured per request, and matches other options like alignment and HTML. I still want to do the same with the Cache as well some day. I briefly looked into how hard this would be to do, but the way the C++ demo app parses command line options doesn't help.
  2. Do terminology replacement after HTML parsing. Right now it happens before, which could invalidate the HTML and will mess up things if HTML-like tags are used for terminology. I looked at moving it to after HTML parsing, but that needs a bit more effort. The HTML parsing tracks byte offsets of the text it outputs, and expects the text that is translated (and for which we get alignment info after translation) is exactly the same as the text it produced. If any terminology replacements are added after HTML, but before translation, this doesn't hold. There is a data structure, AnnotatedText, that might be extended to do this task of tracking updated mappings. Using something like this was always the plan for HTML, and having a more expendable more onion-like structure, but we never got around to refactoring it.
  3. Add pre/post processing that makes sure the terminology formatting cannot occur in the input string by replacing it with a placeholder, or doing something like the HTML transfer to copy it to the output.

@XapaJIaMnu
Copy link
Collaborator Author

Before we push to main i'd like to fix some of the CI (as building the wheel is broken atm, onnxgemm requires some updating etc).
@kpu how does the interface look to you?

@jelmervdl
Copy link
Member

Todo: instead of a --terminology-form cli option, it should go in the model yaml.

@graemenail
Copy link
Member

Todo: instead of a --terminology-form cli option, it should go in the model yaml.

Should we nest application specific augments e.g.

bergamot-translator-app:
  terminology: terms.tsv
  terminology-form: "__source__ {src} __target {trg} __done__"

to avoid any future name conflicts with upstream Marian

@jelmervdl
Copy link
Member

Don't know if Marian's yaml parser can handle that, but if we can do that, it might be a good future-proofing.

Ideally there would be no collisions because this would be the marian implementation but maybe that's slightly unrealistic.

@graemenail
Copy link
Member

I would guess the parser library supports it, but from a quick look it's not wired up on the Marian side. But ideally Marian wouldn't need to parse it, just ignore it since terminology (for now?) is just implemented in bergamot-translator.

But on the ignoring unknown args:

./marian -c tmp.yml 
[2023-07-10 17:59:32] Error: There are option(s) in a config file that are not expected: bergamot-translator-app

(and similarly directly on the cli)

@XapaJIaMnu
Copy link
Collaborator Author

XapaJIaMnu commented Aug 2, 2023

Now GPU support is working. Reliably from the C++ test app, not so much from python.

The python interface randomly deadlocks, and I have no idea why. I didn't have this problem before so I blame the ensemble change, but i haven't bisected, or anything. FML.

The deadlocks happen even when compiling without CUDA compiled in. I guess I should be attaching GDB...

@XapaJIaMnu
Copy link
Collaborator Author

This is where we are at:

0x00007f2a3dd192ed in syscall () from /usr/lib/libc.so.6
(gdb) up
#1  0x00007f2a1a8de76a in std::__atomic_futex_unsigned_base::_M_futex_wait_until (this=<optimized out>, __addr=0x55d7e9141070, __val=2147483648, __has_timeout=<optimized out>, __s=..., __ns=...)
    at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/futex.cc:122
Downloading source file /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/futex.cc
122     /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/futex.cc: Directory not empty.                                                                                                                                    
(gdb) up
#2  0x00007f2a384e4b71 in std::__atomic_futex_unsigned<2147483648u>::_M_load_and_test_until (this=0x55d7e9141070, __assumed=0, __operand=1, __equal=true, __mo=std::memory_order_acquire, __has_timeout=false, 
    __s=std::chrono::duration = { 0s }, __ns=std::chrono::duration = { 0ns }) at /usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/bits/atomic_futex.h:109
109               bool __ret = _M_futex_wait_until((unsigned*)(void*)&_M_data,
(gdb) up
#3  0x00007f2a384cb2ba in std::__atomic_futex_unsigned<2147483648u>::_M_load_and_test (this=0x55d7e9141070, __assumed=0, __operand=1, __equal=true, __mo=std::memory_order_acquire)
    at /usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/bits/atomic_futex.h:158
158           return _M_load_and_test_until(__assumed, __operand, __equal, __mo,
(gdb) up
#4  0x00007f2a384a1ef8 in std::__atomic_futex_unsigned<2147483648u>::_M_load_when_equal (__mo=std::memory_order_acquire, __val=1, this=0x55d7e9141070)
    at /usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/bits/atomic_futex.h:212
212           _M_load_and_test(__i, __val, true, __mo);
(gdb) up
#5  std::__future_base::_State_baseV2::wait (this=0x55d7e9141060) at /usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/future:351
351             _M_status._M_load_when_equal(_Status::__ready, memory_order_acquire);
(gdb) up
#6  0x00007f2a384b4204 in std::__basic_future<marian::bergamot::Response>::wait (this=0x55d7e8790800) at /usr/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/c++/future:714
714             _M_state->wait();
(gdb) up
#7  0x00007f2a384a31b6 in ServicePyAdapter::translate (this=0x55d7e877a1e0, model=std::shared_ptr<marian::bergamot::TranslationModel> (use count 5, weak count 0) = {...}, 
    inputs=std::vector of length 10, capacity 10 = {...}, options=...) at /home/dheart/uni_stuff/postdoc/bergamot-translator/bindings/python/bergamot.cpp:82
82            futures[i].wait();

The code in question is here:

What's happening? For some reason a future just takes forever while no computation is happening...

@XapaJIaMnu
Copy link
Collaborator Author

XapaJIaMnu commented Aug 2, 2023

Reverting: git revert 4b0da8d434e5a688139255873afd177f647ef777 fixes the issue. @graemenail not sure why.
Blaming: 4b0da8d
On the bright side, python gpu works on a 3090 with a modern CUDA.

@XapaJIaMnu
Copy link
Collaborator Author

XapaJIaMnu commented Aug 2, 2023

How to compile the python interface with CUDA support:

export COMPILE_CUDA=ON 
export CMAKE_BUILD_PARALLEL_LEVEL=16
pip install .

Invoke as:

bergamot-translator -c model.npz.best-bleu.npz.decoder.brg.yml -i dataset -n 0 -g 0 -l info

bindings/python/translator.py Outdated Show resolved Hide resolved
std::string srcword;
std::string replacementword;
getline(ss, srcword, '\t');
getline(ss, replacementword, '\n'); // BEWARE of windows file ndings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why I would have liked it if we could have kept the "read the tsv file" part out of the bergamot C++ codebase, and just implement it in Python/Qt/JavaScript which are much better at integrating with the local environment.

… But treat this complaint so I'll stop complaining about this from now on!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, functionally, we can bypass the redundant terminology implementation, by applying terminology constraints directly. Maybe we can add this to the readme. I would like to keep a C++ implementation at the very least as documentation.

src/translator/service.cpp Outdated Show resolved Hide resolved
@@ -202,12 +283,14 @@ void AsyncService::pivot(std::shared_ptr<TranslationModel> first, std::shared_pt
void AsyncService::translate(std::shared_ptr<TranslationModel> translationModel, std::string &&source,
CallbackType callback, const ResponseOptions &responseOptions) {
// Producer thread, a call to this function adds new work items. If batches are available, notifies workers waiting.
// Tagging
if (!terminologyMap_.empty()) source = ReplaceTerminology(std::move(source), terminologyMap_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::string ReplaceTerminology(std::string const &str, TerminologyMap const &terminology);. I think I made the std::move pointless when I changed the ReplaceTerminilogy implementation.

Comment on lines +130 to +131
app.add_option("--terminology-form", config.format,
"Form for technology. Default is \"%s __target__ %s __done__ \". Change depending on the model.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to figure out how to store this in the model yaml instead before we merge?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedantically, we probably should.


} // namespace

std::string ReplaceTerminology(std::string const &str, TerminologyMap const &terminology) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once TerminologyMap becomes very large, this method is going to be stupidly slow. At that point we should change it into a state machine type of search thingy.

Copy link
Collaborator Author

@XapaJIaMnu XapaJIaMnu Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If terminology map is very large, i feel something wrong is happening.
State machine would always be better but also I feel this would be a time sink... ;D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants