TranslationFlow
This document discusses the flow of a translation request through the translation service. It is currently a live discussion document and things will change as the discussion progresses. Note that my description does not accurately reflect what's currently going on in mts (currently workers process MaxiBatches directly). To facilitate discussion, please respond inline (with MarkDown >, >>, etc) and 'sign' your contributions with your initials. - UG
The original document was posted by UG, so any text that's not quoted and signed should be assumed to come from UG.
Contributors to this document (please add yourself with your name and initials if you comment inline):
- Ulrich Germann (UG)
- Jerin Philip (JP)
- Kenneth Heafield (KH)
- Abhishek Aggarwal (AA)
KH Sigh, are we redesigning the API again?
UG I have no strong feelings and am fairly flexible about this. That said, I think there are benefits to returning a struct right away with a future to the result as a member, rather than a future to the overall structure, as it will allow the client to monitor progress if they want to. If the client gets a future to the completed request, it can't monitor progress. Exact use case for progress monitoring TBD. Could just be a translation progress bar somewhere. The main point of this write-up is to discuss internal processing, not API. This section just lays out what input I expect, regardless of how exactly I get it. I really don't care that much about the minutiae of the API design.
JP By API, the Bergamot API? I think we shouldn't. My understanding is that we should only be bothered about a concrete implementation of AbstractTranslationModel::translate. The way it's designed currently, no multilingual models like Johnson (2016)? Are you talking about something different?
AA: Can anyone share what sort of information we can provide to the client regarding the progress of the request, and what a client will be able to do with this information? As per my understanding, the client can either cancel the request or wait (via the future) for it to finish. We can leave it to the client to decide what to do until the future gets resolved; e.g., the client can show a progress bar (as Uli suggested) in the UI until then.
I don't think we need a struct for this (the following can just be function call parameters; see the signature sketch after this list), except maybe for the translation options. - UG
- input (string): The string to be translated
(JP: Is this bounded in length?)
UG: At some point we'll also need to add errors and warnings to the response. It should be bounded in length, but ultimately it's the number of tokens and not the number of bytes that matters, so we'll need two checks: one for string length and one for number of tokens. If the sentence is too long, we have two options: chop it up and add a warning that it was chopped up, or refuse it with an appropriate error message.
KH This came up on the call. Chop. I don't think we need a warning yet. The cap should ideally be denominated in tokens.
AA: We agreed in plenary, as Kenneth mentioned, that a limit on the input is imposed primarily by the consumer of the API. Therefore, I assume we are not placing any cap on the length of the input at the API level. Agreed with Uli on adding errors/warnings to the response in general (not just for chopping). When long sentences are chopped, should we warn the client?
- sourceLanguage (string): The source language (not to be used initially, but may be useful later when a single TranslationService API offers multiple translation directions.)
- targetLanguage (string): The target language (not to be used initially, but may be useful later when a single TranslationService API offers multiple translation directions.)
- search parameters:
- withAlignment: include alignment info in the response?
I suggest making this optional, because providing it costs extra time, and that is wasted if the client has no need for it. - UG
JP: +1
- withQualityEstimate: include quality estimates. Details to be determined; not relevant at this point. - UG
- Optionally, a callback function to be executed at the end of processing the request (e.g., because the client doesn't want to keep track of the future; this is reasonable, for example, in a message-passing scenario where one thread reads from an input channel and, instead of keeping track of things, the response is to be sent to an output channel). The default callback is to fulfil the promise corresponding to the future.
JP: The quality estimate should be a second request (knowing only that it is MCDropout). Mixing it here doesn't save much compute.
- ...
> withAlignment: include alignment info in the response?
AA: The current API structure (at least from the input perspective) is in line with what Uli mentioned above: the text to be translated as a separate parameter, and additional optional requests like QE, alignment, etc. provided via the TranslationRequest structure. Having a separate structure for these additional requests will basically keep the API clean. Regarding the source/target language pair, I designed the TranslationModel class to translate only from 1 source to 1 target language. However, TranslationModel can be modified later to support multi-source to multi-target language pairs (do such models exist?). The API can be modified to accept language pairs and priority as separate parameters later as well.
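To make the discussion concrete, here is a rough sketch of what such an entry point could look like if the parameters listed above are passed directly. All names here (TranslationService, TranslationOptions, TranslationResponse, the callback type) are placeholders of mine, not a settled API:

```cpp
// Hypothetical sketch only; names and signature are placeholders, not a
// settled API. The parameters listed above are passed directly, with a
// small struct for the optional search parameters.
#include <functional>
#include <memory>
#include <string>

struct TranslationOptions {
  bool withAlignment = false;        // include alignment info in the response?
  bool withQualityEstimate = false;  // details to be determined
};

struct TranslationResponse {
  std::string translation;
  // errors/warnings, alignment info, quality estimates would go here
};

struct TranslationRequest;  // the returned handle; sketched further below

class TranslationService {
public:
  std::shared_ptr<TranslationRequest>
  translate(std::string input,           // the string to be translated
            std::string sourceLanguage,  // not used initially
            std::string targetLanguage,  // not used initially
            TranslationOptions options = {},
            // optional callback, executed at the end of processing; the
            // default (nullptr) means: just fulfil the promise behind the
            // request's future
            std::function<void(TranslationResponse&&)> callback = nullptr);
};
```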
Upon receiving a request, the server returns a TranslationRequest object (or a unique_ptr to one), which contains the following (sketched in code after the discussion below):
- the original parameters from above
- some stats that reflect the current state of processing
- a future to the result
With respect to optional cancelling, my suggestion for implementation would be that the server returns a shared_ptr to the TranslationRequest, and we keep a weak_ptr within the service for processing. If the request goes away, the weak_ptr will be invalid, so the service knows not to bother with it. - UG
AA: I feel this is kind of a secret way of cancelling an already running request. Would an explicit call to cancel be a nicer approach? We could return a unique id for every translation request submitted to the server, and the client could cancel using this unique id. Of course, we would need to modify the API (that can be done later) to return a unique id and a future to the translation result in the response.
AA: The current API already encapsulates the 1st and 3rd in the TranslationResult structure. Could you explain a bit more what sort of information we can provide to the client regarding the progress of the request, and what a client will be able to do with this information?
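Putting the three items together, a minimal sketch of the returned handle, reusing the hypothetical types from the sketch above; the stats fields are one guess at what "current state of processing" could mean:

```cpp
#include <atomic>
#include <cstddef>
#include <future>
#include <string>

// Hypothetical sketch; TranslationOptions and TranslationResponse as in
// the sketch further above.
struct TranslationRequest {
  // 1. the original parameters from above
  std::string input;
  TranslationOptions options;

  // 2. some stats that reflect the current state of processing
  std::atomic<std::size_t> sentencesTotal{0};
  std::atomic<std::size_t> sentencesDone{0};

  // 3. a future to the result; the service fulfils the matching promise
  std::future<TranslationResponse> result;
};
```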
After receiving a request the service performs the following preprocessing steps:
- sentence splitting
- tokenization
and pushes each tokenized sentence onto the MaxiBatchQueue. It gets back a future to the result of this sentence translation (or rather, a struct that contains this future; this allows us to monitor jobs while they are in progress). The MaxiBatchQueue lines up pending individual sentence translation jobs (Job) for processing.
As an interim summary: Client posts paragraph-level request, gets back struct that contains future to paragraph-level result. At the sentence level, we use the same mechanism, but that's internal to the service and not exposed to the outside.
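For illustration, the sentence-level job and the struct containing its future might look roughly like this (the names are mine, not mts code):

```cpp
#include <future>
#include <string>
#include <vector>

// Hypothetical sketch of a pending sentence-level translation job.
struct Job {
  std::vector<std::string> tokens;    // one tokenized sentence
  std::promise<std::string> promise;  // fulfilled once the sentence is translated
};

// What the request-level code keeps per sentence: a struct containing the
// future, so the job can be monitored while it is in progress.
struct JobHandle {
  std::future<std::string> result;
};
```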
The BatchGenerator monitors both the MaxiBatchQueue and the MiniBatchQueue. It reads at most MaxiBatchSize of tokenized input (we can be flexible whether that's measured in tokens or sentences) from the MaxiBatchQueue, sorts the respective sentences, creates batches of sentences of ideally similar length, and pushes those onto the MiniBatchQueue. It processes less than MaxiBatchSize of input if the MiniBatchQueue is empty and the MaxiBatchQueue holds less than that; then it just processes what's there to keep the MiniBatchQueue filled.
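Read that way, the BatchGenerator's loop could be sketched as follows. fillFromMaxiBatchQueue, splitIntoMiniBatches, MaxiBatchSize and the queue objects are assumptions for illustration, and the queues are assumed to be thread-safe:

```cpp
#include <algorithm>
#include <vector>

// Sketch of the BatchGenerator loop described above. fillFromMaxiBatchQueue,
// splitIntoMiniBatches, MaxiBatchSize and the queues are assumptions, not
// actual mts code; ownership is simplified (see the weak_ptr discussion
// further below).
void batchGeneratorLoop() {
  std::vector<Job> pending;
  for (;;) {
    // Read at most MaxiBatchSize of input (tokens or sentences). If the
    // MiniBatchQueue is running dry, take whatever is available instead of
    // waiting for a full maxi-batch, so the workers stay fed.
    fillFromMaxiBatchQueue(pending, MaxiBatchSize);
    // Sort so that each mini-batch holds sentences of similar length.
    std::sort(pending.begin(), pending.end(),
              [](const Job& a, const Job& b) {
                return a.tokens.size() < b.tokens.size();
              });
    for (auto& batch : splitIntoMiniBatches(pending))
      miniBatchQueue.push(std::move(batch));
    pending.clear();
  }
}
```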
The translation service maintains a number of workers (one per 'device', which can be a CPU core or a GPU), each of which monitors the MiniBatchQueue and processes one batch after another in a run-until-Simon-says-stop loop. After batch processing, a callback function is called for each individual sentence; it fulfils that sentence's promise. Once all promises within a multi-sentence request have been fulfilled, the promise for the request is fulfilled.
> a number of workers (one per 'device', which can be a CPU core or a GPU), each of which monitors the MiniBatchQueue

Do you mean to say MaxiBatchQueue here?
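Assuming the main text is right that workers consume the MiniBatchQueue, one worker's loop might be sketched like this. Batch, translate() (assumed to fill a translation field on each job) and a blocking pop() that returns false on shutdown are assumptions:

```cpp
#include <atomic>
#include <utility>

// Sketch of one worker's run-until-Simon-says-stop loop. Batch, translate()
// and the queue's blocking pop() are assumptions for illustration.
void workerLoop(std::atomic<bool>& stop) {
  while (!stop.load()) {
    Batch batch;
    if (!miniBatchQueue.pop(batch))  // blocks; false means shut down
      break;
    translate(batch);                // run the model on the whole mini-batch
    // the per-sentence callback: fulfil each sentence's promise
    for (auto& job : batch.jobs)
      job.promise.set_value(std::move(job.translation));
  }
}
```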
I suggest using shared and weak pointers for self-cancelling requests.
- Service returns a shared pointer to the TranslationRequest object, stores only a weak_ptr.
- When processing the TranslationRequest, the Service stores shared pointers to the sentence-level internal jobs on the TranslationRequest and keeps only weak pointers otherwise. Weak pointers to sentence-level jobs go onto the MaxiBatchQueue. When batching jobs for translation, the batch generator generates a shared_ptr<Batch> and stores this pointer in the sentence-level job object/struct. A weak_ptr goes onto the MiniBatchQueue. If all the jobs in a batch go away (because all the respective original translation requests vanished / were cancelled), the weak_ptr in the MiniBatchQueue becomes invalid, so the workers know not to bother with them.
I realize that this involves quite a few memory allocations. I'm open to alternative suggestions.
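To make the ownership wiring concrete, a sketch under the scheme just described (all type names hypothetical):

```cpp
#include <memory>
#include <vector>

struct Batch;

// Sentence-level job; the TranslationRequest owns it via shared_ptr.
struct Job {
  std::shared_ptr<Batch> batch;  // set by the batch generator; keeps the batch alive
  // ... tokens, promise, etc.
};

struct Batch {
  std::vector<std::weak_ptr<Job>> jobs;  // the batch does not own its jobs
};

// Worker side: the MiniBatchQueue yields a weak_ptr<Batch>.
void process(std::weak_ptr<Batch> wb) {
  std::shared_ptr<Batch> batch = wb.lock();
  if (!batch) return;  // every job (hence every request) went away; skip
  for (auto& wj : batch->jobs) {
    if (std::shared_ptr<Job> job = wj.lock()) {
      // the originating TranslationRequest still exists:
      // translate this sentence and fulfil its promise
    }
    // else: this particular request was cancelled; skip just this job
  }
}
```

If every job in a batch is cancelled, the jobs' shared_ptr<Batch> members were the batch's only owners, so the Batch is destroyed and the weak_ptr sitting in the MiniBatchQueue expires, which is exactly the behaviour described above.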
JP: What is this now? Same payload as REST with JSON response? How do ranges come in here?
UG: I've been trying to get a response to the latter question for a while, especially with respect to the fact that JSON is usually encoded in UTF-8 and JavaScript uses UTF-16 internally, so byte ranges may not be as useful as code point ranges.
By the way, if you want to take over alignment handling (map token alignments back to StringPiece for the time being; we can scratch our heads as to how to convert that to JSON later), that would be great, as nothing has been done in that respect yet beyond reporting token alignments in the JSON handler. As for the first two questions, in the REST server it's JSON blob in (via POST), JSON blob out. I'm using RapidJSON for JSON handling; the code is in the src/service/api directory and its subdirectories in mts, specifically
- job2json and hyp2json here:
RapidJSON is a bit unwieldy because it is designed to be fast and avoids memory allocation wherever possible, so Kenneth will love it.
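For readers new to RapidJSON, a minimal example of the blob-out direction; the field name "text" is illustrative, not the actual mts response schema:

```cpp
#include <string>
#include "rapidjson/document.h"
#include "rapidjson/stringbuffer.h"
#include "rapidjson/writer.h"

// Serialize a translation into a JSON blob with RapidJSON.
std::string toJson(const std::string& translation) {
  rapidjson::Document d;
  d.SetObject();
  auto& alloc = d.GetAllocator();  // allocation is explicit throughout
  d.AddMember("text",
              rapidjson::Value(translation.c_str(), alloc),  // copies the string
              alloc);
  rapidjson::StringBuffer buffer;
  rapidjson::Writer<rapidjson::StringBuffer> writer(buffer);
  d.Accept(writer);
  return buffer.GetString();
}
```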