Higher-level encoder decoder interfaces for transducer, attention, LM, ILM, etc #49
Note we have …
The high-level decoder interface would also cover the search aspect (#18), e.g. the use of …
You can see ongoing work in …

I started to implement a generic decoder class, not just the interface but also framewise training, full-sum training and search, maybe also alignment, and that for all possible cases: label-sync, time-sync, with vertical transitions, different decoder structures with slow-RNN and fast-RNN, different variants of the blank split, different stochastic dependencies, etc.

This turns out to be way too complicated, at least if we also want it to be efficient: to be efficient in all cases (training and decoding, all the possible neural structures, etc.), a lot of different cases have to be handled differently. See the current implementation, which already covers a lot but is still not really complete.

I now tend to think that such a generic implementation is a bad idea, because it is way too complicated, and that will make it difficult to work on when further extensions are needed. I think it is better to just provide generic building blocks and have specific implementations for the relevant cases.

However: I think we probably still can define a generic interface. But also this is not so clear:
In all cases, the interface should allow for the most efficient implementation. As usual, it might be helpful to look at some other frameworks which cover multiple model types (CTC, RNN-T and attention-based encoder-decoder (AED)), e.g. ESPnet, Fairseq or Lingvo.
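To illustrate the "generic building blocks plus specific implementations" point from above, here is a rough sketch. All class names and options are made up for illustration; it only assumes the usual `from returnn_common import nn` setup, not any existing code in the repo.

```python
from returnn_common import nn  # assumed import path

# What a fully generic implementation tends toward: one class with many orthogonal options,
# where each combination needs its own efficient code path.
class GenericDecoder(nn.Module):
    """Hypothetical monolithic decoder covering label-sync/time-sync, blank variants, etc."""
    def __init__(self, *, label_sync: bool, vertical_transitions: bool, blank_split: str):
        super().__init__()
        ...  # the combinatorial explosion lives here

# The alternative: small specific implementations which share generic building blocks.
class AttentionLabelSyncDecoder(nn.Module):
    """Hypothetical label-sync attention-based decoder; only the options this case needs."""

class TimeSyncTransducerDecoder(nn.Module):
    """Hypothetical time-sync transducer (RNN-T/RNA-like) decoder with explicit blank handling."""
```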
Speaking a bit more conceptually: we can differentiate between defining the model and the individual computations on it.

To define the model (just the parameters), we can simply take … For every individual computation, like training or recognition, we probably could have a separate interface, although some interfaces could be shared, e.g. for both recognition and alignment.
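As an illustration of that separation (interface names like ITraining/IRecognition are made up here, not taken from the repo):

```python
from returnn_common import nn  # assumed import path

class Model(nn.Module):
    """Defines the model, i.e. just the parameters (illustrative placeholder)."""

class ITraining:
    """Hypothetical interface for the training computation."""
    def train_loss(self, model: Model, data: nn.Tensor, targets: nn.Tensor) -> nn.Tensor:
        raise NotImplementedError

class IRecognition:
    """Hypothetical interface for recognition; could potentially be shared with alignment,
    since both operate on the same model and input data."""
    def recog(self, model: Model, data: nn.Tensor, *, spatial_dim: nn.Dim) -> nn.Tensor:
        raise NotImplementedError
```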
The interfaces should help make clear what kind of module is expected, e.g. that a Hybrid encoder should have a log-softmax output in order to work correctly with the tf-flow node, to make sure the network and the rest of the pipeline match. But actual implementations should only be given as reference, as it is likely that you need changes. Example: …
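The concrete example did not survive here; as a sketch of what such an interface could look like for the hybrid case (the name IHybridEncoder is made up):

```python
from returnn_common import nn  # assumed import path

class IHybridEncoder(nn.Module):
    """Hypothetical hybrid NN-HMM encoder interface.
    Contract: returns log-softmax scores over the output labels (shape {B,T,D}),
    so that the tf-flow node and the rest of the RASR pipeline can consume them directly."""
    def __call__(self, source: nn.Tensor, *, spatial_dim: nn.Dim) -> nn.Tensor:
        raise NotImplementedError
```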
Note that for search (RETURNN search or RASR), log-softmax is what you want, but for training, it is more efficient to get logits, because then a more efficient fused cross-entropy function can be used. You could say that logits might be more generic, and when generating the config for RETURNN search or RASR, it would just apply a …

So, this now generates lots of different cases. We don't really want that some case is inefficient just because of the interface. Currently, I tend to think that the interfaces (…
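As a sketch of how the logits variant could serve both cases (assuming returnn-common helpers roughly like nn.cross_entropy and nn.log_softmax; exact names and signatures may differ):

```python
from returnn_common import nn  # assumed import path

def training_loss(encoder, source: nn.Tensor, *, spatial_dim: nn.Dim,
                  targets: nn.Tensor, classes_dim: nn.Dim) -> nn.Tensor:
    """Training: keep the logits, so a fused softmax + cross-entropy can be used."""
    logits = encoder(source, spatial_dim=spatial_dim)
    return nn.cross_entropy(target=targets, estimated=logits,
                            estimated_type="logits", axis=classes_dim)

def search_scores(encoder, source: nn.Tensor, *, spatial_dim: nn.Dim,
                  classes_dim: nn.Dim) -> nn.Tensor:
    """Search (RETURNN search or RASR): apply log-softmax on top when generating the config."""
    logits = encoder(source, spatial_dim=spatial_dim)
    return nn.log_softmax(logits, axis=classes_dim)
```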
Yes, in my example I only meant recognition. I think the most important thing is to have well-documented pipelines and reference models, so that people can actually understand what is going on. This is important to make people consider switching. The rest is optional in my view, and just takes time from us that we need for testing actual models.
You mentioned focal_loss_option, i.e. about training. I think for recognition, the interface (…
But following that argumentation, maybe we should reduce …
The current draft:

```python
class IMakeDecoder:
    def __call__(self, source: nn.Tensor, *, spatial_dim: nn.Dim) -> IDecoder:
        raise NotImplementedError
```
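A hypothetical implementation of this factory interface, just to show the intent (MyDecoder and MyDecoderMaker are made-up names, continuing the code above):

```python
class MyDecoderMaker(IMakeDecoder):
    """Hypothetical: builds a concrete IDecoder on top of the given encoder output."""
    def __call__(self, source: nn.Tensor, *, spatial_dim: nn.Dim) -> IDecoder:
        return MyDecoder(source, spatial_dim=spatial_dim)  # MyDecoder: some concrete IDecoder
```

The factory separates constructing the decoder from the encoder output from actually stepping through it during training or search.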
The encoder interface is quite trivial, basically just any [LayerRef] -> LayerRef function, although the interface also should imply the tensor format {B,T,D} or so.

The idea was to have a generic interface for the decoder which allows defining both a transducer (in its most generic form, including RNN-T, RNA, etc.), either time-sync or alignment-sync, and a standard attention-based label-sync decoder.
The interface should allow for easy integration of an external LM, and also allow for integration of ILM estimation and subtraction.
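For search, that combination typically reduces to a per-step sum of log-scores (shallow fusion with ILM subtraction); a minimal sketch with placeholder scale values:

```python
from returnn_common import nn  # assumed import path

def combined_label_scores(decoder_log_prob: nn.Tensor, ext_lm_log_prob: nn.Tensor,
                          ilm_log_prob: nn.Tensor, *,
                          lm_scale: float = 0.5, ilm_scale: float = 0.4) -> nn.Tensor:
    """Per-step combined log-scores:
    log p_decoder + lm_scale * log p_extLM - ilm_scale * log p_ILM."""
    return decoder_log_prob + lm_scale * ext_lm_log_prob - ilm_scale * ilm_log_prob
```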
A current draft is here.
We should implement some attention-based encoder-decoder example and some transducer example using an external LM + ILM estimation and subtraction.
Transformer should then also be refactored to make use of this interface.