I am trying to figure out the implementation of the coverage mechanism, and after debugging for a while I still cannot understand why the procedure for producing the coverage vector in decode mode is NOT the same as in training/eval mode. Related code is here: this line.

The relevant comment in `attention_decoder` says:
> Note that this attention decoder passes each decoder input through a linear layer with the previous step's context vector to get a modified version of the input. If `initial_state_attention` is False, on the first decoder step the "previous context vector" is just a zero vector. If `initial_state_attention` is True, we use `initial_state` to (re)calculate the previous step's context vector. We set this to False for train/eval mode (because we call `attention_decoder` once for all decoder steps) and True for decode mode (because we call `attention_decoder` once for each decoder step).
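To make sure I am reading that control flow correctly, here is a rough, runnable NumPy sketch of how I understand it. All names, shapes, and the attention scoring below are simplified stand-ins for illustration, not the repo's actual code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(H, state, coverage):
    """Toy coverage attention: scores depend on the encoder states H,
    the decoder state, and the running coverage vector."""
    attn_dist = softmax(np.tanh(H @ state + coverage))
    context = attn_dist @ H
    return context, attn_dist, coverage + attn_dist  # coverage accumulates attention mass

def attention_decoder(decoder_inputs, initial_state, H,
                      initial_state_attention=False, prev_coverage=None):
    """Skeleton of the flow described above (one input per step); illustrative only."""
    state = initial_state
    coverage = np.zeros(len(H)) if prev_coverage is None else prev_coverage
    context = np.zeros(H.shape[1])            # "previous context" defaults to zeros
    if initial_state_attention:               # decode mode: re-compute the previous step's
        # context from initial_state; note this call also advances coverage
        context, _, coverage = attention(H, initial_state, coverage)
    attn_dists = []
    for x in decoder_inputs:
        # stand-in for passing [input, prev context] through a linear layer + LSTM cell
        state = np.tanh(x + context + state)
        if initial_state_attention:           # decode mode: second attention call;
            context, attn_dist, _ = attention(H, state, coverage)   # coverage NOT updated again
        else:                                 # train/eval: single call, which updates coverage
            context, attn_dist, coverage = attention(H, state, coverage)
        attn_dists.append(attn_dist)
    return attn_dists, state, coverage
```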
IMHO, the training and decode procedures would mismatch to some extent with such an implementation (please correct me if I am wrong).
For example, let `H` be all the encoder hidden states (a list of tensors). Then (a toy trace of both modes follows this comparison):

In training/eval mode, every decoder step uses the attention network only once:
Input: `H`, `current_decoder_hidden_state`, `previous_coverage` (`None` for the first decode step)
Output: next coverage, next context, and attention weights (i.e. `attn_dist` in the code).
In decode mode, every step applies the attention mechanism twice:
(1) The first time:
Input: `H`, `previous_decoder_hidden_state`, `previous_coverage` (0s for the first decode step)
Output: modified previous context and next coverage (the attention weights are discarded here)
(2) The second time:
Input: `H`, `current_decoder_hidden_state`, next coverage
Output: next context and attention weights (next coverage is NOT updated here)
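Using the toy `attention_decoder` sketched above (again, just my illustrative re-implementation, not the repo's code), the two modes would be driven roughly like this:

```python
import numpy as np
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 3))                      # 5 encoder states, hidden size 3
inputs = [rng.normal(size=3) for _ in range(4)]  # 4 decoder steps
s0 = rng.normal(size=3)                          # initial decoder state

# training/eval: attention_decoder is called once for all steps,
# so each step runs the attention network exactly once
_, _, coverage_train = attention_decoder(inputs, s0, H,
                                         initial_state_attention=False)

# decode: attention_decoder is called once per step; each call runs attention twice,
# and only the first (re-attention) call advances the coverage vector
state, coverage_decode = s0, None
for x in inputs:
    _, state, coverage_decode = attention_decoder([x], state, H,
                                                  initial_state_attention=True,
                                                  prev_coverage=coverage_decode)

print(coverage_train)
print(coverage_decode)   # compare how coverage accumulates in the two modes
```

If my reading is right, in decode mode only the first (re-attention) call ever advances the coverage vector, which is the asymmetry with training/eval mode that I am asking about.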