Two considerations were taken into account when the mathematical operations were implemented:
- At the time of the implementation, the version of TensorFlow used (r0.11) lacked adequate support for slicing and for assigning values to slices.
- A vectorized implementation is usually better than an implementation with a loop, mainly because of the possible parallelism.
Most of the operations described in the paper can be implemented straightforwardly in TensorFlow, except possibly for two: the allocation weighting calculation and the link matrix update. Both are described in a slicing and looping manner, which can make their current implementation look a little convoluted. The following attempts to clarify how these operations were implemented.
In the paper, the allocation weightings are calculated using the formula:

$$a_t[\phi_t[j]] = \left(1 - u_t[\phi_t[j]]\right)\prod_{i=1}^{j-1} u_t[\phi_t[i]]$$

where $\phi_t$ is the free list: the indices of the memory locations sorted in ascending order of their usage $u_t$. This can be implemented naively with a loop with a runtime complexity of $\mathcal{O}(N)$ per batch, where $N$ is the number of memory words. There is no way to escape the slice-assignment operations, because reordering the sorted free-list back into its original places is crucial. However, there is a possibility to make things a little faster.
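For concreteness, the following is a minimal NumPy sketch of that naive loop for a single usage vector (illustrative only; the function and variable names are mine, not the implementation's). The running product is accumulated, which is what keeps the loop at $\mathcal{O}(N)$.

```python
import numpy as np

def naive_allocation_weighting(usage):
    """Illustrative sketch: allocation weighting for a single usage vector."""
    N = usage.shape[0]
    free_list = np.argsort(usage)       # phi_t: locations sorted by ascending usage
    allocation = np.zeros(N)
    running_product = 1.0
    for j in range(N):
        idx = free_list[j]
        allocation[idx] = (1.0 - usage[idx]) * running_product  # a_t[phi_t[j]]
        running_product *= usage[idx]   # accumulate prod_{i <= j} u_t[phi_t[i]]
    return allocation
```

The faster, partially vectorized version used in the implementation is the following: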
```python
# exclusive cumprod along the sorted axis gives prod_{i<j} u_t[phi_t[i]] for each position j
shifted_cumprod = tf.cumprod(sorted_usage, axis=1, exclusive=True)
unordered_allocation_weighting = (1 - sorted_usage) * shifted_cumprod

allocation_weighting_batches = []
for b in range(self.batch_size):
    allocation_weighting = tf.zeros([self.words_num])
    unpacked_free_list = tf.unpack(free_list[b])
    for pos, original_indx in enumerate(unpacked_free_list):
        # row `original_indx` of the identity matrix acts as a one-hot mask
        mask = tf.squeeze(tf.slice(self.I, [original_indx, 0], [1, -1]))
        allocation_weighting += mask * unordered_allocation_weighting[b, pos]
    allocation_weighting_batches.append(allocation_weighting)

return tf.pack(allocation_weighting_batches)
```
In this implementation, we first calculate all the required products on the sorted usage to get an unordered version of the allocation weighting, and then use a loop to put the weightings back into their correct places. Because there is no differentiable way in TensorFlow to assign directly to slices of a tensor, a mask with a 1 in the desired slice position and zeros elsewhere is multiplied by the value to be assigned and added to the target tensor.
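The same masking trick in isolation, as a small standalone example (hypothetical toy values, assuming TF r0.11-era APIs):

```python
import tensorflow as tf

words_num = 4
I = tf.diag(tf.ones([words_num]))                # identity matrix; row k is a one-hot mask for index k

target = tf.zeros([words_num])
mask = tf.squeeze(tf.slice(I, [2, 0], [1, -1]))  # row 2 of I -> [0., 0., 1., 0.]
target = target + mask * 0.7                     # differentiable stand-in for target[2] = 0.7

with tf.Session() as sess:
    print(sess.run(target))                      # [0.   0.   0.7  0. ]
```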
In this implementation, the reordering loop is still $\mathcal{O}(N)$, but it allows the exploitation of the possible parallelism of the cumprod operation, which could take the runtime complexity of the products down to roughly $\mathcal{O}(N/p)$, where $p$ is the number of parallel processors. I'm not sure if TensorFlow implements a parallel version of cumprod, but the implementation can exploit it if it's there.
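To make the cumprod-based formulation concrete, here is a tiny worked example with made-up numbers (NumPy is used only for illustration):

```python
import numpy as np

sorted_usage = np.array([0.2, 0.5, 0.9])     # usages along the free list, in ascending order

# exclusive cumulative product: [1, 0.2, 0.2 * 0.5] = [1.0, 0.2, 0.1]
shifted_cumprod = np.cumprod(np.concatenate(([1.0], sorted_usage[:-1])))

unordered_allocation = (1 - sorted_usage) * shifted_cumprod   # [0.8, 0.1, 0.01]
```

The least-used location (usage 0.2) receives most of the allocation weighting, as intended.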
The paper's original formulation of the link matrix update is an index-by-index operation:

$$L_t[i,j] = \left(1 - w_t^w[i] - w_t^w[j]\right) L_{t-1}[i,j] + w_t^w[i]\, p_{t-1}[j], \qquad L_t[i,i] = 0$$
A vectorized implementation of this operation can be written as:

$$L_t = \left[\left(1 - \left(w_t^w \oplus w_t^w\right)\right) \circ L_{t-1} + w_t^w\, p_{t-1}^\top\right] \circ (1 - I)$$

where $\circ$ is elementwise multiplication, $I$ is the identity matrix, and $\oplus$ is a pairwise addition operator defined as:

$$u \oplus v = \begin{pmatrix} u_1 + v_1 & \cdots & u_1 + v_n \\ \vdots & \ddots & \vdots \\ u_n + v_1 & \cdots & u_n + v_n \end{pmatrix}$$

where $u, v \in \mathbb{R}^n$. This allows TensorFlow to parallelize the operation, but of course with a cost incurred on the space complexity.
The elementwise multiplication by $(1 - I)$ is there to ensure that all diagonal elements are zero, thus eliminating self-links.
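For a single batch element, one way this could be expressed with r0.11-era ops looks roughly like the following (a sketch with variable names of my own choosing, not necessarily the exact code of the implementation):

```python
# w:    current write weighting,    shape [words_num]
# p:    previous precedence vector, shape [words_num]
# link: previous link matrix,       shape [words_num, words_num]
# I:    identity matrix,            shape [words_num, words_num]

w_col = tf.expand_dims(w, 1)   # [words_num, 1]
w_row = tf.expand_dims(w, 0)   # [1, words_num]
p_row = tf.expand_dims(p, 0)   # [1, words_num]

pairwise_sum = w_col + w_row                                    # broadcasting: (w ⊕ w)[i, j] = w[i] + w[j]
new_link = (1 - pairwise_sum) * link + tf.matmul(w_col, p_row)  # outer product w p^T via matmul
new_link = (1 - I) * new_link                                   # zero the diagonal, eliminating self-links
```

The pairwise addition materializes an $N \times N$ tensor per batch element, which is the extra space cost mentioned above.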
- Controller weights are sampled at most one standard deviation away from a zero-mean normal distribution, with a variance that depends on the size of the input vector coming into the weight matrix.
- Memory's usage and precedence vectors and link matrix are initialized to zero, as specified by the paper.
- Memory's matrix, read and write weightings, and read vectors are initialized to a very small value (10⁻⁶). Attempting to initialize them to 0 resulted in NaNs after the first few iterations.
These initialization schemes were chosen after many experiments on the copy task, as they showed the highest degree of stability in training (the highest ratio of convergence and the smallest ratio of NaN-outs). However, they might need reconsideration with other tasks.
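As a rough sketch of what these initial values might look like in code (the shapes and variable names here are placeholders, not necessarily those of the actual implementation):

```python
# example sizes, chosen arbitrarily for illustration
batch_size, words_num, word_size, read_heads = 1, 256, 64, 4

# zero-initialized state, as specified by the paper
usage_vector      = tf.zeros([batch_size, words_num])
precedence_vector = tf.zeros([batch_size, words_num])
link_matrix       = tf.zeros([batch_size, words_num, words_num])

# initialized to a small epsilon instead of zero to avoid early NaNs
memory_matrix   = tf.fill([batch_size, words_num, word_size], 1e-6)
write_weighting = tf.fill([batch_size, words_num], 1e-6)
read_weightings = tf.fill([batch_size, words_num, read_heads], 1e-6)
read_vectors    = tf.fill([batch_size, word_size, read_heads], 1e-6)
```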