Determine pre-allocated storage #71

Open
helson73 opened this issue Nov 30, 2016 · 3 comments

Comments

@helson73

When pre-allocation is enabled, how can I determine which nodes' outputs are used to calculate gradients?
In the LSTM implementation, some nodes share both input and output storage between clones, while other nodes share only inputs.
I want to add a peephole connection to the current LSTM, and I am quite confused.
It seems that if I want to decide which nodes should share both and which should share only inputs when a new model is deployed, I have to fully understand how nodes are handled in gModule ...
Any idea?
Thanks.

@helson73
Author

I also have a question about "clones": since each clone's parameters point to the same storage, what is the difference compared to not using clones at all?
Parameters are shared either way, whether we use clones or not, right?
But with clones and pre-allocation enabled, some node buffers are also shared, so is that the main purpose?

@jsenellart

prealloc is an ugly but effective tweak.

All clones indeed share the parameters, and with preallocation we also share the internal buffers used to store gradInput and some outputs.

But in both cases we share only the intermediate buffers; we cannot share any buffer exposed outside of the nn graph, since, as you say, the main goal of using clones is to keep fully independent modules. So in other words, each clone can be represented as:

gradInput <- (CLONE) -> output
                ||
          SHARED PARAMETERS

The outermost gradInput and output cannot be shared at all (while all the parameters are shared), but whatever is inside the clones can be shared as long as we don't mess up the calculation path.

For outputs, we have an additional constraint: some modules use their output to calculate gradInput, so we cannot share those outputs at all.

I hope this helps. If you want to add a peephole connection to the current LSTM, the safest approach is to turn off preallocation.
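
For illustration, here is a minimal sketch (typical Torch usage, not the actual OpenNMT code) of how clones can share parameters via nn.Module:clone() while each clone keeps its own output and gradInput buffers:

```lua
require 'nn'

-- hypothetical example: one time-step prototype, plus clones that share
-- its parameter storages (weights, biases and their gradients)
local proto = nn.Linear(4, 4)

-- clone() with tensor names makes the clone point at the prototype's
-- storages for those tensors; output and gradInput stay separate per clone
local clone1 = proto:clone('weight', 'bias', 'gradWeight', 'gradBias')
local clone2 = proto:clone('weight', 'bias', 'gradWeight', 'gradBias')

-- parameters are shared: an update to the prototype is visible in the clones
proto.weight:fill(0.5)
print(clone1.weight[1][1], clone2.weight[1][1])  -- both print 0.5

-- but each clone keeps its own output buffer, so the network can be
-- unrolled over time without one step overwriting another
clone1:forward(torch.randn(4))
clone2:forward(torch.randn(4))
print(torch.pointer(clone1.output) ~= torch.pointer(clone2.output))  -- true
```

Preallocation goes one step further and also shares the intermediate buffers inside the clones, which is exactly where the constraints above come from.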

@helson73
Author

helson73 commented Dec 1, 2016

@jsenellart-systran
Thank you for your help.
I am afraid it is hard to drop pre-allocation in my case.
I was previously working with Theano-based NMT systems, but after suffering from Theano's inflexible memory management, I decided to switch to Torch last week. (Recently we have been working on much more complex and large-scale NMT systems.)
The folks at Harvard and SYSTRAN and their awesome work presented here actually gave me a lot of motivation.

About choosing which nodes can share outputs: it seems that every node in Torch has its own overridden functions such as "updateGradInput" and "accGradParameters". If I am right, whenever either of these two functions uses "self.output", the output should not be shared between clones. As you said, a Sigmoid node should not share its output because its "updateGradInput" function actually uses "self.output". But neither of the Linear node's two functions uses "self.output" at all, which is why a Linear node in the LSTM can share both inputs and outputs.
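
To illustrate what I mean, here is a simplified sketch of the math behind the two cases (illustrative only, not the actual torch/nn sources, which dispatch to C kernels):

```lua
-- illustrative only: simplified versions of what updateGradInput computes,
-- not the real torch/nn implementations

-- Sigmoid: d(sigma(x))/dx = output * (1 - output), so gradInput depends on
-- self.output and the output buffer cannot be shared between clones
local function sigmoidUpdateGradInput(self, input, gradOutput)
   -- output * (1 - output), written as output - output^2
   local doutput = self.output - torch.cmul(self.output, self.output)
   self.gradInput = torch.cmul(gradOutput, doutput)
   return self.gradInput
end

-- Linear: gradInput = W^T * gradOutput uses only the weights and gradOutput,
-- never self.output, so the output buffer can safely be shared
local function linearUpdateGradInput(self, input, gradOutput)
   self.gradInput = torch.mv(self.weight:t(), gradOutput)
   return self.gradInput
end
```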
