Inline operations while calling a torch class #10
Ok. That's a pretty comprehensive bug report. Thanks! :-) Question: which version of Python are you using, i.e. 2.7, 3.4 or 3.5? |
I guess Python 2.7, right? Because of this line:
(Python 3 would be:
)
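(An illustrative guess, since the quoted lines did not survive into this copy: the tell-tale would be a Python-2-style print statement such as `print 'This is the value:', a`, where Python 3 needs the function form `print('This is the value:', a)`.)
|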
You are right, Python 2.7 |
So, using python 2, I get the following output:
Leaving aside the output of |
Your output from 'write' is right, I get something looking like this:
This is the value:
This is the value: [ 0.00000000e+00 -1.58456325e+29 4.35138757e-27 -2.00048709e+00 ...
As you can see, the values after setting the tensor from an inline operation are wrong, at least some of them. |
Ok. Some difference in our systems or installation. Let's compare systems and so on first... mine is:
You? |
(hmmm... I'm actually using a different version of torch too... let me double-check on a different torch distro...) |
(using mainstream torch distro, same results as my earlier results above) |
Mine is quite different.
|
I'll try on an Ubuntu 14.04 in a while too |
ok. biggest difference here is: "Mac OS X". Why? Because on Mac, pytorch uses |
Ohh ok, I noticed that... I'll let you know if I can reproduce the problem on the Ubuntu machine. |
(the switch is here by the way, just in case you were curious: https://github.com/hughperkins/pytorch/blob/master/src/nnWrapper.jinja2.cpp#L56-L60
|
(on linux, can switch at build time like this:
) |
Nice, I'll try both builds, one with luajit and the other with lua. |
hmmm, on linux, with lua, results are still ok. After building with the config set to use lua, I run like this:
... which shows the correct results ... and then verify in
|
Your numpy is older than mine. Any reason for using an older numpy? I can downgrade my numpy if necessary. I have to go now. Will look again tomorrow... |
No reason for that numpy, I'll update mine and repeat the experiments. |
I tried on an Ubuntu machine with numpy version 1.10 and it works properly. |
Hmmm. That makes debugging challenging :-P. Can you double-check that you are indeed linking with
... and see if this changes anything? |
(you can also remove libluajit.[something] from ~/torch/install/lib too, and check stuff still runs. If it doesn't, you're probably linking to libluajit, and that has memory issues on Mac
) |
Building with ... The pop operation from the Torch class is still not working as expected, though. |
Yes. It was never intended that
Can you type
Yes, agreed. Will look into this. Can you post the answer to |
Sorry for the late response, |
Oh, I never fixed the ... For the
Run it like:
|
well... for the memory thing, yeah it is a memory issue, but I'm not quite sure what a good solution would be. We can wipe the array to zeros by doing:
output:
so, we basically allocate a new numpy array of zeros, and that wipes the existing array. Probably the torch tensors should add a reference to the passed-in numpy arrays somehow, so they don't get freed/reused.
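Here is a runnable sketch of the mechanism being described, in pure numpy/ctypes; the raw pointer stands in for the Lua-side tensor, which keeps only an address into the numpy buffer rather than a Python reference. (Reading freed memory like this is undefined behaviour, which is exactly the bug.)

```python
import ctypes
import numpy as np

def borrow_pointer(arr):
    # Keep only the raw address and length -- no Python reference
    # to `arr` survives this call, mimicking the Lua-side tensor.
    return arr.ctypes.data, arr.size

# The array below is an inline temporary: it is freed as soon as the call returns.
addr, n = borrow_pointer(np.arange(6, dtype=np.float32) + 100)

# A fresh same-sized allocation is likely to reuse the freed block...
wiper = np.zeros(6, dtype=np.float32)

# ...so dereferencing the stale pointer now often shows zeros instead of 100..105.
view = np.ctypeslib.as_array((ctypes.c_float * n).from_address(addr))
print(view)
```
|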
I guess the np array should be stored in the python torch tensor object somewhere probably, to hold a reference. However I'm not quite sure how to do this for now... What I've tried: In PyTorch.jinja2.pyx
This complained that nparray is not in tensor, so I modified Tensor.jinja2.pxd, to add this, as an
This then builds and runs (with a very aggressive
However, it seems not to affect the lifetime: the above test code still prints zeros. I tried also putting into the Tensor.jinja2.pyx:
... however this fails at runtime, with ... Thoughts?
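For reference, the idea being attempted here, sketched in plain Python (the real change would live in the Cython Tensor class; the names below are illustrative):

```python
import ctypes
import numpy as np

class TensorWrapper:
    """Stands in for the Cython PyTorchTensor (illustrative name)."""
    def __init__(self, nparray):
        # Holding this reference pins the numpy buffer for as long
        # as the wrapper itself stays alive.
        self._nparray = nparray
        self.addr = nparray.ctypes.data  # raw pointer handed to the C side

t = TensorWrapper(np.arange(6, dtype=np.float32) + 100)  # inline temporary
wiper = np.zeros(6, dtype=np.float32)
view = np.ctypeslib.as_array((ctypes.c_float * 6).from_address(t.addr))
print(view)  # prints 100..105: the buffer survived, because `t` is still alive
```

Note this only helps while the Python wrapper itself stays alive; as later comments in this thread show, if the wrapper is destroyed once ownership passes to Lua, the numpy reference goes with it.
|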
Can I leave you to look at what options we can use for incrementing the reference? I'm trying to get a paper on cltorch published out to arxiv (perhaps 'dumped' onto arxiv is a better term). And there is only one of me :-D |
After running the shell script I got ... Regarding the memory issue when pulling from torch to python: this is interesting actually, allocating a new numpy array wipes the tensor... Thanks once more for the help and time! |
Note to self: so that I have the code if I get a moment to check this issue, it is here: https://gist.github.com/hughperkins/fdf8e27daec983e69767ef6bbaf1db9f Current output:
|
Note to self: relevant similar issue, and commit/solution:
|
Note to self: the draft fix from earlier, that doesn't work...: https://github.com/hughperkins/pytorch/compare/trying-fix-inline-issue?expand=1 |
Unit test (failing) added in c76846d |
Added a bunch of debugging in ea08d23. Output: https://gist.github.com/hughperkins/98243072eafe24b8b9f919a306fe9d07 |
Seems like the numpy array inside the PyTorchTensor is having its lifetime linked to that of the parent PyTorchTensor correctly, but the PyTorchTensor itself is being destroyed, i.e. it is not being incref'd based on it now being owned by the underlying lua object. |
So, it looks like:
(via
)
Obviously simply storing an integer in the lua registry won't in itself lock the lifetime of the PyTorchTensor python object to any lua objects...
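One shape a fix could take (a hedged sketch, not existing project API): a Python-side registry that holds a strong reference for as long as the Lua side claims ownership. Getting the release hook right on Lua gc is the hard part.

```python
# Module-level map of handle -> tensor; holding the value here is the
# "incref". All names below are illustrative, not existing API.
_owned_by_lua = {}

def on_handed_to_lua(py_tensor):
    handle = id(py_tensor)
    _owned_by_lua[handle] = py_tensor  # keep the Python object alive
    return handle                      # e.g. the integer stored in the lua registry

def on_released_by_lua(handle):
    _owned_by_lua.pop(handle, None)    # allow normal gc again
```
|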
This is kind of non-trivial to fix (we'd probably need to start thinking about gc and stuff, if we start holding a bunch of python-side references to python objects that have been passed lua-side). So, since most stuff works without it, I think I shall leave it for now. |
Hi,
First of all, thanks for pytorch, it's awesome work and really useful!
I just found that when doing inline operations (i.e. on a numpy array) and passing the result to a method on Torch to, let's say, store it as a class value, something goes wrong. I guess this has to do with memory references while passing data from one side to the other.
I made a simple example code reproducing the issue.
I have a python main.py file that imports, creates and uses a Torch class.
The Torch class has an 'init' method which does nothing, a 'set' method which sets a tensor of the class to a given numpy array, and a 'write' method that prints the mentioned tensor.
The main.py file only creates the class and sets/prints the torch tensor inside the Torch class: first by creating a numpy array, modifying it and storing the value in a separate variable, and then by modifying the numpy array inline while passing it to torch.
The problem is, when passing the numpy array created inline in the call, the values the Torch class stores are not right.
The way this doesn't happen is:
Running the example will be way more clarifying than my explanation.
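For a rough idea, a hedged sketch of the main.py pattern being described (the loading call follows the style of hughperkins/pytorch's PyTorchHelpers.load_lua_class; the file, class and method names are just taken from the description above):

```python
import numpy as np
import PyTorchHelpers

# Load the Lua class described above (init/set/write); filename is illustrative.
TorchClass = PyTorchHelpers.load_lua_class('torch_class.lua', 'TorchClass')
obj = TorchClass()

# Case 1: keep the modified array in a variable -- works, because `a`
# keeps the buffer alive while the Lua side reads/stores it.
a = np.random.rand(4).astype(np.float32) + 1
obj.set(a)
obj.write()

# Case 2: modify the array inline in the call -- the temporary is freed
# right after set() returns, and the stored tensor ends up reading
# recycled memory, so write() prints wrong values.
obj.set(np.random.rand(4).astype(np.float32) + 1)
obj.write()
```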
Not sure if this is supposed to work this way; I just thought that, in case you are not aware, this might help you and others.
I attach the two mentioned pieces of code in a zip file.
code.zip