Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inline operations while calling a torch class #10

Open
josemarcosrf opened this issue May 18, 2016 · 37 comments
Open

Inline operations while calling a torch class #10

josemarcosrf opened this issue May 18, 2016 · 37 comments
Labels

Comments

@josemarcosrf
Copy link

Hi,
First of all thanks for pytorch, is an awesome work and really useful!

I just found that when doing inline operations (i.e: on a numpy array) and passing that to a method on Torch to lets say store it as a class value something goes wrong. I guess this has to do with memory reference while passing data from one side to the other.

I made a simple example code reproducing the issue.
I have a python main.py file that imports, creates and uses a Torch class.
The torch class has an 'init' method, which does nothing, a 'set' method which sets a tensor of the class to a given numpy array. And a write method that prints the mentioned tensor.

The main.py file only creates the class and set/prints the torch tensor inside the torch class. First by creating a numpy array and modifying and storing the value in a separate variable and after modifying the numpy array inline while passing to torch.

The problem is, when passing the numpy array created inline in the call, the values the Torch class stores are not right.
The way this doesn't happen is:

  1. Avoiding inline operations
  2. Explicitly cloning the tensor inside the torch class (that's why I think is a mem. reference issue)

Running the example will be way more clarifying that my explanation.

Not sure if this is supposed to work this way, I just though that in case you are not aware this might help you and others.

I attach the two mentioned pieces of code in a zip file.

code.zip

@hughperkins
Copy link
Owner

Ok. That's a pretty comprehensive bug-report. Thanks! :-)

question: which version of python are you using, ie 2.7, 3.4 or 3.5?

@hughperkins
Copy link
Owner

I guess python2.7,rigth? because of this line:

print (cl.v).asNumpyTensor()

(python 3 would be:

print (cl.v.asNumpyTensor())

)

@josemarcosrf
Copy link
Author

You are right, Python 2.7

@hughperkins
Copy link
Owner

So, using python 2, I get the following output:

<TorchClass>.__init initialization
.set()
<TorchClass>.write This is the value:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
[torch.FloatTensor of size 10]

.set()
<TorchClass>.write This is the value:
  2
  4
  6
  8
 10
 12
 14
 16
 18
 20
[torch.FloatTensor of size 10]

.set()
<TorchClass>.write This is the value:
  3
  6
  9
 12
 15
 18
 21
 24
 27
 30
[torch.FloatTensor of size 10]

cl.v 3 6 9 12 15 18 21 24 27 30
[torch.FloatTensor of size 10]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Leaving aside the output of cl.v for now, it looks like the outputs from the writes are correct?

@josemarcosrf
Copy link
Author

josemarcosrf commented May 18, 2016

Your output from 'write' is right, I get something looking like this:
` initialization
This is the value:
1
2
3
4
5
6
7
8
9
10
[torch.FloatTensor of size 10]

This is the value:
2
4
6
8
10
12
14
16
18
20
[torch.FloatTensor of size 10]

This is the value:
0.0000e+00
-1.5846e+29
4.3514e-27
-2.0005e+00
1.5000e+01
1.8000e+01
2.1000e+01
2.4000e+01
2.7000e+01
3.0000e+01
[torch.FloatTensor of size 10]

[ 0.00000000e+00 -1.58456325e+29 4.35138757e-27 -2.00048709e+00
1.50000038e+01 1.80000000e+01 2.10000000e+01 2.40000000e+01
2.70000000e+01 3.00000000e+01]`

As you can see, the values after setting the tensor from an inline operation are wrong, at least some of them.

@hughperkins
Copy link
Owner

Ok. Some difference in our systems or installation. Let's compare systems and so on first... mine is:

  • Ubuntu 16.04
  • stock 'python2.7'
  • installed into a virtual environment
  • numpy 1.11.0
  • pytorch as follows:
git log -n 5 --oneline
39bd2b1 update travis distro to 12.1
c8088bf updated readme with link to cifar.pytorch
7f7be66 Merge branch 'traceback'
31b0a0c update readme for stacktrace
65ffef8 improve stack trace a bit

You?

@hughperkins
Copy link
Owner

(hmmm.... I'm actually using a different version of torch too actually... let me double-check on a different torch distro...)

@hughperkins
Copy link
Owner

(using mainstream torch distro, same results as my earlier results above)

@josemarcosrf
Copy link
Author

Mine is quite different.

  • Mac OS X 10.10.5
  • python 2.7.10 (default, Jul 14 2015, 19:46:27) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)]
  • numpy 1.8.0rc1
  • pytorch:
    39bd2b1 update travis distro to 12.1
    c8088bf updated readme with link to cifar.pytorch
    7f7be66 Merge branch 'traceback'
    31b0a0c update readme for stacktrace
    65ffef8 improve stack trace a bit

@josemarcosrf
Copy link
Author

I'll try on an Ubuntu 14.04 in a while too

@hughperkins
Copy link
Owner

ok. biggest diference here is: "Mac OS X". Why? Because on Mac, pytorch uses lua, and on linux it uses luajit. I will test using lua, I bet I can reproduce the problem like that.

@josemarcosrf
Copy link
Author

josemarcosrf commented May 18, 2016

Ohh ok, I noticed that....
Thanks for your quick responses and effort!

I'll let you know if I can reproduce the problem in the ubuntu machine.

@hughperkins
Copy link
Owner

(the switch is here by the way, just in case you were curious: https://github.com/hughperkins/pytorch/blob/master/src/nnWrapper.jinja2.cpp#L56-L60

    #ifdef USE_LUAJIT
        #define LUALIBNAME "libluajit"
    #else
        #define LUALIBNAME "libPyTorchLua"
    #endif
``` )
Looks like I should rebuild to switch.  I'll do that.  (though tempting to change it to use an enviornment variable, at runtime to switch, perhaps???)

@hughperkins
Copy link
Owner

(on linux, can switch at build time like this:

USE_LUAJIT=OFF ./build.sh 

)

@josemarcosrf
Copy link
Author

Nice, I'll try both builds, one with luajit and the other with lua.

@hughperkins
Copy link
Owner

hmmm, on linux, with lua, results still ok.

After building with config set to use lua, I run like this:

python -i main.py

... which shows the correct results

... and then verify in proc that not using luaijt:

sudo bash
cd /proc/123
grep luajit maps
# nothing
grep liblua maps
# bunch of stuff

@hughperkins
Copy link
Owner

Your numpy is older than mine. Any reason for using an older numpy? I can downgrade my numpy if necessary. I have to go now. Will look again tomorrow...

@josemarcosrf
Copy link
Author

No reason for that numpy, I'll update mine and repeat the experiments.

@josemarcosrf
Copy link
Author

I tried on an Ubuntu Machine with numpy version 1.10 and works properly.
On Mac OS X with numpy 1.11 still going wrong.... :(

@hughperkins
Copy link
Owner

Hmmm. That makes debugging challenging :-P. Can you double-check that you are indeed linking with lua and not with luajit on Mac? For example, on mac, maybe build like:

USE_LUAJIT=OFF ./build.sh 

... and see if this changes anything?

@hughperkins
Copy link
Owner

(you can also remove libluajit.[something] from ~/torch/install/lib too, and check stuff still runs. If it doesnt, you're probalby linking to libluajit, and that has memory issues on Mac

If you're building a 64 bit application on OSX which links directly or indirectly against LuaJIT, you need to link your main executable with these flags:

-pagezero_size 10000 -image_base 100000000

Also, it's recommended to rebase all (self-compiled) shared libraries which are loaded at runtime on OSX/x64 (e.g. C extension modules for Lua). See: man rebase

)

@josemarcosrf
Copy link
Author

Building with
USE_LUAJIT=OFF ./build.sh
And running the example code again throws good results, no problems and getting the expecting output for all the calls to "write".
This means the problem is with luajit and the mem issues on Mac?

The pop operation from the Torch class is still not working as expected though.
print (cl.v).asNumpyTensor()
At the end of the python main.py file of the example code throws a list with only zeros.

@hughperkins
Copy link
Owner

This means the problem is with luajit and the mem issues on Mac?

Yes. It was never intended that luajit should be used on Mac. The bug is that the following line, in build.sh is not detecting you are using Mac:

if [[ $(uname -s) == 'Darwin' ]]; then { USE_LUAJIT=OFF; } fi

Can you type uname -s and report the result please? And then I can update this line.

At the end of the python main.py file of the example code throws a list with only zeros.

Yes, agreed. will look into this. Can you post the answer to uname -s, to keep this thread in my notifications list, so it doesn't fall into the abyss :-P

@josemarcosrf
Copy link
Author

josemarcosrf commented May 24, 2016

Sorry for the late response,
Surprisingly, the output of uname -s is Darwin ....
Not sure what could have happened...

@hughperkins
Copy link
Owner

Oh, I never fixed the (cl.v).asNumpyTensor() bit. Basically things fall off my notifications if I open them, and then they disappear into the abyss :-P Can you post something once every 24 hours, till I remember please? :-D

For the uname -s bit, can you save the following as /tmp/test.sh, and run it please:

#!/bin/bash

if [[ $(uname -s) == 'Darwin' ]]; then { echo Darwin detected; } else { echo no Darwin found; } fi

Run it like:

chmod +x /tmp/test.sh
/tmp/test.sh

@hughperkins
Copy link
Owner

well... for the memory thing, yeah it is a memory issue, but I'm not quite sure what a good solution would be. We can wipe the array to zeros by doing:

from __future__ import print_function

import numpy as np

import PyTorch
import PyTorchHelpers


TorchClass = PyTorchHelpers.load_lua_class('TorchClass.lua', 'TorchClass')
cl = TorchClass()

v = [1,2,3,4,5,6,7,8,9,10]
npv = np.asarray(v, dtype=np.float32)

cl.set(npv)
#cl.write()

npv2 = npv * 2
cl.set(npv2)
#cl.write()

cl.set(npv * 3)
#cl.write()

print('cl.v', cl.v)
print(np.zeros(10, dtype=np.float32))
print('cl.v', cl.v)

output:

<TorchClass>.__init initialization
.set()
<TorchClass>.write This is the value:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
[torch.FloatTensor of size 10]

.set()
.set()
cl.v 3 6 9 12 15 18 21 24 27 30
[torch.FloatTensor of size 10]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
cl.v 0 0 0 0 0 0 0 0 0 0
[torch.FloatTensor of size 10]

(env2) ubuntu@peach:~/Downloads/pytorch-10$ python main.py 
<TorchClass>.__init initialization
.set()
.set()
.set()
cl.v 3 6 9 12 15 18 21 24 27 30
[torch.FloatTensor of size 10]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
cl.v 0 0 0 0 0 0 0 0 0 0
[torch.FloatTensor of size 10

so, we basically allocate a new numpy array of zeros, and that wipes the existiing array.

probably the torch tensors should add a reference to the passed in numpy arrays somehow, so they dont get freed/reused.

@hughperkins
Copy link
Owner

I guess the np array should be stored in the python torch tensor object somewhere probably, to hold a reference. However I'm not quite sure how to do this for now...

What I've tried:

In PyTorch.jinja2.pyx _as{{Real}}Tensor(myarray):, try to store the passed in np array:

{% if Real in ['Float', 'Double', 'Byte'] %}
def _as{{Real}}Tensor(myarray):
    cdef {{real}}[:] myarraymv
    cdef Storage._{{Real}}Storage storage
    cdef _{{Real}}Tensor tensor
    if str(type(myarray)) in ["<type 'numpy.ndarray'>", "<class 'numpy.ndarray'>"]:
        dims = len(myarray.shape)
        if dims >= 1:
            totalSize = 1
            size = Storage._LongStorage.newWithSize(dims)
            stride = Storage._LongStorage.newWithSize(dims)
            strideSoFar = 1
            for d in range(dims - 1, -1, -1):
                totalSize *= myarray.shape[d]
                size[d] = myarray.shape[d]
                stride[d] = strideSoFar
                strideSoFar *= size[d]
            myarraymv = myarray.reshape(totalSize)
            storage = Storage._{{Real}}Storage.newWithData(myarraymv)
            Storage.TH{{Real}}Storage_retain(storage.native) # since newWithData takes ownership

            tensor = _{{Real}}Tensor.newWithStorage(storage, 0, size, stride)
            tensor.nparray = myarray
            return tensor

This complained that nparray is not in tensor, so I modified Tensor.jinja2.pxd, to add this, as an object:

cdef class _{{Real}}Tensor(object):
    cdef TH{{Real}}Tensor *native
    cdef object nparray

This then builds and runs (with a very agressive clean.sh:

#!/bin/bash

export TORCH_INSTALL=$(dirname $(dirname $(which luajit) 2>/dev/null) 2>/dev/null)

rm -Rf build PyBuild.so dist *.egg-info cbuild src/*.so src/PyTorch.egg-info ${TORCH_INSTALL}/lib/libPyTorch*
pip uninstall -y PyTorch
rm src/Storage.cpp src/Storage.pxd src/Storage.pyx src/PyTorch.pxd src/PyTorch.pyx src/PyTorch.cpp
rm src/nnWrapper.cpp src/nnWrapper.h src/lua.pxd src/lua.pyx

However, seems not to affect the lifetime: the above test code still prints zeros.

I tried also putting into the Tensor.jinja2.pyx:

            tensor = _{{Real}}Tensor.newWithStorage(storage, 0, size, stride)
            tensor.nparray = myarray
            __Pyx_INCREF(myarray)
            return tensor

... however this fails at runtime, with __Pyx__INCREF not being known.

Thoughts?

@hughperkins
Copy link
Owner

Can I leave you to look at what options we can use for incrementing the reference? I'm trying to get a paper on cltorch published out to arxiv (perhaps 'dumped' onto arxiv is a better term). And there is only one of me :-D

@josemarcosrf
Copy link
Author

After running the shell script I got Darwin detected, anyways, is not a big deal as building disabling luajit worked perfect ( USE_LUAJIT=OFF ./build.sh )

regarding the memory issue when pulling from torch to python, this is interesting actually, allocating a new numpy array wipes the tensor...
Sure, I'll take a look as soon as possible to the above mentioned snippets you wrote and let you know if I can get something.

Thanks once more for the help and time!
Good luck with that paper!

@hughperkins hughperkins self-assigned this Sep 21, 2016
@hughperkins
Copy link
Owner

Note to self: so that I have the code if I get a moment to check this issue, it is here: https://gist.github.com/hughperkins/fdf8e27daec983e69767ef6bbaf1db9f

Current output:

<TorchClass> initialization

before allocating a simple array of zeros: cl.v
 0 3 6 9 12 15 18 21 24 27
[torch.FloatTensor of size 10]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

after: cl.v
 0 0 0 0 0 0 0 0 0 0
[torch.FloatTensor of size 10]

@hughperkins
Copy link
Owner

Note to self: relevant similar issue, and commit/solution:

@hughperkins
Copy link
Owner

Note to self: the draft fix from earlier, that doesnt work... : https://github.com/hughperkins/pytorch/compare/trying-fix-inline-issue?expand=1

@hughperkins
Copy link
Owner

Unit test (failing) added in c76846d

@hughperkins
Copy link
Owner

Added a bunch of debugging in ea08d23

Output: https://gist.github.com/hughperkins/98243072eafe24b8b9f919a306fe9d07

@hughperkins
Copy link
Owner

Seems like the numpy array inside the PyTorchTensor is having its lifetime linked to that of the parent PyTorchTensor correctly, but the PyTorchTensor itself is being destroyed, ie, not being incref'd based on it being owned by the underlying lua object now.

@hughperkins
Copy link
Owner

hughperkins commented Sep 21, 2016

So, it looks like:

  • in PyTorchAug.py, when we push a numpy tensor, we first wrap it in a PyTorchTensor, then call pushObject:

    pytorch/src/PyTorchAug.py

    Lines 113 to 124 in 01a8759

    if typestring in ["<class 'numpy.ndarray'>", "<type 'numpy.ndarray'>"]:
    dtypestr = str(something.dtype)
    if dtypestr == 'float32':
    pushSomething(lua, PyTorch._asFloatTensor(something))
    return
    if dtypestr == 'float64':
    pushSomething(lua, PyTorch._asDoubleTensor(something))
    return
    if dtypestr == 'uint8':
    pushSomething(lua, PyTorch._asByteTensor(something))
    return
    raise Exception('pushing numpy array with elements of type ' + dtypestr + ' it not currently implemented')
    if typestring in ["<class 'numpy.ndarray'>", "<type 'numpy.ndarray'>"]:
        dtypestr = str(something.dtype)
        if dtypestr == 'float32':
            pushSomething(lua, PyTorch._asFloatTensor(something))
            return
        if dtypestr == 'float64':
            pushSomething(lua, PyTorch._asDoubleTensor(something))
            return
        if dtypestr == 'uint8':
            pushSomething(lua, PyTorch._asByteTensor(something))
            return
raise Exception('pushing numpy array with elements of type ' + dtypestr + ' it not currently implemented')

(via

    if type(something) in luaClassesReverse:
        pushObject(lua, something)
        return

)

  • pushObject then simply registers a python-side integer in the lua registry:

    pytorch/src/PyTorchAug.py

    Lines 113 to 124 in 01a8759

    if typestring in ["<class 'numpy.ndarray'>", "<type 'numpy.ndarray'>"]:
    dtypestr = str(something.dtype)
    if dtypestr == 'float32':
    pushSomething(lua, PyTorch._asFloatTensor(something))
    return
    if dtypestr == 'float64':
    pushSomething(lua, PyTorch._asDoubleTensor(something))
    return
    if dtypestr == 'uint8':
    pushSomething(lua, PyTorch._asByteTensor(something))
    return
    raise Exception('pushing numpy array with elements of type ' + dtypestr + ' it not currently implemented')
def pushObject(lua, myobject):
    lua.pushNumber(myobject.__objectId)
    lua.getRegistry()

Obviously simply storing an integer in the lua registry wont in itself lock the lifetime of the PyTorchTensor python object to any lua objects...

@hughperkins
Copy link
Owner

This is kind of non-trivial to fix (need to start thinking about gc and stuff probably, if start holding a bunch of python-side references to python objects that have been passed lua-side). So, since most stuff works without it for now, I think I shall leave it for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants