Training takes a long time #21

Open
mjohn123 opened this issue Apr 4, 2016 · 12 comments

Comments

@mjohn123

mjohn123 commented Apr 4, 2016

Hello all, I am running fcnTrain on my PC (Core i7, 16 GB RAM, GPU). However, each epoch takes a very long time to run (0.5 to 1 day). If anyone has run into this issue, could you suggest a solution? Thanks, all.

@HLinn

HLinn commented Apr 7, 2016

I have run into the same problem. Did you find a solution later? @mjohn123

@mjohn123
Author

mjohn123 commented Apr 7, 2016

I am still looking for a solution. I have not solved it yet.

@brisker

brisker commented Apr 12, 2016

Is your GPU frequency very low, like 1 Hz, or normal? My GPU properties:
CUDADevice with properties:

                  Name: 'GeForce GTX TITAN X'
                 Index: 1
     ComputeCapability: '5.2'
        SupportsDouble: 1
         DriverVersion: 7.5000
        ToolkitVersion: 6.5000
    MaxThreadsPerBlock: 1024
      MaxShmemPerBlock: 49152
    MaxThreadBlockSize: [1024 1024 64]
           MaxGridSize: [2.1475e+09 65535 65535]
             SIMDWidth: 32
           TotalMemory: 1.2885e+10
       AvailableMemory: 1.2609e+10
   MultiprocessorCount: 24
          ClockRateKHz: 1076000
           ComputeMode: 'Default'
  GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
      CanMapHostMemory: 1
       DeviceSupported: 1
        DeviceSelected: 1

train: epoch 01: 1/565: 0.9 Hz accuracy: 0.677 0.048 0.032 objective: 3.044
train: epoch 01: 2/565: 1.0 Hz accuracy: 0.691 0.048 0.033 objective: 3.035
train: epoch 01: 3/565: 1.1 Hz accuracy: 0.698 0.048 0.033 objective: 2.994

@mjohn123
Author

Hi brisker. This is my GPU. Note that I used only CUDA (not cuDNN) for my experiments. I ran MatConvNet in CUDA mode successfully.

 CUDADevice with properties:

                      Name: 'GeForce GTX 750 Ti'
                     Index: 1
         ComputeCapability: '5.0'
            SupportsDouble: 1
             DriverVersion: 7.5000
            ToolkitVersion: 7
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 2.1475e+09
           AvailableMemory: 1.6844e+09
       MultiprocessorCount: 5
              ClockRateKHz: 1202000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1
train: epoch 30: 259/565: 0.3 Hz accuracy: 0.905 0.776 0.681 objective: 0.258
train: epoch 30: 260/565: 0.3 Hz accuracy: 0.905 0.776 0.682 objective: 0.258
train: epoch 30: 261/565: 0.3 Hz accuracy: 0.905 0.776 0.682 objective: 0.258
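For a rough sense of what these rates mean per epoch, here is a back-of-the-envelope sketch. It assumes that "Hz" in the log means images processed per second and that each of the 565 batches holds 20 images; the batch size is an assumption, since it is not stated anywhere in this thread.

```matlab
% Rough epoch-time estimate from the rates printed in the training logs.
numBatches = 565;        % from the "259/565" counter in the log
batchSize  = 20;         % ASSUMPTION: images per batch, not stated in this thread
ratesHz    = [0.3 1.0];  % rates seen in the two logs; assumed to be images per second

% Total images divided by images per second gives seconds; convert to hours.
hoursPerEpoch = numBatches * batchSize ./ ratesHz / 3600;

% fprintf walks the matrix column by column, so each line pairs a rate with its time.
fprintf('%.1f Hz -> %.1f hours per epoch\n', [ratesHz; hoursPerEpoch]);
```

Under those assumptions, 0.3 Hz works out to roughly 10.5 hours per epoch, which is in line with the "0.5 day" reported at the top of the thread.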

@brisker

brisker commented Apr 12, 2016

What do you think might be causing this? Do you have any ideas? What is your MATLAB version?

@mjohn123
Author

Hello, I am using MATLAB 8.6.0.267246 (R2015b).

@jingyanw

I was able to get ~0.5 Hz with a CPU and >5 Hz with a GPU (GeForce GTX TITAN X).
Could you check if you've passed the gpus field to cnn_train_dag correctly? For example, you can check your GPU memory usage during training.
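One way to check GPU memory usage from MATLAB is sketched below. It assumes the Parallel Computing Toolbox is installed and uses only gpuDeviceCount and gpuDevice; the variable names are illustrative. Since training blocks the MATLAB prompt, this would have to be run at a breakpoint or from a second MATLAB session (which itself takes up a little GPU memory).

```matlab
% Report how much GPU memory is currently in use on the selected device.
usedGB = NaN;                        % stays NaN when no CUDA device is visible
if gpuDeviceCount > 0
    dev = gpuDevice();               % currently selected device (selects the default one if none is selected)
    usedGB = (dev.TotalMemory - dev.AvailableMemory) / 1e9;
    fprintf('%s: %.2f GB in use of %.2f GB\n', ...
            dev.Name, usedGB, dev.TotalMemory / 1e9);
else
    disp('No supported GPU detected; training would run on the CPU.');
end
```

If the number in use barely changes while training is running, the network is most likely being evaluated on the CPU.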

@mjohn123
Author

mjohn123 commented Apr 15, 2016

Thanks, jingyanw. You are right. My GPU was not used during training (see the attached image). That is why it takes so long (it runs on the CPU only). How can I enable the GPU for fcnTrain?

I am using Windows 10 with MATLAB R2015b. I compiled MatConvNet with the command vl_compilenn('enableGpu', true) and ran the tests with vl_testnn('gpu', true); they reported no errors, as shown in the figure. In fcnTrain.m, I used opts.train.gpus = [] ; Is that correct? My current code is:

% Setup data fetching options
bopts.useGpu = numel(opts.train.gpus) > 0 ;

% Launch SGD
info = cnn_train_dag(net, imdb, getBatchWrapper(bopts), ...
                     opts.train, ...
                     'train', train, ...
                     'val', val, ...
                     opts.train) ;

(attached screenshot: capture)

If I set opts.train.gpus = 1; I get the following error:

Error using vl_nnconv
Out of memory on device. To view more detail about available memory on the GPU, use 'gpuDevice()'. If the problem persists, reset the GPU
by calling 'gpuDevice(1)'.

Error in dagnn.Conv/backward (line 20)
      [derInputs{1}, derParams{1}, derParams{2}] = vl_nnconv(...

Error in dagnn.Layer/backwardAdvanced (line 118)
      [derInputs, derParams] = obj.backward ...

Error in dagnn.DagNN/eval (line 107)
  obj.layers(l).block.backwardAdvanced(obj.layers(l)) ;

Error in cnn_train_dag>process_epoch (line 194)
      net.eval(inputs, opts.derOutputs) ;

Error in cnn_train_dag (line 89)
    stats.train(epoch) = process_epoch(net, state, opts, 'train') ;

Error in fcnTrain (line 98)
info = cnn_train_dag(net, imdb, getBatchWrapper(bopts), ...

@jingyanw

The error indicates that you do not have enough GPU memory. Running FCN takes ~5GB GPU memory on my machine.
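The comparison can be made explicit with a small sketch. The numbers are taken from this thread (the ~5 GB estimate above and the AvailableMemory value from the GTX 750 Ti listing); the variable names are illustrative and the 5 GB figure is a rough, machine-dependent estimate, not a hard requirement.

```matlab
% Decide whether GPU training is plausible given the free device memory.
requiredBytes  = 5e9;        % ~5 GB, the rough figure quoted above (machine-dependent)
availableBytes = 1.6844e9;   % AvailableMemory from the GTX 750 Ti listing earlier in the thread
% On a live system one could read the value instead:
%   dev = gpuDevice(); availableBytes = dev.AvailableMemory;

if availableBytes >= requiredBytes
    gpus = 1;    % index of the GPU to train on
else
    gpus = [];   % an empty list means CPU-only training
end

fprintf('Free: %.2f GB, estimated need: %.0f GB\n', ...
        availableBytes / 1e9, requiredBytes / 1e9);
if isempty(gpus)
    disp('Not enough GPU memory for this estimate; falling back to the CPU.');
end
```

The resulting value would then go into opts.train.gpus before the call to cnn_train_dag.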

@mjohn123
Author

I see. My GPU has only 2 GB. I also have a GPU integrated into the CPU chip, but I don't think it is useful here. Thank you for your help. I think I can only use the CPU.

@brisker

brisker commented Apr 18, 2016

@jingyanw Hello, I also get around 5.5 Hz with a Titan X. But why do classification examples like MNIST run at more than 1000 Hz, while fcnTrain is so slow?

@daofeng2007

@brisker I also get around 5.5 Hz using a Pascal Titan X. My problem is that after it finished epoch 01 and started epoch 02, the following error appeared:

train: epoch 02: 1/ 56:Error using gpuArray
The GPU failed to allocate memory. To continue, reset the GPU by running 'gpuDevice(1)'. If this problem persists,
partition your computations into smaller pieces.

Error in getBatch (line 91)
ims = gpuArray(ims) ;

Error in fcnTrain>@(imdb,batch)getBatch(imdb,batch,opts,'prefetch',nargout==0) (line 108)
fn = @(imdb,batch) getBatch(imdb,batch,opts,'prefetch',nargout==0) ;

Error in cnn_train_dag>process_epoch (line 197)
inputs = state.getBatch(state.imdb, batch) ;

Error in cnn_train_dag (line 83)
[stats.train(epoch),prof] = process_epoch(net, state, opts, 'train') ;

Error in fcnTrain (line 99)
info = cnn_train_dag(net, imdb, getBatchWrapper(bopts), ...

@HosnaCSE

HosnaCSE commented Jan 3, 2017

Hello,

I have 12 GB of RAM on the GPU (Titan) and 8 GB of RAM on the CPU. My net fails with the error "Out of memory on device", although it requires only 4 GB. Has anyone had a similar experience? Any suggestions?
(I can run it with 128x128 images, which require 2 GB, but I need larger images such as 256x256 for semantic segmentation.)

Thanks.
Hosna
