Out of Memory after 1 epoch using DenseNet-BC 100 layers (growth_rate=12) #7

ZhenyF · 2018-06-07T04:32:15Z

Hi,
I tried the DenseNet you recommended with growth_rate=12, depth=100, and batch_size=128 on two GTX 1080 Ti GPUs.
The model seems to stop after one epoch.
Could you please help me with this?

felixgwu · 2018-06-07T06:18:49Z

Hi @ZhenyF,
The suggested batch_size is 64 (the same as in the DenseNet paper). It should use about 2.7 GB.
Here is the suggested command:

python3 main.py --arch densenet --depth 100 --growth-rate 12 --bn-size 4 --compression 0.5 --data cifar10+ --epochs 300 --save save/cifar10+-densenet-bc-100

I also tried batch_size 128, which used about 5.0 GB.
I believe it should be able to fit into a GTX1080ti.
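
As a rough sanity check on these two figures, the sketch below fits a straight line through the two measurements quoted above (about 2.7 GB at batch size 64 and about 5.0 GB at batch size 128). The linear model and the helper names `fit_linear_memory` and `estimate_memory_gb` are assumptions for illustration only; real usage also depends on the allocator, the cuDNN workspace, and how the batch is split across GPUs.

```python
def fit_linear_memory(batch_a, mem_a_gb, batch_b, mem_b_gb):
    """Fit mem = fixed + per_sample * batch through two measurements."""
    per_sample = (mem_b_gb - mem_a_gb) / (batch_b - batch_a)
    fixed = mem_a_gb - per_sample * batch_a
    return fixed, per_sample


def estimate_memory_gb(batch_size, fixed, per_sample):
    """Estimate memory for a batch size under the (assumed) linear model."""
    return fixed + per_sample * batch_size


# The two approximate figures quoted in this thread.
fixed_gb, per_sample_gb = fit_linear_memory(64, 2.7, 128, 5.0)

if __name__ == "__main__":
    for batch_size in (32, 64, 128, 256):
        est = estimate_memory_gb(batch_size, fixed_gb, per_sample_gb)
        print(f"batch_size={batch_size:4d}: ~{est:.1f} GB (linear estimate)")
```

Under this assumption the fixed cost is about 0.4 GB and each additional sample costs about 36 MB, so both batch sizes stay well below the 11 GB of a GTX 1080 Ti.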

If it still doesn't work, you may try this memory-efficient implementation by my friend Geoff.

ZhenyF · 2018-06-07T09:20:39Z

Many thanks for the reply, @felixgwu!
Just out of curiosity: if it stopped because of the larger batch size, why could it still train for a full epoch? I checked the memory of my two GPUs and found that only 67% was occupied during the first epoch of training. (I also tried the largest DenseNet-BC, with growth_rate=40, depth=190, and batch_size=64, and it stopped at the very beginning.)
Another question: I tried the recommended memory-efficient implementation.
When I set efficient to True (memory-efficient mode), it outputs the following and never starts training, but when I set it to False it runs as usual:

`(pytorch) D:\GA\PYTorch\img_classification_pk_pytorch-master>python main.py --data cifar10+ --depth 100 --save save/cifar10+-densenetBC12_100 --arch densenet_eff
WARNING: you don't have tesnorboard_logger installed
=> creating model 'densenet_eff'
Create DenseNet-BC100 for cifar10+
loading cifar10+
{'augmentation': True, 'num_classes': 10}
with data augmentation
Files already downloaded and verified
create folder: save/cifar10+-densenetBC12_100
args:
Namespace(alpha=0.99, arch='densenet_eff', batch_size=128, beta1=0.9, beta2=0.999, bn_size=4, compression=0.5, config_of_data={'augmentation': True, 'num_classes': 10}, data='cifar10+', data_root='Z:\Datasets\CIFAR_10_dataset', death_mode='none', death_rate=0.5, decay_rate=0.1, depth=100, drop_rate=0.0, epochs=300, evaluate='', force=False, growth_rate=12, lr=0.1, momentum=0.9, nesterov=False, normalized=False, num_classes=10, num_workers=4, optimizer='sgd', patience=0, print_freq=100, resume='', save='save/cifar10+-densenetBC12_100', seed=0, start_epoch=1, tensorboard=False, trainer='train', use_validset=True, weight_decay=0.0001)

# of params: 769162

Epoch 1 lr = 1.000000e-01
D:\GA\PYTorch\img_classification_pk_pytorch-master\train.py:47: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
losses.update(loss.data[0], input.size(0))
D:\PYTorch\img_classification_pk_pytorch-master\train.py:48: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
top1.update(err1[0], input.size(0))
D:\PYTorch\img_classification_pk_pytorch-master\train.py:49: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
top5.update(err5[0], input.size(0))
D:\Anaconda3\envs\pytorch\lib\site-packages\torch\cuda\nccl.py:24: UserWarning: PyTorch is not compiled with NCCL support
warnings.warn('PyTorch is not compiled with NCCL support')`
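
The three UserWarnings in the log come from indexing a 0-dim tensor (`loss.data[0]`, `err1[0]`, `err5[0]`), and the warning text itself suggests `tensor.item()`. They are warnings rather than errors, so they are unlikely to be what keeps training from starting, but they can be silenced. A minimal sketch of a compatibility helper follows; `to_python_number` is a hypothetical name, not a function in this repository, and `FakeScalarTensor` only stands in for a tensor so the example runs without PyTorch.

```python
def to_python_number(value):
    """Convert a 0-dim tensor-like object to a plain Python number.

    Objects exposing a callable .item() (as the warning recommends for
    tensors) are converted with it; plain numbers pass through unchanged.
    """
    item = getattr(value, "item", None)
    if callable(item):
        return item()
    return value


class FakeScalarTensor:
    """Stand-in for a 0-dim tensor (illustration only, not a PyTorch class)."""

    def __init__(self, number):
        self._number = number

    def item(self):
        return self._number


loss = FakeScalarTensor(2.5)
print(to_python_number(loss))   # converted through .item()
print(to_python_number(0.75))   # already a Python float, returned as-is
```

In train.py the call would then look something like `losses.update(to_python_number(loss), input.size(0))`, assuming `loss` is a 0-dim tensor on a PyTorch version that provides `.item()`.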

taineleau-zz · 2018-06-13T18:22:48Z

Hi @ZhenyF,

It seems that you're using the Windows version of PyTorch. Could this be a bug in the Windows version?

ZhenyF · 2018-06-15T00:07:54Z

Hi @taineleau,
I am not sure whether it is caused by the difference between operating systems. Another problem is that I cannot reach a similar accuracy using DenseNet-40. I can only get a 6.0% error rate (minimum 5.7%), versus 5.44% with TensorFlow. Could this be caused by PyTorch, or is it an error in my implementation?

taineleau-zz · 2018-06-23T13:37:46Z

Hi @ZhenyF,
Did you notice that we hold out a portion of training data as validation set?
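
For context, the argument dump above shows `use_validset=True`. Holding out validation data means the model trains on fewer images, which can cost some accuracy compared with training on the full training set. A minimal sketch of such a split follows; the function name, the 5,000-image hold-out size, and the shuffling scheme are assumptions for illustration and are not taken from this repository.

```python
import random


def split_train_valid(num_samples, num_valid, seed=0):
    """Shuffle sample indices and hold out num_valid of them for validation."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    indices = list(range(num_samples))
    rng.shuffle(indices)
    valid_idx = indices[:num_valid]
    train_idx = indices[num_valid:]
    return train_idx, valid_idx


# CIFAR-10 has 50,000 training images; the hold-out size here is assumed.
train_idx, valid_idx = split_train_valid(50000, 5000)
print(len(train_idx), len(valid_idx))
```

To compare against a result obtained by training on all 50,000 images, the hold-out would have to be disabled or the validation images folded back into training.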
