Out of Memory after 1 epoch using DenseNet-BC 100 layers (growth_rate=12) #7

Open
ZhenyF opened this issue Jun 7, 2018 · 5 comments

@ZhenyF

ZhenyF commented Jun 7, 2018

Hi,
I tried the DenseNet you recommended and set growth_rate=12, depth=100, and batch_size=128 on two GTX 1080 Ti GPUs.
It seems that the model stops after one epoch.
Could you please help me with this?

@felixgwu
Owner

felixgwu commented Jun 7, 2018

Hi @ZhenyF,
The suggested batch_size is 64 (the same as in the DenseNet paper). It should use about 2.7 GB.
Here is the suggested command:

python3 main.py --arch densenet --depth 100 --growth-rate 12 --bn-size 4 --compression 0.5 --data cifar10+ --epochs 300 --save save/cifar10+-densenet-bc-100

I also tried batch_size 128, which used about 5.0 GB.
I believe it should be able to fit into a GTX1080ti.
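
As a rough sanity check, the two measurements above can be extrapolated to other batch sizes. This is only a sketch: it assumes memory grows linearly with batch size (a fixed cost plus a per-sample cost), which may not hold exactly in practice, and the function names are made up for illustration.

```python
# Rough linear extrapolation from the two measurements quoted above.
# Assumption: memory = fixed cost + per-sample cost * batch size.
MEASURED_GB = {64: 2.7, 128: 5.0}  # batch size -> reported GPU memory in GB


def fit_linear(points):
    """Return (fixed_gb, per_sample_gb) for a line through two measurements."""
    (b1, m1), (b2, m2) = sorted(points.items())
    per_sample_gb = (m2 - m1) / (b2 - b1)
    fixed_gb = m1 - per_sample_gb * b1
    return fixed_gb, per_sample_gb


def estimate_gb(batch_size, points=MEASURED_GB):
    """Estimate GPU memory in GB for a batch size under the linear assumption."""
    fixed_gb, per_sample_gb = fit_linear(points)
    return fixed_gb + per_sample_gb * batch_size


if __name__ == "__main__":
    for batch_size in (64, 128, 256):
        print(f"batch_size={batch_size}: ~{estimate_gb(batch_size):.1f} GB")
```

Under this assumption, both reported batch sizes stay well below the 11 GB available on a single GTX 1080 Ti.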

If it still doesn't work, you may try this memory-efficient implementation by my friend Geoff.

@ZhenyF
Author

ZhenyF commented Jun 7, 2018

Many thanks for the reply, @felixgwu!
Just out of curiosity: if it stopped because of the larger batch size, why could it still train for one epoch? I checked the memory of my two GPUs and found that only 67% was occupied during the first epoch of training. (I also tried the largest DenseNet-BC, with growth rate 40, depth 190, and batch size 64, and it stopped at the very beginning.)
Another question: I tried the recommended memory-efficient implementation.
When I set efficient to True (memory-efficient mode), it outputs the following and never starts training, but when I set it to False it runs as usual:

`(pytorch) D:\GA\PYTorch\img_classification_pk_pytorch-master>python main.py --data cifar10+ --depth 100 --save save/cifar10+-densenetBC12_100 --arch densenet_eff
WARNING: you don't have tesnorboard_logger installed
=> creating model 'densenet_eff'
Create DenseNet-BC100 for cifar10+
loading cifar10+
{'augmentation': True, 'num_classes': 10}
with data augmentation
Files already downloaded and verified
create folder: save/cifar10+-densenetBC12_100
args:
Namespace(alpha=0.99, arch='densenet_eff', batch_size=128, beta1=0.9, beta2=0.999, bn_size=4, compression=0.5, config_of_data={'augmentation': True, 'num_classes': 10}, data='cifar10+', data_root='Z:\Datasets\CIFAR_10_dataset', death_mode='none', death_rate=0.5, decay_rate=0.1, depth=100, drop_rate=0.0, epochs=300, evaluate='', force=False, growth_rate=12, lr=0.1, momentum=0.9, nesterov=False, normalized=False, num_classes=10, num_workers=4, optimizer='sgd', patience=0, print_freq=100, resume='', save='save/cifar10+-densenetBC12_100', seed=0, start_epoch=1, tensorboard=False, trainer='train', use_validset=True, weight_decay=0.0001)

# of params: 769162

Epoch 1 lr = 1.000000e-01
D:\GA\PYTorch\img_classification_pk_pytorch-master\train.py:47: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
losses.update(loss.data[0], input.size(0))
D:\PYTorch\img_classification_pk_pytorch-master\train.py:48: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
top1.update(err1[0], input.size(0))
D:\PYTorch\img_classification_pk_pytorch-master\train.py:49: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
top5.update(err5[0], input.size(0))
D:\Anaconda3\envs\pytorch\lib\site-packages\torch\cuda\nccl.py:24: UserWarning: PyTorch is not compiled with NCCL support
warnings.warn('PyTorch is not compiled with NCCL support')`
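
The UserWarning lines in the log above point at `loss.data[0]`, `err1[0]`, and `err5[0]` in train.py and suggest `tensor.item()` instead. They are only warnings, so they are probably not the reason training never starts. Below is a minimal sketch of the suggested conversion. The helper `to_number` and the stand-in class `FakeScalarTensor` are hypothetical, used only so that the snippet runs without PyTorch installed.

```python
def to_number(value):
    """Convert a 0-dim tensor-like value to a plain Python number.

    Uses .item() when the object provides it (as PyTorch 0-dim tensors do),
    otherwise returns the value unchanged.
    """
    item = getattr(value, "item", None)
    return item() if callable(item) else value


class FakeScalarTensor:
    """Stand-in for a 0-dim tensor; hypothetical, only for this demo."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


if __name__ == "__main__":
    loss = FakeScalarTensor(2.5)
    print(to_number(loss))  # plain Python number instead of loss.data[0]
    print(to_number(0.5))   # already a Python number, returned as-is
```

Following the warning's suggestion, a line such as `losses.update(loss.data[0], input.size(0))` would become `losses.update(loss.item(), input.size(0))`.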

@taineleau-zz
Collaborator

Hi @ZhenyF,

It seems that you're using the Windows version of PyTorch. Could this be a bug in the Windows version?

@ZhenyF
Author

ZhenyF commented Jun 15, 2018

Hi @taineleau
I am not sure whether it is caused by the difference between operating systems. Another problem is that I cannot reach even a similar error rate using DenseNet-40. I can only get 6.0% (minimum 5.7%), compared with 5.44% on TensorFlow. Could this be caused by PyTorch, or is it an error in my implementation?

@taineleau-zz
Collaborator

Hi @ZhenyF,
Did you notice that we hold out a portion of the training data as a validation set?
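
To illustrate: with `use_validset=True` (as shown in the args above), the model trains on fewer images than the full training set, which may account for part of the gap to results obtained with all 50,000 CIFAR-10 training images. Here is a minimal sketch of such a hold-out split. The function name and the hold-out size of 5,000 are assumptions for illustration, not taken from this repository.

```python
import random


def split_train_valid(num_samples, num_valid, seed=0):
    """Shuffle indices 0..num_samples-1 and hold out num_valid of them.

    Returns (train_indices, valid_indices); the two lists are disjoint.
    """
    if not 0 <= num_valid <= num_samples:
        raise ValueError("num_valid must be between 0 and num_samples")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    return indices[num_valid:], indices[:num_valid]


if __name__ == "__main__":
    # CIFAR-10 has 50,000 training images; the hold-out size of 5,000 is an
    # assumption for illustration, not necessarily what this repo uses.
    train_idx, valid_idx = split_train_valid(50_000, 5_000)
    print(len(train_idx), len(valid_idx))
```

When comparing against numbers from another framework, it is worth checking that both runs train on the same number of images.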
