This readme describes how to run CNN models such as ResNet for image recognition training on the IPU.
Deep CNN residual learning models such as ResNet are used for image recognition and classification. The training examples given below use a ResNet-50 model implemented in TensorFlow, optimised for Graphcore's IPU.
The ResNet model provided has been written to best utilize Graphcore's IPU processors. As the whole of the model is always in memory on the IPU, smaller batch sizes become more efficient than on other hardware. The IPU's built-in stochastic rounding support improves accuracy when using half-precision which allows greater throughput. The model uses loss scaling to maintain accuracy at low precision, and has techniques such as cosine learning rates and label smoothing to improve final verification accuracy. Both model and data parallelism can be used to scale training efficiently over many IPUs.
- Prepare the TensorFlow environment. Install the poplar-sdk following the README provided. Make sure to run the enable.sh scripts and activate a Python virtualenv with gc_tensorflow installed (see the sketch after this list).
- Download the data. See below for details on obtaining the datasets.
- Run the training program. For example:
python3 train.py --dataset imagenet --data-dir path-to/imagenet
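The environment setup in the first step might look something like the following sketch. The SDK directory layout, wheel filename and virtualenv name shown here are placeholders rather than paths from this repository; follow the instructions in the SDK README for the exact files.
source path-to/poplar-sdk/poplar/enable.sh          # run each enable.sh script provided by the SDK
virtualenv -p python3 venv_gc_tensorflow            # placeholder virtualenv name
source venv_gc_tensorflow/bin/activate
pip install path-to/poplar-sdk/gc_tensorflow*.whl   # placeholder wheel path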
The dataset used for the ResNet-50 training examples is the ImageNet LSVRC 2012 dataset. This contains about 1.28 million images in 1000 classes. It can be downloaded from http://image-net.org/download and is approximately 150GB for the training and validation sets.
The CIFAR-10 dataset can be downloaded from https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz and the CIFAR-100 dataset from https://www.cs.toronto.edu/~kriz/cifar-100-binary.tar.gz.
| File / Directory | Description |
| ---------------- | ----------- |
| train.py | The main training program |
| validation.py | Contains the validation code used in train.py but can also be run to perform validation on previously generated checkpoints. The options should be set to the same values as those used for training, with the --restore-path option pointing to the log directory of the previously generated checkpoints |
| restore.py | Used for restoring a training run from a previously saved checkpoint. For example: python restore.py --restore-path logs/RN20_bs32_BN_16.16_v0.9.115_3S9/ |
| ipu_utils.py | IPU-specific utilities |
| log.py | Module containing functions for logging results |
| Datasets/ | Code for using different datasets. Currently CIFAR-10, CIFAR-100 and ImageNet are supported |
| Models/ | Code for the neural networks: resnet.py is a ResNet description based on code from Graphcore's Customer Engineering team, well optimised for the IPU; resnext.py defines the ResNeXt model; squeezenet.py defines the SqueezeNet model; efficientnet.py defines the EfficientNet models |
| LR_Schedules/ | Different learning rate schedules: stepped.py is a stepped learning rate schedule with optional warmup; cosine.py is a cosine learning rate schedule with optional warmup |
| test/ | Test files, run using pytest |
The default values for each of the supported datasets should give good results. See below for details on how best to use the --data-dir and --dataset options.
If you set a DATA_DIR environment variable (export DATA_DIR=path/to/data) then that will be used by default if --data-dir is omitted.
Note also that the --synthetic-data flag can be set, which will transfer randomised data instead of real data.
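For example, DATA_DIR can replace an explicit --data-dir (the path below is a placeholder):
export DATA_DIR=path-to/imagenet
python3 train.py --dataset imagenet    # equivalent to passing --data-dir path-to/imagenet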
This is a larger network and may need to be split across two or more IPUs; it is recommended to use pipelining to increase throughput. For example, it will fit on two IPUs using the following options:
python train.py --dataset imagenet --data-dir .../imagenet-data --model-size 50 \
--batch-size 2 --xla-recompute --no-validation --shards 2 --pipeline-depth 256 --pipeline-splits b3/1/relu \
--available-memory-proportion 0.1
The model will fit on a single IPU with a batch size of 1.
python train.py --dataset imagenet --data-dir .../imagenet-data --model-size 50 --batch-size 1 --available-memory-proportion 0.1
Training over many IPUs can utilise both model and data parallelism. An example using 8 IPUs with four replicas of a two-stage pipeline model is:
python train.py --dataset imagenet --data-dir .../imagenet-data --model-size 50 --batch-size 2 --shards 2 \
--pipeline-depth 128 --pipeline-splits b3/0/relu --pipeline-schedule Grouped \
--available-memory-proportion 0.4 --xla-recompute --replicas 4
There are a number of options that will help achieve a higher accuracy, many of which are detailed below.
Note that some combinations of options might not work. For example, it will not be possible to add replication for a model that is a tight fit on a single IPU because replication introduces additional control code and buffers for copying updated gradients between replicas.
It is recommended to split the model across 2 or 4 IPUs for the best throughput. For example, this will achieve good B0 model convergence pipelined over 4 IPUs, with the RMSprop optimiser and 32-bit precision for weight updates:
python train.py --model efficientnet --model-size B0 --data-dir .../imagenet-data --shards 4 \
--batch-size 4 --pipeline-depth 128 --pipeline-splits block2b block4a block6a --xla-recompute \
--pipeline-schedule Grouped --optimiser RMSprop --precision 16.32
Note that larger models (B1, B2 etc.) will automatically increase the image size used. This can be overridden with the --image-size option if needed.
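As an illustration only (the resolution value here is arbitrary, not from this repository, and this assumes the option takes a single pixel dimension), the automatic size for a B1 model could be overridden like this:
python train.py --model efficientnet --model-size B1 --image-size 224 --data-dir path-to/imagenet [...]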
Changing the dimension of the group convolutions can make the model more efficient on the IPU. To keep the number of parameters approximately the same, the expansion ratio can be reduced. For example, a modified EfficientNet-B0 model with a similar number of trainable parameters can be trained using:
python train.py --model efficientnet --model-size B0 --data-dir .../imagenet-data --shards 4 \
--batch-size 8 --pipeline-depth 64 --pipeline-splits block2a/SE block4a block5c --xla-recompute \
--pipeline-schedule Grouped --optimiser RMSprop --precision 16.32 --group-dim 16 --expand-ratio 4 --groups 4
This should give a similar validation accuracy as the standard model but with improved training throughput.
The training harness can also be used to demonstrate inference performance using the validation.py script. For example, to check inference for EfficientNet use:
python validation.py --model efficientnet --model-size B0 --dataset imagenet --batch-size 8 \
--synthetic-data --repeat 10 --batch-norm
--model: By default this is set to resnet, and this document is focused on ResNet training, but other examples such as efficientnet, squeezenet and resnext are available in the Models/ directory. Consult the source code for these models to find the corresponding default options.
Training can also be distributed over multiple machines ("workers"). To do this, pass the argument --distributed and set the environment variable TF_CONFIG with information about the distributed cluster.
The gradients are streamed to the host of each worker, then averaged across the workers over the network, and finally streamed back to the IPUs where the weight updates are performed. To ensure that the workers perform identical weight updates, stochastic rounding should be turned off. Distributed training gives a total batch size of num_workers * num_replicas * pipeline_depth * batch_size.
For example, to distribute the training over two machines, with hostnames worker0 and worker1, run this on worker0:
export TF_CONFIG='{"cluster":{"worker":["worker0:3636","worker1:3636"]},"task":{"type":"worker","index":0}}'
python train.py --dataset imagenet --data-dir .../imagenet-data --model-size 50 --batch-size 4 \
--batches-per-step 1 --shards 4 --pipeline-depth 64 --pipeline-splits b1/2/relu b2/3/relu b3/5/relu \
--xla-recompute --replicas 4 --distributed --no-stochastic-rounding --base-learning-rate -11
Run exactly the same commands on worker1, with the exception of changing the worker index in TF_CONFIG from 0 to 1:
export TF_CONFIG='{"cluster":{"worker":["worker0:3636","worker1:3636"]},"task":{"type":"worker","index":1}}'
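As a quick arithmetic check of the total batch size formula for this two-worker example (2 workers, 4 replicas, pipeline depth 64, batch size 4):
# num_workers * num_replicas * pipeline_depth * batch_size
echo $((2 * 4 * 64 * 4))    # prints 2048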
Alternatively, the mpirun program can be used to start the training over the cluster, and the worker index will be picked up from the OMPI_COMM_WORLD_RANK environment variable:
mpirun --tag-output -H worker0,worker1 -bind-to none -map-by slot -x LD_LIBRARY_PATH -x PATH \
-x TF_CONFIG='{"cluster":{"worker":["worker0:3636","worker1:3636"]},"task":{"type":"worker"}}' \
python train.py [...]
Note: Depending on the size of the model, you might have to limit stream copy merging to avoid overflowing the host exchange outbound page table, by setting e.g. export POPLAR_ENGINE_OPTIONS='{"opt.maxCopyMergeSize": 8388608}'.
--model-size: The size of the model to use. Only certain sizes are supported depending on the model and input data size. For ImageNet data, values of 18, 34 and 50 are typical; for CIFAR, 14, 20 and 32. Check the code for the full ranges.
--batch-norm: Batch normalisation is recommended for the medium and larger batch sizes that will typically be used when training with CIFAR data on the IPU (and is the default for CIFAR). For ImageNet data, smaller batch sizes are usually used and group normalisation is recommended (and is the default for ImageNet).
--group-norm: Group normalisation can be used for small batch sizes (including 1) when batch normalisation would be unsuitable. Use the --groups option to specify the number of groups used (default 32).
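For instance, a sketch of small-batch ImageNet training with group normalisation (the values shown are illustrative, not tuned settings):
python train.py --dataset imagenet --data-dir path-to/imagenet --model-size 50 \
    --batch-size 1 --group-norm --groups 32 --available-memory-proportion 0.1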
--model-size: The model to use, default 'B0' for EfficientNet-B0. 'B1' to 'B7' are also supported, plus an unofficial 'cifar' size which can be used with CIFAR-sized datasets.
--group-dim: The dimension used for group convolutions, default 1. Using a higher group dimension can improve performance on the IPU but will increase the number of parameters.
--expand-ratio: The EfficientNet expansion ratio, default 6. When using a higher group dimension this can be reduced to keep the number of parameters approximately the same as in the official models.
Use python train.py --model efficientnet --help to see the other model options.
--batch-size: The batch size used for training. When training on IPUs this will typically be smaller than the batch sizes used on GPUs. A batch size of four or less is often used, but in these cases using group normalisation is recommended instead of batch normalisation (see --group-norm).
--base-learning-rate: The base learning rate is scaled by the batch size to obtain the final learning rate. This means that a single base-learning-rate value can be appropriate over a range of batch sizes.
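For illustration, the same base learning rate could be reused while changing the batch size, since the harness derives the final learning rate from the batch size. The value -11 is simply reused from the distributed example above and the other settings are illustrative; check train.py for the exact scaling formula.
python train.py --dataset cifar-10 --data-dir path-to/cifar-10 --model-size 20 \
    --batch-size 16 --base-learning-rate -11
python train.py --dataset cifar-10 --data-dir path-to/cifar-10 --model-size 20 \
    --batch-size 32 --base-learning-rate -11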
--epochs / --iterations: These options determine the length of training; only one of them should be specified.
--data-dir: The directory in which to search for the training and validation data. ImageNet must be in TFRecord format; CIFAR-10/100 must be in binary format. If you have set a DATA_DIR environment variable then this will be used if --data-dir is omitted.
--dataset: The dataset to use. Must be one of imagenet, cifar-10 or cifar-100. This can often be inferred from the --data-dir option.
--lr-schedule: The learning rate schedule function to use. The default is stepped, which is configured using --learning-rate-decay and --learning-rate-schedule. You can also select cosine for a cosine learning rate.
--warmup-epochs: Both the stepped and cosine learning rate schedules can have a warmup length which linearly increases the learning rate over the first few epochs (default 5). This can help prevent early divergences in the network when weights have been set randomly. To turn off warmup, set the value to 0.
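For example, a cosine schedule with the default 5-epoch warmup might be selected like this (a sketch with a placeholder data path, not a tuned configuration):
python train.py --dataset cifar-10 --data-dir path-to/cifar-10 --model-size 20 \
    --lr-schedule cosine --warmup-epochs 5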
--shards: The number of IPUs to split the model over (default 1). If shards > 1 then the first part of the model will be run on one IPU, with later parts run on other IPUs and data passed between them. This is essential if the model is too large to fit on a single IPU, but can also be used to increase the possible batch size.
It may be necessary to influence the automatic sharding algorithm using the --sharding-exclude-filter and --sharding-include-filter options if the memory use across the IPUs is not well balanced, particularly for large models. These options specify sub-strings of edge names that can be used when looking for a cutting point in the graph. They will be ignored when using pipelining, which uses --pipeline-splits instead. See also --pipeline-depth.
--pipeline-depth: When a model is run over multiple IPUs (see --shards), pipelining the data flow can improve throughput by utilising more than one IPU at a time. At present the splitting points for the pipelined model must be specified, with one less split than the number of IPUs used. Use --pipeline-splits to specify the splits; if omitted then the list of available splits will be output. The splits should be chosen to balance the memory use across the IPUs.
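As noted above, running without --pipeline-splits prints the available split points, so a sketch like the following (with illustrative values and a placeholder data path) can be used to discover candidate splits before choosing them:
python train.py --dataset imagenet --data-dir path-to/imagenet --model-size 50 \
    --batch-size 2 --shards 2 --pipeline-depth 256 --xla-recompute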
--precision: Specifies the data types to use for calculations. The default, 16.16, uses half-precision floating-point arithmetic throughout. This lowers the required memory, which allows larger models to train or larger batch sizes to be used. It is, however, more prone to numerical instability and will have a small but significant negative effect on final trained model accuracy. The 16.32 option uses half-precision for most operations but keeps master weights in full precision; this should improve final accuracy at the cost of extra memory usage. The 32.32 option uses full-precision floating-point throughout.
--no-stochastic-rounding: By default stochastic rounding on the IPU is turned on. This gives extra precision and is especially important when using half-precision numbers. Using stochastic rounding on the IPU doesn't impact the performance of arithmetic operations, so it is generally recommended that it is left on.
--batches-per-step: The number of batches that are performed on the device before returning to the host. Setting this to 1 will make only a single batch be processed on the IPU(s) before returning data to the host. This can be useful when debugging, but the extra data transfers will significantly slow training. The default value of 1000 is a reasonable compromise for most situations.
--select-ipus: The choice of which IPU to run the training and/or validation graphs on is usually automatic, based on availability. This option can be used to override that selection by choosing a training, and optionally also a validation, IPU ID. The IPU configurations can be listed using the gc-info tool (type gc-info -l on the command line).
--fp-exceptions: Turns on floating point exceptions.
--no-hostside-norm: Moves the image normalisation from the CPU to the IPU. This can help improve throughput if the bottleneck is on the host CPU, but marginally increases the workload and code size on the IPU.
--gc-profile: Allows profiling with the gc-profile tool. Python must be run via gc-profile:
gc-profile -d profile_dir -- python train.py --gc-profile
The type of execution profile can be specified (default: IPU_PROFILE):
- NO_PROFILE: indicates that there should be no execution profiling.
- DEVICE_PROFILE: indicates that the execution profile should contain only device wide events.
- IPU_PROFILE: indicates that the profile should contain IPU level execution events.
- TILE_PROFILE: indicates that the profile should contain Tile level execution events.
NOTE: If using multiple iterations, the profile only considers the first iteration.
--available-memory-proportion: The approximate proportion of memory which is available for convolutions. It may need to be adjusted (e.g. to 0.1) if an Out of Memory error is raised. A reasonable range is [0.05, 0.6]. Multiple values may be specified when using pipelining; in this case two values should be given for each pipeline stage (the first is used for convolutions and the second for matmuls).
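A hedged sketch of per-stage values for a two-stage pipeline follows; the space-separated argument syntax and the 0.2 values are assumptions, so check python train.py --help for the exact form:
python train.py --dataset imagenet --data-dir path-to/imagenet --model-size 50 --batch-size 2 \
    --shards 2 --pipeline-depth 256 --pipeline-splits b3/1/relu --xla-recompute \
    --available-memory-proportion 0.2 0.2 0.2 0.2    # two values per stage: convolutions, then matmuls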
--no-validation: Turns off validation.
--valid-batch-size: The batch size to use for validation.
Note that the validation.py script can be run to validate previously generated checkpoints. Use the --restore-path option to point to the checkpoints and set up the model the same way as in training. See also the --max-ckpts-to-keep option when training.
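For instance, a previously trained ResNet-50 checkpoint could be validated with a sketch like this (the log directory name and data path are placeholders, and the model options must match those used for training):
python validation.py --dataset imagenet --data-dir path-to/imagenet --model-size 50 \
    --batch-size 2 --restore-path logs/previous_training_run/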
--synthetic-data: Uses a synthetic dataset filled with random data. If running with this option turned on is significantly faster than running with real data, then the training speed is likely CPU-bound.
--gradients-to-accumulate: The number of gradients to accumulate before doing a weight update. This allows the effective mini-batch size to increase to sizes that would otherwise not fit into memory. Note that this option cannot be used when using --pipeline-depth, as gradients will instead be accumulated over the pipeline depth.
--max-ckpts-to-keep: The maximum number of checkpoints to keep when training. Defaults to 1, except when the validation mode is set to end.
--replicas: The number of replicas of the graph to use. Using N replicas increases the batch size by a factor of N (as well as the number of IPUs used for training).
--optimiser: Choice of optimiser. The default is SGD, but momentum and RMSprop are also available, and these have additional options.
--momentum: Momentum coefficient to use when --optimiser is set to momentum. The default is 0.9.
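For example, switching to SGD with momentum might look like this (illustrative values, placeholder data path):
python train.py --dataset cifar-10 --data-dir path-to/cifar-10 --model-size 20 \
    --optimiser momentum --momentum 0.9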
--label-smoothing: A label smoothing factor can help improve the model accuracy by reducing the polarity of the labels in the final layer. A reasonable value might be 0.1 or 0.2. The default is 0, which implies no smoothing.
--weight-decay: Value for the weight decay bias. Setting it to 0 removes weight decay.
--loss-scaling: When using mixed or half precision, loss scaling can be used to help preserve small gradient values. This scales the loss value calculated in the forward pass (and unscales it before the weight update) to reduce the chance of gradients becoming zero in half precision. The default value should be suitable in most cases.
--seed: Setting an integer as a seed will make training runs reproducible. At present this also limits the pre-processing pipeline on the CPU to a single thread, which has significant performance implications.
--standard-imagenet: By default the ImageNet preprocessing pipeline uses optimisations to split the dataset in order to maximise CPU throughput. This option allows you to revert to the standard ImageNet preprocessing pipeline.
--no-dataset-cache: Don't keep a cache of the ImageNet dataset in host RAM. Using a cache gives a speed boost after the first epoch is complete, but does use a lot of memory. It can be useful to turn off the cache if multiple training runs are happening on a single host machine.
Training can be resumed from a checkpoint using the restore.py script. You must supply the --restore-path option with a valid checkpoint.
Logging information can be generated that can be used with the GC profiler tool.
Use --help to show all available options.