Added Dense and Conv BatchEnsemble layers along with unit tests and example on MNIST classification using LeNet5 #4
Conversation
DwaraknathT commented Aug 30, 2021
- Added BatchEnsemble layers -- the idea is to factorize the weight matrix of each member in the ensemble into three matrices: one full matrix with the same shape as the layer's weight matrix, and two fast factors (usually rank-1). A member's weights are generated by taking the outer product of the two fast factors and then taking the Hadamard product of the result with the full matrix (see the sketch after this list).
- Added unit tests for both BatchEnsemble layers
- Added an example of using BatchEnsemble layers for MNIST classification with LeNet5
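A minimal sketch of the factorization for a single ensemble member, with made-up sizes (this illustrates the idea described above, not the PR's actual code):

in_size, out_size = 4, 3
W = randn(out_size, in_size)        # shared full (slow) weight matrix
r = randn(out_size)                 # fast rank-1 factor for member i
s = randn(in_size)                  # fast rank-1 factor for member i
W_i = W .* (r * s')                 # outer product of the fast factors, then Hadamard product with W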
Thoughts on starting to add GPU tests along with the regular ones? In theory it should be as straightforward as gpu(layer), gpu(input), @test ...
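A minimal sketch of what such a test could look like, assuming CUDA.jl is available; the DenseBatchEnsemble constructor arguments below are illustrative, not the exact signature from this PR:

using Flux, CUDA, Test

layer = DenseBatchEnsemble(10, 5, 1, 4)         # in, out, rank, ensemble_size (assumed order)
x = rand(Float32, 10, 8)                        # batch size divisible by ensemble_size

if CUDA.functional()                            # skip gracefully on CPU-only CI
    gpu_layer = gpu(layer)
    gpu_x = gpu(x)
    @test collect(gpu_layer(gpu_x)) ≈ layer(x)  # GPU result matches CPU result
end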
function ConvBatchEnsemble(
    k::NTuple{N,Integer},
    ch::Pair{<:Integer,<:Integer},
    rank::Integer,
    ensemble_size::Integer,
    σ = identity;
    init = glorot_normal,
    alpha_init = glorot_normal,
    gamma_init = glorot_normal,
    stride = 1,
    pad = 0,
    dilation = 1,
    groups = 1,
    bias = true,
    ensemble_bias = true,
    ensemble_act = identity,
Same comment as last time about keeping things simple and general.
Maybe it makes sense to have a constructor that takes in a Conv layer directly?
Yeah, it does. I guess we can have both as well.
We actually need the input/output dimensions to create the alpha/gamma matrices. Might as well keep them in the signature, or we'll have to infer them from the conv layer's struct, and that might change at any time in the Flux source?
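For illustration, a hypothetical convenience constructor along these lines could read the sizes off an existing Conv layer. The field access below assumes Flux's current Conv layout (weight stored as (k..., ch_in, ch_out)), which is exactly the stability concern raised above; groups are ignored for brevity:

function ConvBatchEnsemble(layer::Flux.Conv, rank::Integer, ensemble_size::Integer; kwargs...)
    w = layer.weight                                 # (k..., ch_in, ch_out) in Flux's current layout
    k = size(w)[1:end-2]                             # kernel spatial dims
    ch = size(w, ndims(w) - 1) => size(w, ndims(w))  # in => out channels
    return ConvBatchEnsemble(k, ch, rank, ensemble_size, layer.σ;
                             stride = layer.stride, pad = layer.pad,
                             dilation = layer.dilation, kwargs...)
end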
    ensemble_act::F = identity,
    rank = 1,
) where {M,F,L}
    ensemble_bias = create_bias(gamma, ensemble_bias, size(gamma)[1], size(gamma)[2])
Can you test it with FluxML/Flux.jl#1402?
alpha = repeat(alpha, samples_per_model)
gamma = repeat(gamma, samples_per_model)
# Reshape alpha, gamma to [units, batch_size, rank]
e_b = reshape(e_b, (1, 1, out_size, batch_size))
Size of the bias seems relevant here.
How do we know that the shape of the bias allocated can fit into the container it's expected to be in?
outputs = sum(outputs, dims = 3)
outputs = reshape(outputs, (out_size, samples_per_model, ensemble_size))
# Reshape ensemble bias
e_b = Flux.unsqueeze(e_b, ndims(e_b))
Curious: Are the sizes of bias somewhat variable in these methods?
Oh right, you meant the physical size in memory? No, those sizes are not variable. There is a fixed number of elements in the bias; we just change the shape of the array. If you meant the logical size (the shape, in NumPy terms), then yes, they are variable.
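A tiny illustration of that distinction, with made-up sizes: reshape changes only the logical shape, while the number of stored elements (and the memory backing them) stays fixed.

e_b = zeros(Float32, 12)              # 12 elements allocated once
e_b = reshape(e_b, (1, 1, 3, 4))      # still the same 12 elements, new shape (1, 1, 3, 4)
@assert length(e_b) == 12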
alpha = reshape(alpha, (in_size, ensemble_size * rank))
gamma = reshape(gamma, (out_size, ensemble_size * rank))
# Repeat breaks on GPU when input dims > 2
alpha = repeat(alpha, samples_per_model)
Do we need to materialise this array, or can we broadcast it to higher dimensions? Something like:
julia> x = ones(3,3)
3×3 Matrix{Float64}:
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
julia> y = zeros(3,3,3)
3×3×3 Array{Float64, 3}:
[:, :, 1] =
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
[:, :, 2] =
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
[:, :, 3] =
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
julia> x .+ y
3×3×3 Array{Float64, 3}:
[:, :, 1] =
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
[:, :, 2] =
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
[:, :, 3] =
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
Notice that the lower-dimensional array was broadcast to the higher dimensions automatically.
We are already broadcasting the input over the last dimension (the rank dimension). I think we have to materialize the array because, conceptually, the idea is to take a minibatch of samples (of batch size B) and repeat it N times to get an effective minibatch of B*N. We then want each of the N copies of the B samples to be given a different ensemble member's weights, so the fast weights (alpha, gamma) need to match the effective batch size before they can be broadcast over the final dimension.
Also, the starting shape of the fast weights is (in_size, ensemble_size, rank) while the input shape is (in_size, batch_size), so we need the repeat call to make the dimensions compatible for the * op.
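A minimal shape sketch of that effective-minibatch idea, with made-up sizes and a simplified layout (the PR's actual reshapes differ):

B, N, in_size = 3, 2, 4                    # batch size, ensemble size, features

x = randn(Float32, in_size, B)             # one minibatch
x_rep = repeat(x, 1, N)                    # (in_size, B*N): N copies of the batch

alpha = randn(Float32, in_size, N)         # one fast vector per ensemble member
alpha_rep = repeat(alpha, inner = (1, B))  # (in_size, B*N): member i covers its B columns

y = x_rep .* alpha_rep                     # each copy of the batch gets its own member's alpha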
But the samples are always the same, so why would it matter if it's materialised or not?