"Improving neural networks by enforcing co-adaptation of feature detectors"
Imagine starting from an arbitrary layer of a neural network with input vector $x$ of $n$ features.
To pose "compression" as an optimization problem, we could phrase it as: "Hit the target as close as possible using either the $k=1,2,\dots$ first features, or all $n$ features."
I.e., learning a representation that is incrementally better the more features you add. Let's describe this as explicitly minimizing the weighted sum of the per-$k$ losses, where the loss for a given $k$ is computed using only the first $k$ features.
This would be a lot of forward passes (one per feature), so what if we instead randomly sample $k$ for each example in the batch? Doing so, we see that in expectation (i.e., with a large batch size) we approximate the original objective.
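To spell this out, here is one way to write it down (the symbols below — loss $\mathcal{L}$, network $f_\theta$, weights $w_k$, masks $m_k$ and cutoff distribution $P$ — are my own labels chosen to match the prose, not taken from the original):

$$
\min_{\theta}\ \sum_{k=1}^{n} w_k\,\mathcal{L}\big(f_\theta(x \odot m_k),\,y\big),
\qquad
m_k = (\underbrace{1,\dots,1}_{k\ \text{ones}},\,0,\dots,0).
$$

Sampling $k \sim P$ independently for every example and minimizing the ordinary per-example loss gives, in expectation,

$$
\mathbb{E}_{k\sim P}\Big[\mathcal{L}\big(f_\theta(x \odot m_k),\,y\big)\Big]
\;=\; \sum_{k=1}^{n} P(k)\,\mathcal{L}\big(f_\theta(x \odot m_k),\,y\big),
$$

i.e. the weighted-sum objective with $w_k = P(k)$, estimated by Monte Carlo over the batch.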
And that's all there is to it!
I'll add details on how we sample $k$ later.
TailDropout is an nn.Module with the same API as nn.Dropout, applied to a tensor x:
from taildropout import TailDropout
dropout = TailDropout(p=0.5, batch_dim=0, dropout_dim=-1)
y = dropout(x)
At training time, a random $k$ first features are kept. Results are as expected; this makes a layer learn features of additive importance, like PCA.
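As a mental model only, the training-time behavior can be pictured as sampling a per-example cutoff and zeroing everything after it. The sketch below is not the library's implementation; the function name and the geometric cutoff distribution are my assumptions.

```python
import torch

def tail_dropout_sketch(x: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Illustrative sketch: per example, sample a cutoff k and keep only the first k features.

    Assumes x has shape (batch, features). The cutoff distribution (geometric with
    parameter p, clamped to [1, n]) is a guess for illustration; TailDropout's actual
    sampling may differ.
    """
    n_batch, n = x.shape
    # One cutoff per example in the batch.
    k = torch.distributions.Geometric(probs=p).sample((n_batch,)).long() + 1
    k = k.clamp(max=n)
    # Feature j survives iff j < k, so a random-length tail is zeroed out.
    keep = torch.arange(n).unsqueeze(0) < k.unsqueeze(1)  # (n_batch, n) boolean mask
    return x * keep
```

Unlike this sketch, the actual module also handles arbitrary batch_dim/dropout_dim, .eval() mode, and set_k (see below).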
See example.ipynb for complete examples.
To use it for pruning, or for estimating the optimal size of a hidden dim, compute loss vs. n_features and create a scree plot:
import torch
import matplotlib.pyplot as plt

losses = []
model.eval()
with torch.no_grad():
    for k in range(n_features):
        model.dropout.set_k(k)                        # evaluate using only the first k features
        losses.append(criterion(model(x), y).item())  # criterion(prediction, target)
plt.plot(range(n_features), losses)
plt.title("Loss vs n_features used")
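The snippet above assumes the model exposes a TailDropout layer as model.dropout; a minimal hypothetical wiring (names and sizes are illustrative, not from the repo) could look like:

```python
import torch
from torch import nn
from taildropout import TailDropout

class MLP(nn.Module):
    def __init__(self, d_in: int, n_features: int, d_out: int):
        super().__init__()
        self.fc1 = nn.Linear(d_in, n_features)
        # TailDropout over the hidden features, using the parameters shown earlier.
        self.dropout = TailDropout(p=0.5, batch_dim=0, dropout_dim=-1)
        self.fc2 = nn.Linear(n_features, d_out)

    def forward(self, x):
        return self.fc2(self.dropout(torch.relu(self.fc1(x))))

model = MLP(d_in=784, n_features=64, d_out=10)
```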
I'm happy to release this since I've found it very useful over the years. I've used it for
- Estimating the optimal #features per layer
- In place of dropout for regularization
- To be able to choose a model size (after training to overfit!) that generalizes.
- For fiddling with neural networks. ("mechanistic interpretability")
The implementation is faster than nn.Dropout, supports multi-GPU, and works with torch.compile().
At each layer, a scalar input feature x[j] of the feature vector x decides how far to map the input in the direction W[:,j] of the layer's output space. This is done by W[:,j] * x[j].
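In symbols, the layer's matrix-vector product decomposes over the columns of W:

$$
W x \;=\; \sum_{j=1}^{n} x[j]\, W[:,j].
$$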
The difference between TailDropout and regular dropout, in terms of these directions:

- TailDropout: teach the $k$ first directions to map input to target as well as possible, for every $k$. Each direction has a decreasing probability of being used.
- Regular dropout: teach every random subset of directions to map input to target as well as possible. Each direction in W has the same inclusion probability, but there is no ordering among them.
Regular dropout also scales the kept inputs by $1/(1-p)$ at training time, so that they match .eval() mode (the identity function) in expectation.
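For contrast, here is stock nn.Dropout behavior (plain PyTorch, shown only to highlight the difference from TailDropout):

```python
import torch
from torch import nn

x = torch.ones(1, 8)
drop = nn.Dropout(p=0.5)

drop.train()
print(drop(x))  # a random subset of entries is zeroed; survivors are scaled by 1/(1-p) = 2.0

drop.eval()
print(drop(x))  # identity: all ones
```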
If W is some weights, then the SVD compression (same as PCA) is:
U, s, V = SVD(W)
assert W == U @ s @ V
import torch

W = torch.randn([2, 10])
U, s, V = torch.linalg.svd(W)   # note: torch.linalg.svd returns V^H, so "V" here is V transposed
S = torch.hstack([torch.diag(s), torch.zeros(2, 8)])   # pad diag(s) from 2x2 to 2x10
torch.testing.assert_close(
    W,
    U @ S @ V
)
Here s holds the singular values of W. To use only the k first factors/components (singular vectors) to represent W, set s[k:] = 0.
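Continuing the snippet above (with an arbitrary choice of k; the variable names below are mine):

```python
k = 1
s_k = s.clone()
s_k[k:] = 0                                    # keep only the first k singular values
W_k = U @ torch.hstack([torch.diag(s_k), torch.zeros(2, 8)]) @ V
# The L2 (spectral) error of the rank-k reconstruction is the (k+1)-th singular value.
print(torch.linalg.matrix_norm(W - W_k, ord=2), s[k])
```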
Note that SVD compresses W optimally w.r.t. the Euclidean (L2) norm, for every k.
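This is the Eckart–Young theorem; writing $W_k$ for the rank-$k$ truncation (my notation), it states

$$
\min_{\operatorname{rank}(A)\,\le\,k}\ \lVert W - A\rVert_2 \;=\; \lVert W - W_k\rVert_2 \;=\; s_{k+1},
$$

where $s_{k+1}$ is the $(k+1)$-th largest singular value; an analogous statement holds for the Frobenius norm.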
But you want to compress each layer w.r.t. the final loss function, with lots of non-linearities in between!
When using TailDropout on the embedding layer, k controls the compression rate. Here, even with k=1, the resulting 1d scalar embedding apparently separates shoes and shirts.
Compare this to how regular dropout works. Well, it's quite a bit more random.
dropout = TailDropout()
dropout.train()
dropout(x) # random
dropout.eval()
dropout(x) # Identity function
dropout.set_k(k)
dropout(x) # use first k features
"2d Dropout" == Keep mask constant over spatial dimension. Popular approach.
import torch
from torch import nn
from taildropout import TailDropout

n_batch, n_features, n_pixels_x, n_pixels_y, kernel_size = 32, 16, 28, 28, 3
x = torch.randn(n_batch, n_features, n_pixels_x, n_pixels_y)
cnn = nn.Conv2d(n_features, n_features, kernel_size)
# Mask the channel dimension (dim 1); the same mask is applied at every spatial position.
taildropout = TailDropout(batch_dim=0, dropout_dim=1)
x = cnn(x)
x = taildropout(x)
If you don't care much about regularization, a dropout probability on the order of 1e-5 still seems to give a good compression effect. I typically use TailDropout(p=0.001) to get both.
@misc{Martinsson2018,
author = {Egil Martinsson},
title = {TailDropout},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/naver/taildropout}},
commit = {master}
}
This work was open sourced in 2025, but primarily done in 2018 at Naver Clova/Clair. Big thanks to Minjoon Seo for the original inspiration from his work on Skim-RNN, and to Ji-Hoon Kim, Adrian Kim, Jaesung Huh, Prof. Jung-Woo Ha, and Prof. Sung Kim for valuable discussions and feedback.
I'm sure this simple idea has been implemented before 2018 (something I was unaware of at the time) or since (something I have not had time to look for). Please let me know if there's anything relevant I should cite.