Very slow algorithm, is that normal? #6

Open
lucastononrodrigues opened this issue Jun 21, 2021 · 3 comments
Comments


lucastononrodrigues commented Jun 21, 2021

Hello,

I implemented the algorithm in a Vision Transformer architecture in the following way:

# inside __init__()
self.spe = SineSPE(num_heads=head_cnt, in_features=in_dim, num_sines=5, num_realizations=64)
self.filter = SPEFilter(gated=False, code_shape=self.spe.code_shape)

# inside forward()
q, k = self.filter(q, k, self.spe(q.shape[:2]))
qk, kp = performer(...)
out = lin_attention(...)

The model I am using has 4 layers, 6 heads, an embedding dimension of 384, and patch_size=4.

Training for 100 epochs on CIFAR-100 converges to 42.3% with SPE and 45.3% without it. Although that may be expected, the training time with SPE is around 6× longer. Is that normal?
Performer + ViT takes 39 minutes.
Performer + ViT + SPE takes around 4 hours.
For both I am using 2 Titan XP GPUs.

This is very problematic for me because I was considering scaling these experiments up to ImageNet.

I would also like to know how I can implement the T = N^2 indexing for images described in Section 2 of the paper (where did you do this in the LRA benchmark?).

Many thanks!

cifkao (Collaborator) commented Jun 28, 2021

We did experience longer training times with SPE, but not 6×; more like 2× in the case of SineSPE. We provide the training times in the appendix.

Note that we shared the positional codes across layers in most of our experiments. This means storing the result of self.spe(q.shape[:2]) and passing it to every attention layer in your network; see the sketch below. Sharing across attention heads is another option, although we have not tried it.
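
For illustration, a minimal sketch of what that sharing could look like (the wrapper class, `blocks`, and import path are placeholders for your own code, not something from our package):

```python
import torch.nn as nn
from spe import SineSPE  # assumed import path; use whatever you already use for SineSPE

class SharedSPEBackbone(nn.Module):
    """Hypothetical wrapper: one SineSPE module whose codes are reused by all blocks."""
    def __init__(self, blocks, head_cnt, in_dim):
        super().__init__()
        self.spe = SineSPE(num_heads=head_cnt, in_features=in_dim,
                           num_sines=5, num_realizations=64)
        self.blocks = nn.ModuleList(blocks)  # each block's attention owns its own SPEFilter

    def forward(self, x):
        # draw the positional codes once per forward pass ...
        pos_codes = self.spe(x.shape[:2])
        for block in self.blocks:
            # ... and pass the same codes to every attention layer, whose
            # filter then does: q, k = self.filter(q, k, pos_codes)
            x = block(x, pos_codes)
        return x
```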

Also, due to sample-wise sharing (i.e. among samples within a batch), SPE benefits from large batch sizes. If your batch size is small, you may indeed get a big performance hit.

cifkao (Collaborator) commented Jun 28, 2021

> I would also like to know how I can implement the T = N^2 indexing for images described in Section 2 of the paper (where did you do this in the LRA benchmark?).

We did not do this in our experiments; we simply used the same 1D indexing as the APE baseline. It should be possible to achieve 2D indexing with ConvSPE by passing shape=(batch_size, N, N), but SineSPE is only defined for 1D sequences.
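
For reference, a rough sketch of that ConvSPE variant, mirroring the snippet from the original post (the exact constructor arguments, e.g. `ndim` and `kernel_size`, and the layout that `q`/`k` need to have should be double-checked against the package):

```python
from spe import ConvSPE, SPEFilter  # assumed import path, as for SineSPE

# inside __init__(), replacing the SineSPE module
self.spe = ConvSPE(ndim=2, num_heads=head_cnt, in_features=in_dim,
                   num_realizations=64, kernel_size=8)  # kernel_size is an arbitrary example
self.filter = SPEFilter(gated=False, code_shape=self.spe.code_shape)

# inside forward(): generate the codes on the 2D patch grid instead of a 1D sequence
codes = self.spe((batch_size, N, N))  # N x N grid of patches
q, k = self.filter(q, k, codes)       # q, k must be arranged to match the codes' shape
```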

aliutkus (Owner) commented

Dear @lucastononrodrigues, sorry for the delay.

I would add that you can actually implement SineSPE with 2D signals quite straightforwardly. The trick would be to replace the term

\sum_{k=1}^K \lambda_{kd}^2 \cos(2\pi f_{kd} (m-n) + \theta_{kd})

in equation (18) of the paper by a 2D vector $\textbf{f}_{kd}$ and 2D indices $\textbf{m}$ and $\textbf{n}$, so as to yield:

\sum_{k=1}^K \lambda_{kd}^2 \cos(2\pi \textbf{f}_{kd}^\top (\textbf{m}-\textbf{n}) + \theta_{kd})

This can be implemented pretty straightforwardly by replacing each $\cos(2\pi a_k n + b_k)$ in the construction of $\textbf{\Omega}$ with $\cos(2\pi \textbf{a}_k^\top \textbf{n} + b_k)$ (and likewise for $\sin$), with everything else staying identical.
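
Purely as an illustration of that replacement (the helper below is made up, not part of the package), the 2D features could be built like this:

```python
import math
import torch

def sine_features_2d(N, freqs, phases):
    """Hypothetical helper: 2D version of the sinusoidal features.

    freqs:  (K, 2) tensor of frequency vectors a_k
    phases: (K,)   tensor of offsets b_k
    Returns cos and sin features of shape (N*N, K), one row per 2D index n.
    """
    ys, xs = torch.meshgrid(torch.arange(N), torch.arange(N), indexing="ij")
    n = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()  # 2D indices n on the N x N grid
    angles = 2 * math.pi * n @ freqs.t() + phases             # 2*pi * a_k^T n + b_k
    return torch.cos(angles), torch.sin(angles)               # the same replacement applies to sin

# e.g. K = 5 sines on a 32 x 32 grid of patches
cos_feat, sin_feat = sine_features_2d(32, 0.1 * torch.randn(5, 2), torch.zeros(5))
```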

I don't have the time right now to do it, but I would be glad to get a pull request for it, or we could also collaborate on this further, for instance on a fork of yours that we would merge later?

I would of course also be interested in identifying what exactly is slowing down your experiments, so that we can work on it.

best

antoine
