Very slow algorithm, is that normal? #6

Open
lucastononrodrigues opened this issue Jun 21, 2021 · 3 comments
Comments


lucastononrodrigues commented Jun 21, 2021

Hello,

I implemented the algorithm in a Vision Transformer architecture in the following way:

# inside __init__()
self.spe = SineSPE(num_heads=head_cnt, in_features=in_dim, num_sines=5, num_realizations=64)
self.filter = SPEFilter(gated=False, code_shape=self.spe.code_shape)

# inside forward()
q, k = self.filter(q, k, self.spe(q.shape[:2]))
qk, kp = performer(...)
out = lin_attention(...)

The model I am using has 4 layers, 6 heads, an embedding dimension of 384, and patch_size=4.

Training for 100 epochs on CIFAR-100 converges to 42.3% with SPE and 45.3% without it. Although that may be expected, the training time with SPE is around 6× longer. Is that normal?
Performer + ViT takes 39 minutes.
Performer + ViT + SPE takes around 4 hours.
For both I am using 2 Titan XP GPUs.

This is very problematic for me because I was considering scaling these experiments up to ImageNet.

I would also like to know how I can implement the T = N^2 indexing for images described in Section 2 of the paper (where did you do this in the LRA benchmark?).

Many thanks!

cifkao (Collaborator) commented Jun 28, 2021

We did experience longer training times with SPE, but not 6×; more like 2× in the case of SineSPE. We provide the training times in the appendix.

Note that we shared the positional codes across layers in most of our experiments. This means storing the result of self.spe(q.shape[:2]) and passing it to every attention layer in your network; see the sketch below. Sharing across attention heads is another option, although we have not tried it.
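
For illustration, a minimal sketch of what that sharing could look like (the wrapper class, `blocks`, and import path are placeholders for your own code, not something from our package):

```python
import torch.nn as nn
from spe import SineSPE  # assumed import path; use whatever you already use for SineSPE

class SharedSPEBackbone(nn.Module):
    """Hypothetical wrapper: one SineSPE module whose codes are reused by all blocks."""
    def __init__(self, blocks, head_cnt, in_dim):
        super().__init__()
        self.spe = SineSPE(num_heads=head_cnt, in_features=in_dim,
                           num_sines=5, num_realizations=64)
        self.blocks = nn.ModuleList(blocks)  # each block's attention owns its own SPEFilter

    def forward(self, x):
        # draw the positional codes once per forward pass ...
        pos_codes = self.spe(x.shape[:2])
        for block in self.blocks:
            # ... and pass the same codes to every attention layer, whose
            # filter then does: q, k = self.filter(q, k, pos_codes)
            x = block(x, pos_codes)
        return x
```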

Also, due to sample-wise sharing (i.e. among samples within a batch), SPE benefits from large batch sizes. If your batch size is small, you may indeed get a big performance hit.

cifkao (Collaborator) commented Jun 28, 2021

> I would also like to know how I can implement the T = N^2 indexing for images described in Section 2 of the paper (where did you do this in the LRA benchmark?).

We did not do this in our experiments; we simply used the same 1D indexing as the APE baseline. It should be possible to achieve 2D indexing with ConvSPE by passing shape=(batch_size, N, N), but SineSPE is only defined for 1D sequences.
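
For reference, a rough sketch of that ConvSPE variant, mirroring the snippet from the original post (the exact constructor arguments, e.g. `ndim` and `kernel_size`, and the layout that `q`/`k` need to have should be double-checked against the package):

```python
from spe import ConvSPE, SPEFilter  # assumed import path, as for SineSPE

# inside __init__(), replacing the SineSPE module
self.spe = ConvSPE(ndim=2, num_heads=head_cnt, in_features=in_dim,
                   num_realizations=64, kernel_size=8)  # kernel_size is an arbitrary example
self.filter = SPEFilter(gated=False, code_shape=self.spe.code_shape)

# inside forward(): generate the codes on the 2D patch grid instead of a 1D sequence
codes = self.spe((batch_size, N, N))  # N x N grid of patches
q, k = self.filter(q, k, codes)       # q, k must be arranged to match the codes' shape
```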

aliutkus (Owner) commented

Dear @lucastononrodrigues, sorry for the delay.

I would add that you can actually implement SineSPE with 2D signals quite straightforwardly. The trick would be to replace the term

\sum_{k=1}^K \lambda_{kd}^2 \cos(2\pi f_{kd} (m-n) + \theta_{kd})

in equation (18) of the paper by a 2D vector $\textbf{f}_{kd}$ and 2D indices $\textbf{m}$ and $\textbf{n}$, so as to yield:

\sum_{k=1}^K \lambda_{kd}^2 \cos(2\pi \textbf{f}_{kd}^\top (\textbf{m}-\textbf{n}) + \theta_{kd})

This can be implemented pretty straightforwardly by replacing each $\cos(2\pi a_k n + b_k)$ in the construction of $\textbf{\Omega}$ with $\cos(2\pi \textbf{a}_k^\top \textbf{n} + b_k)$ (and likewise for $\sin$), with everything else staying identical.
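
Purely as an illustration of that replacement (the helper below is made up, not part of the package), the 2D features could be built like this:

```python
import math
import torch

def sine_features_2d(N, freqs, phases):
    """Hypothetical helper: 2D version of the sinusoidal features.

    freqs:  (K, 2) tensor of frequency vectors a_k
    phases: (K,)   tensor of offsets b_k
    Returns cos and sin features of shape (N*N, K), one row per 2D index n.
    """
    ys, xs = torch.meshgrid(torch.arange(N), torch.arange(N), indexing="ij")
    n = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()  # 2D indices n on the N x N grid
    angles = 2 * math.pi * n @ freqs.t() + phases             # 2*pi * a_k^T n + b_k
    return torch.cos(angles), torch.sin(angles)               # the same replacement applies to sin

# e.g. K = 5 sines on a 32 x 32 grid of patches
cos_feat, sin_feat = sine_features_2d(32, 0.1 * torch.randn(5, 2), torch.zeros(5))
```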

I don't have the time right now to do it, but I would be glad to get a pull request for it, or we could also collaborate on this further, for instance on a fork of yours that we would merge later?

I would of course also be interested in identifying what exactly is slowing down your experiments, so that we can work on it.

best

antoine
