[Feature request] Efficient stochastic path without unused layers #294

Open
Aceticia opened this issue Nov 18, 2024 · 6 comments

@Aceticia
Contributor

DINO v2 finds that high values of stochastic depth are very helpful for larger models in terms of performance, and they also give an efficient implementation here that operates only on the kept (non-dropped) samples of a batch, which is very simple:

import torch
from torch import Tensor
from typing import Callable


def drop_add_residual_stochastic_depth(
    x: Tensor,
    residual_func: Callable[[Tensor], Tensor],
    sample_drop_ratio: float = 0.0,
) -> Tensor:
    # 1) extract subset using permutation
    b, n, d = x.shape
    sample_subset_size = max(int(b * (1 - sample_drop_ratio)), 1)
    brange = (torch.randperm(b, device=x.device))[:sample_subset_size]
    x_subset = x[brange]

    # 2) apply residual_func to get residual
    residual = residual_func(x_subset)

    x_flat = x.flatten(1)
    residual = residual.flatten(1)

    residual_scale_factor = b / sample_subset_size

    # 3) add the residual
    x_plus_residual = torch.index_add(x_flat, 0, brange, residual.to(dtype=x.dtype), alpha=residual_scale_factor)
    return x_plus_residual.view_as(x)

In practice, with stochastic depth as high as 0.4, memory usage almost halves.
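
For context, here is a rough sketch of how the helper above could wrap the attention and MLP residual branches of a block. SimpleBlock, attn and mlp are illustrative names I'm making up here, not the DINO v2 or x-transformers API:

import torch.nn as nn

class SimpleBlock(nn.Module):
    def __init__(self, dim, attn, mlp, sample_drop_ratio = 0.4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = attn    # any module mapping (b, n, d) -> (b, n, d)
        self.mlp = mlp      # same contract
        self.sample_drop_ratio = sample_drop_ratio

    def forward(self, x):
        if self.training and self.sample_drop_ratio > 0.:
            # during training, only the kept subset of the batch runs through attn / mlp
            x = drop_add_residual_stochastic_depth(x, lambda t: self.attn(self.norm1(t)), self.sample_drop_ratio)
            x = drop_add_residual_stochastic_depth(x, lambda t: self.mlp(self.norm2(t)), self.sample_drop_ratio)
        else:
            # at eval time, the usual full-batch residual path
            x = x + self.attn(self.norm1(x))
            x = x + self.mlp(self.norm2(x))
        return x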

This repo already provides a form of stochastic depth where entire layers are dropped. That also achieves a similar effect to the DINO v2 implementation, in that dropped samples of a batch don't waste compute. However, because whole layers are skipped, we are forced to use find_unused_parameters=True when training with DDP, which adds further overhead... besides, dropping an entire layer across the whole batch feels kinda weird and might introduce biases.
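
For comparison, layer-level stochastic depth looks roughly like the sketch below (illustrative only, not this repo's actual code); since a skipped layer contributes nothing to the autograd graph that step, DDP then needs find_unused_parameters=True:

import torch

def layer_drop_forward(x, layers, layer_drop_prob = 0., training = True):
    for layer in layers:
        if training and torch.rand(()).item() < layer_drop_prob:
            # the whole layer is skipped for the entire batch, so none of its
            # parameters receive gradients this step
            continue
        x = x + layer(x)
    return x

# which is why DDP training then requires something like:
# model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters = True)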

I can contribute something and integrate this into the attention and MLP layers. What do you think? Are there any other reasons you keep the entire-layer drop (apart from the potential overhead when the drop ratio is low)?

@lucidrains
Owner

@Aceticia hello again Xujin/Chris

stochastic depth is popular in some circles for sure

what do you think about just forcing the parameters to be used by sending in a single dummy token, multiplying the output by 0 and summing it to the stream? that should fix the ddp issue?
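
something along these lines, just a sketch of the idea (the names here are made up):

import torch

def residual_with_dummy(x, layer, drop_this_layer: bool):
    if drop_this_layer:
        # push a single dummy token through the layer so every parameter is "used",
        # multiply by 0 and sum it into the stream so the output is unchanged
        dummy = layer(torch.zeros(1, 1, x.shape[-1], device = x.device, dtype = x.dtype))
        return x + 0. * dummy.sum()
    return x + layer(x)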

@Aceticia
Contributor Author

Hello again! I go by Chris :D

Sounds like a good solution; I can't really think of any side effects.

@lucidrains
Owner

@Aceticia ok Chris i'll add it later this evening and you can let me know if that unused parameters issue persists

@lucidrains
Owner

@Aceticia did you see anything interesting when splitting dimensions for alibi across heads?

@Aceticia
Contributor Author

> @Aceticia did you see anything interesting when splitting dimensions for alibi across heads?

I tried it out. I didn't have time for a complete run, but sadly I don't see much difference from just using alibi in time. We made the compromise of using consistent time ordering across samples with rotary pos emb in time and a learned positional embedding across space, and that's the best we have so far.

Can't spend forever on this - sorry to have taken some of your time. Good knowledge though.

@lucidrains
Owner

@Aceticia no problem! just your sharing this makes it worth it

thanks!
