Feature/transformer sequence sharding #90
Draft
This PR adds a new sharding strategy, `shard_sequence`, for the transformer processor.

The current implementation (`shard_heads`) alternates between sharding across the sequence and sharding across heads for the sliding window attention mechanism, which requires two all-to-all communication steps per layer. The `shard_sequence` strategy simplifies this by keeping a sequence shard on each GPU and computing the sliding window attention locally. This requires a halo exchange to swap the overlapping window segments (halos) between neighbouring sequence shards.
Instead of two all-to-all communication steps per layer, the halo exchange requires only a single point-to-point communication between neighbouring GPUs, which should reduce communication time and improve the scalability of model sharding across multiple GPUs.
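For illustration, here is a minimal sketch of the halo exchange idea using `torch.distributed` point-to-point ops. The function name `halo_exchange`, the tensor layout, and the relation between `halo_size` and the attention window are assumptions for the sketch, not the implementation in this PR:

```python
# Minimal halo-exchange sketch for a sequence-sharded tensor.
# Assumes x has shape (local_seq_len, batch, channels) and that ranks
# hold consecutive sequence shards in rank order. Illustrative only.
import torch
import torch.distributed as dist


def halo_exchange(x: torch.Tensor, halo_size: int) -> torch.Tensor:
    """Pad the local sequence shard with halos from neighbouring ranks."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    if world_size == 1:
        return x  # nothing to exchange

    left, right = rank - 1, rank + 1
    send_to_left = x[:halo_size].contiguous()    # our first rows -> left neighbour's right halo
    send_to_right = x[-halo_size:].contiguous()  # our last rows  -> right neighbour's left halo
    recv_from_left = torch.empty_like(send_to_right)
    recv_from_right = torch.empty_like(send_to_left)

    # Single point-to-point exchange with each existing neighbour.
    ops = []
    if left >= 0:
        ops.append(dist.P2POp(dist.isend, send_to_left, left))
        ops.append(dist.P2POp(dist.irecv, recv_from_left, left))
    if right < world_size:
        ops.append(dist.P2POp(dist.isend, send_to_right, right))
        ops.append(dist.P2POp(dist.irecv, recv_from_right, right))
    for req in dist.batch_isend_irecv(ops):
        req.wait()

    # Concatenate halos so sliding window attention can be computed locally.
    parts = []
    if left >= 0:
        parts.append(recv_from_left)
    parts.append(x)
    if right < world_size:
        parts.append(recv_from_right)
    return torch.cat(parts, dim=0)
```

After the local attention pass, the halo rows would be dropped again so each rank keeps only the outputs for its own sequence shard.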