DeepSpeed Sequence Parallel #5266
Replies: 1 comment 1 reply
-
@vijetadeshpande, no, we do not treat the two halves as separate examples; instead, each sample is partitioned along the sequence dimension by the number of ranks in the sequence parallel group. For your specific example, a 2k sequence length should be partitioned so that there are 1k tokens per GPU before the call to the DistributedAttention function. It appears that your code is not splitting the sequence this way. Conceptually, Ulysses divides input data along the sequence dimension before and after the attention block, and along the head dimension within the attention block (as shown in figure 1 of this blog). It is important to note that the client code is expected to perform the sequence dimension split before invoking DistributedAttention, as demonstrated in Megatron DeepSpeed (link). Please let us know if you need further clarification.
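To make the split concrete, here is a minimal sketch (not DeepSpeed's own code) of the per-rank sequence sharding the reply describes; the group handle `sp_group`, the `[batch, seq_len, hidden]` layout, and the helper name `split_sequence` are illustrative assumptions:

```python
import torch
import torch.distributed as dist


def split_sequence(hidden_states: torch.Tensor, sp_group) -> torch.Tensor:
    """Keep only this rank's shard of the sequence dimension.

    hidden_states: [batch, seq_len, hidden]; seq_len must be divisible by the
    sequence-parallel world size (e.g. 2048 tokens over 2 ranks -> 1024 each).
    """
    sp_world_size = dist.get_world_size(group=sp_group)
    sp_rank = dist.get_rank(group=sp_group)

    seq_len = hidden_states.size(1)
    assert seq_len % sp_world_size == 0, "seq_len must divide evenly across SP ranks"

    shard_len = seq_len // sp_world_size
    start = sp_rank * shard_len
    # The [batch, seq_len // sp_world_size, hidden] shard returned here is what
    # the model (and ultimately DistributedAttention) should see on each rank.
    return hidden_states[:, start : start + shard_len, :].contiguous()
```

With 2 sequence-parallel ranks and a 2k context, each rank ends up with the 1k-token shard mentioned above.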
-
I did not find the implementation of Seq Parallel as straightforward as mentioned in THIS blog.
Questions:
1. The seq_len dimension splits into 2, and the second half of the sequence then acts as a new example in the batch. Is this expected behavior?
2. I evaluated Llama-7b with a context length of 2k. The loss and PPL values are basically equal to those of a randomly initialized model. Can someone help me figure out the issue?
modeling_llama.py file:
loss: 9.453394889831543
PPL: 12751.3818359375
lm_logits: torch.Size([1, 2048, 32000])
#4359 #4199
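As a quick sanity check on those numbers (a sketch, assuming PPL is reported as exp(loss), which the figures above are consistent with): a model that predicts uniformly over a 32000-token vocabulary has cross-entropy ln(32000) ≈ 10.37 (PPL 32000), so a loss of ~9.45 and PPL of ~12751 is indeed close to an untrained baseline.

```python
import math

loss = 9.453394889831543      # reported eval loss
vocab_size = 32000            # from lm_logits: torch.Size([1, 2048, 32000])

ppl = math.exp(loss)                  # ~12751.4, matching the reported PPL
uniform_loss = math.log(vocab_size)   # ~10.37: cross-entropy of a uniform predictor
print(f"PPL implied by loss: {ppl:.1f}")
print(f"Uniform-baseline loss: {uniform_loss:.2f} (PPL {vocab_size})")
```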