DeepSpeed Sequence Parallel #5266
Replies: 1 comment 1 reply
-
@vijetadeshpande, no, we do not treat the two halves as separate examples; instead, each sample is partitioned along the sequence dimension by the number of ranks in the sequence parallel group. For your specific example, a 2k sequence length should be partitioned so that there are 1k tokens per GPU before the call to the DistributedAttention function. It appears that your code is not splitting the sequence this way. Conceptually, Ulysses divides input data along the sequence dimension before and after the attention block, and along the head dimension within the attention block (as shown in figure 1 of this blog). It is important to note that the client code is expected to perform the sequence dimension split before invoking DistributedAttention, as demonstrated in Megatron DeepSpeed (link). Please let us know if you need further clarification.
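To make the split concrete, here is a minimal sketch (not DeepSpeed's own code) of the per-rank sequence sharding the reply describes; the group handle `sp_group`, the `[batch, seq_len, hidden]` layout, and the helper name `split_sequence` are illustrative assumptions:

```python
import torch
import torch.distributed as dist


def split_sequence(hidden_states: torch.Tensor, sp_group) -> torch.Tensor:
    """Keep only this rank's shard of the sequence dimension.

    hidden_states: [batch, seq_len, hidden]; seq_len must be divisible by the
    sequence-parallel world size (e.g. 2048 tokens over 2 ranks -> 1024 each).
    """
    sp_world_size = dist.get_world_size(group=sp_group)
    sp_rank = dist.get_rank(group=sp_group)

    seq_len = hidden_states.size(1)
    assert seq_len % sp_world_size == 0, "seq_len must divide evenly across SP ranks"

    shard_len = seq_len // sp_world_size
    start = sp_rank * shard_len
    # The [batch, seq_len // sp_world_size, hidden] shard returned here is what
    # the model (and ultimately DistributedAttention) should see on each rank.
    return hidden_states[:, start : start + shard_len, :].contiguous()
```

With 2 sequence-parallel ranks and a 2k context, each rank ends up with the 1k-token shard mentioned above.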
-
I did not find the implementation of Seq Parallel as straightforward as mentioned in THIS blog.
Questions:
1. The seq_len dimension splits into 2, and the second half of the sequence then acts as a new example in the batch. Is this expected behavior?
2. I evaluated Llama-7b with a context length of 2k. The loss and PPL values are basically equal to those of a randomly initialized model. Can someone help me figure out the issue?
modeling_llama.py file:
loss: 9.453394889831543
PPL: 12751.3818359375
lm_logits: torch.Size([1, 2048, 32000])
#4359 #4199
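As a quick sanity check on those numbers (a sketch, assuming PPL is reported as exp(loss), which the figures above are consistent with): a model that predicts uniformly over a 32000-token vocabulary has cross-entropy ln(32000) ≈ 10.37 (PPL 32000), so a loss of ~9.45 and PPL of ~12751 is indeed close to an untrained baseline.

```python
import math

loss = 9.453394889831543      # reported eval loss
vocab_size = 32000            # from lm_logits: torch.Size([1, 2048, 32000])

ppl = math.exp(loss)                  # ~12751.4, matching the reported PPL
uniform_loss = math.log(vocab_size)   # ~10.37: cross-entropy of a uniform predictor
print(f"PPL implied by loss: {ppl:.1f}")
print(f"Uniform-baseline loss: {uniform_loss:.2f} (PPL {vocab_size})")
```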