hi there,
I am implementing a BERT model with the pipeline parallelism features in DeepSpeed. The last layer of BERT pretraining uses the same embedding weights as the BERTEmbedding layer.
The code would look like the following:
def get_bert_pretrain_layers(args, bert_config: BertConfig):
    """Get a sequential model representation.

    Assume checkpoint_activations=True, output_all_encoded_layers=False.
    """
    layers = [
        PipeBertInputLayer(args),
    ]
    # BertModel part
    bert_embedding = PipeBertEmbeddings(bert_config)
    layers.append(bert_embedding)
    for _ in range(bert_config.num_hidden_layers):
        bert_encoder_layer = PipeBertLayer(bert_config)
        layers.append(bert_encoder_layer)
    # assume: if not output_all_encoded_layers or checkpoint_activations:
    bert_encoder_final_layernorm = PipeBertLayerNorm(bert_config)
    layers.append(bert_encoder_final_layernorm)
    bert_pooler = PipeBertPooler(bert_config)
    layers.append(bert_pooler)
    # BertModel part ends; we now have encoded_layers and pooled_output.
    # The last layer returns the loss and shares its weights with the
    # first embedding layer.
    last_layer = PipeBertPreTrainingHeadsWithLoss(bert_config,
                                                  bert_embedding.word_embeddings)
    layers.append(last_layer)
    return layers
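
For reference, this is roughly how I turn that list into a pipeline engine (a minimal sketch; my real launch script also passes the DeepSpeed config and data loader, and num_stages=2 matches the two-way partition I describe below):

    import deepspeed
    from deepspeed.pipe import PipelineModule

    layers = get_bert_pretrain_layers(args, bert_config)
    # PipelineModule partitions the plain nn.Module list across stages/GPUs.
    net = PipelineModule(layers=layers, num_stages=2)
    engine, _, _, _ = deepspeed.initialize(args=args,
                                           model=net,
                                           model_parameters=net.parameters())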
I suspect that my usage is wrong, because the pipeline will partition the layers into two stages. The second stage will be placed on a different GPU and thus may not be able to access the embedding table from the first stage.
But my script seems to work, and I don't know why.
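
Reading the DeepSpeed pipeline docs, I wonder whether tied weights across stages are supposed to be declared with TiedLayerSpec instead of passing the embedding module directly. Here is a sketch of what I think that would look like for my model (_head_forward and bert_pretraining_head_with_loss are hypothetical stand-ins for what PipeBertPreTrainingHeadsWithLoss does, and the tied_weight_attr path is my guess at where the shared Parameter lives):

    from deepspeed.pipe import LayerSpec, TiedLayerSpec

    def _head_forward(embedding_module, inputs):
        # Hypothetical helper: compute the pretraining logits/loss with the
        # weight owned by the shared embedding module.
        return bert_pretraining_head_with_loss(
            inputs, embedding_module.word_embeddings.weight)

    def get_bert_pretrain_specs(args, bert_config: BertConfig):
        specs = [LayerSpec(PipeBertInputLayer, args)]
        # The first spec with the key 'embed' builds the module; later specs
        # with the same key reuse it on their own stage, and DeepSpeed
        # all-reduces the tied gradients between the stages.
        specs.append(TiedLayerSpec('embed', PipeBertEmbeddings, bert_config,
                                   tied_weight_attr='word_embeddings.weight'))
        for _ in range(bert_config.num_hidden_layers):
            specs.append(LayerSpec(PipeBertLayer, bert_config))
        specs.append(LayerSpec(PipeBertLayerNorm, bert_config))
        specs.append(LayerSpec(PipeBertPooler, bert_config))
        # Same key again: the head runs with the same embedding weight even
        # if it lands on a different GPU.
        specs.append(TiedLayerSpec('embed', PipeBertEmbeddings, bert_config,
                                   forward_fn=_head_forward,
                                   tied_weight_attr='word_embeddings.weight'))
        return specs

Is that the intended mechanism here?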
Any suggestions?