hi there,
I am implementing a BERT model with the pipeline parallelism features in DeepSpeed. The last layer of BERT pretraining uses the same embedding weights as the BERTEmbedding layer.
The code would look like the following:
def get_bert_pretrain_layers(args, bert_config: BertConfig):
    """Get a sequential model representation.

    Assume checkpoint_activations=True, output_all_encoded_layers=False.
    """
    layers = [
        PipeBertInputLayer(args),
    ]
    # BertModel part
    bert_embedding = PipeBertEmbeddings(bert_config)
    layers.append(bert_embedding)
    for _ in range(bert_config.num_hidden_layers):
        bert_encoder_layer = PipeBertLayer(bert_config)
        layers.append(bert_encoder_layer)
    # assume: if not output_all_encoded_layers or checkpoint_activations:
    bert_encoder_final_layernorm = PipeBertLayerNorm(bert_config)
    layers.append(bert_encoder_final_layernorm)
    bert_pooler = PipeBertPooler(bert_config)
    layers.append(bert_pooler)
    # BertModel part ends; we now have encoded_layers and pooled_output.
    # The last layer returns the loss and shares its weights with the
    # first embedding layer.
    last_layer = PipeBertPreTrainingHeadsWithLoss(bert_config,
                                                  bert_embedding.word_embeddings)
    layers.append(last_layer)
    return layers
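
For reference, this is roughly how I turn that list into a pipeline engine (a minimal sketch; my real launch script also passes the DeepSpeed config and data loader, and num_stages=2 matches the two-way partition I describe below):

    import deepspeed
    from deepspeed.pipe import PipelineModule

    layers = get_bert_pretrain_layers(args, bert_config)
    # PipelineModule partitions the plain nn.Module list across stages/GPUs.
    net = PipelineModule(layers=layers, num_stages=2)
    engine, _, _, _ = deepspeed.initialize(args=args,
                                           model=net,
                                           model_parameters=net.parameters())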
I suspect that my usage is wrong, because the pipeline will partition the layers into two stages. The second stage will be placed on a different GPU and thus may not be able to access the embedding table from the first stage.
But my script seems to work, and I don't know why.
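
Reading the DeepSpeed pipeline docs, I wonder whether tied weights across stages are supposed to be declared with TiedLayerSpec instead of passing the embedding module directly. Here is a sketch of what I think that would look like for my model (_head_forward and bert_pretraining_head_with_loss are hypothetical stand-ins for what PipeBertPreTrainingHeadsWithLoss does, and the tied_weight_attr path is my guess at where the shared Parameter lives):

    from deepspeed.pipe import LayerSpec, TiedLayerSpec

    def _head_forward(embedding_module, inputs):
        # Hypothetical helper: compute the pretraining logits/loss with the
        # weight owned by the shared embedding module.
        return bert_pretraining_head_with_loss(
            inputs, embedding_module.word_embeddings.weight)

    def get_bert_pretrain_specs(args, bert_config: BertConfig):
        specs = [LayerSpec(PipeBertInputLayer, args)]
        # The first spec with the key 'embed' builds the module; later specs
        # with the same key reuse it on their own stage, and DeepSpeed
        # all-reduces the tied gradients between the stages.
        specs.append(TiedLayerSpec('embed', PipeBertEmbeddings, bert_config,
                                   tied_weight_attr='word_embeddings.weight'))
        for _ in range(bert_config.num_hidden_layers):
            specs.append(LayerSpec(PipeBertLayer, bert_config))
        specs.append(LayerSpec(PipeBertLayerNorm, bert_config))
        specs.append(LayerSpec(PipeBertPooler, bert_config))
        # Same key again: the head runs with the same embedding weight even
        # if it lands on a different GPU.
        specs.append(TiedLayerSpec('embed', PipeBertEmbeddings, bert_config,
                                   forward_fn=_head_forward,
                                   tied_weight_attr='word_embeddings.weight'))
        return specs

Is that the intended mechanism here?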
Any suggestions?