Initialize masked language model with RobertaForMaskedLM missing intermediate_act_fn layer

stewart1 · July 7, 2023, 1:15am

I am trying to train a masked language model from scratch. I use the code below to create the RoBERTa model architecture, but when I compare it with a pretrained RoBERTa model, I find that it does not have the GELU activation layer. Could someone help explain how to do this correctly? Thanks.

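from transformers import RobertaConfig, RobertaForMaskedLM

# Architecture hyperparameters; anything not listed keeps the RobertaConfig default.
config = RobertaConfig(
    vocab_size=50265,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)
# Build the model with randomly initialized weights (no pretrained checkpoint).
model = RobertaForMaskedLM(config=config)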

The RoBERTa layer looks like this:

------ a lot of layers…
(encoder): RobertaEncoder(
  (layer): ModuleList(
    (0): RobertaLayer(
      (attention): RobertaAttention(
        (self): RobertaSelfAttention(
          (query): Linear(in_features=768, out_features=768, bias=True)
          (key): Linear(in_features=768, out_features=768, bias=True)
          (value): Linear(in_features=768, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (output): RobertaSelfOutput(
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (intermediate): RobertaIntermediate(
        (dense): Linear(in_features=768, out_features=3072, bias=True)
        --------- missing GELU activation layer here
      )
------ a lot of layers…

When I load a pretrained model, it looks like this:
------ a lot of layers…
(encoder): RobertaEncoder(
  (layer): ModuleList(
    (0): RobertaLayer(
      (attention): RobertaAttention(
        (self): RobertaSelfAttention(
          (query): Linear(in_features=768, out_features=768, bias=True)
          (key): Linear(in_features=768, out_features=768, bias=True)
          (value): Linear(in_features=768, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (output): RobertaSelfOutput(
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (intermediate): RobertaIntermediate(
        (dense): Linear(in_features=768, out_features=3072, bias=True)
        (intermediate_act_fn): GELUActivation()
      )
------ a lot of layers…
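
A note on reading the two print-outs: printing a PyTorch module lists only attributes that are themselves nn.Module objects. If the activation is stored as a plain Python function, it does not appear in the print-out even though forward() still calls it. Whether that explains the difference here is an assumption, not something confirmed in this thread; it would fit if the two print-outs came from different environments or library versions. The sketch below uses two hypothetical toy classes (not the transformers source) to show the effect:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WithFunction(nn.Module):
    """Hypothetical toy layer: activation stored as a plain function."""

    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 8)
        self.intermediate_act_fn = F.gelu  # a function, not an nn.Module

    def forward(self, x):
        return self.intermediate_act_fn(self.dense(x))


class WithModule(nn.Module):
    """Hypothetical toy layer: activation stored as an nn.Module."""

    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 8)
        self.intermediate_act_fn = nn.GELU()  # registered as a submodule

    def forward(self, x):
        return self.intermediate_act_fn(self.dense(x))


with_function = WithFunction()
with_module = WithModule()

print(with_function)  # lists only (dense)
print(with_module)    # lists (dense) and (intermediate_act_fn)

# Give both layers the same weights and compare their outputs.
with_module.dense.load_state_dict(with_function.dense.state_dict())
x = torch.randn(2, 4)
same_output = torch.allclose(with_function(x), with_module(x))
print("same output:", same_output)
```

Both toy layers apply GELU in forward(); only the printed representation differs.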

stewart1 · July 18, 2023, 4:51pm

Hi Hugging Face community,

Could someone help answer my question? Why is the GELU activation missing? If I understand correctly, this means the whole network is now just a giant linear function with no non-linearity.
Thanks for your insights.

Stuart
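
One way to check the concern about non-linearity directly, rather than relying on the print-out, is to look at the config and at the intermediate_act_fn attribute of the first layer. This is a minimal sketch: the tiny sizes are made up so the model builds quickly, and it assumes the installed transformers version exposes model.roberta.encoder.layer[0].intermediate.intermediate_act_fn.

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Deliberately tiny, hypothetical sizes so the model builds quickly;
# hidden_act is left at its default value.
config = RobertaConfig(
    vocab_size=1000,
    max_position_embeddings=66,
    hidden_size=32,
    intermediate_size=64,
    num_attention_heads=2,
    num_hidden_layers=1,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

# Name of the activation the config asks for.
print("hidden_act:", config.hidden_act)

# The callable that the intermediate layer applies after its dense projection.
intermediate = model.roberta.encoder.layer[0].intermediate
act_fn = intermediate.intermediate_act_fn
print("activation object:", type(act_fn))

# A linear map satisfies f(2x) == 2 * f(x); check whether act_fn does.
x = torch.randn(4, 64)
is_linear = torch.allclose(act_fn(2 * x), 2 * act_fn(x))
print("activation behaves linearly:", is_linear)
```

If the last line prints False, the activation is being applied as a non-linear function, whatever the print-out of the model shows.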

The missing layer is highlighted in the image below:
