I am trying to initialize a bert2bert model with bert-base-uncased as the encoder and bert-large-uncased as the decoder, using the following code:
from transformers import EncoderDecoderModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encoder: bert-base-uncased (hidden size 768), decoder: bert-large-uncased (hidden size 1024)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-large-uncased"
)
When I print all the layers of the model, the relevant part of the output is:
...
(attention): BertAttention(
  (self): BertSelfAttention(
    (query): Linear(in_features=1024, out_features=1024, bias=True)
    (key): Linear(in_features=1024, out_features=1024, bias=True)
    (value): Linear(in_features=1024, out_features=1024, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (output): BertSelfOutput(
    (dense): Linear(in_features=1024, out_features=1024, bias=True)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
(crossattention): BertAttention(
  (self): BertSelfAttention(
    (query): Linear(in_features=1024, out_features=1024, bias=True)
    (key): Linear(in_features=1024, out_features=1024, bias=True)
    (value): Linear(in_features=1024, out_features=1024, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (output): BertSelfOutput(
    (dense): Linear(in_features=1024, out_features=1024, bias=True)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
...
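If it helps, I can also read the same shapes directly off the first decoder layer (assuming I am navigating the module tree correctly):

layer0 = model.decoder.bert.encoder.layer[0]
print(layer0.crossattention.self.query.weight.shape)  # torch.Size([1024, 1024])
print(layer0.crossattention.self.key.weight.shape)    # torch.Size([1024, 1024])
print(layer0.crossattention.self.value.weight.shape)  # torch.Size([1024, 1024])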
For the cross-attention part, I think the query and key layers should have shape (768, 1024), i.e. in_features=768 and out_features=1024, instead of (1024, 1024), since cross-attention should map the bert-base output dimension (768) to the bert-large dimension (1024). Can anyone tell me where my mistake is and explain why the shapes are (1024, 1024)? Many thanks!