About the Cross-attention Layer Shape in Encoder-Decoder Model

I am trying to initialize a bert2bert model with bert-base-uncased as the encoder and bert-large-uncased as the decoder, using the following code:

from transformers import EncoderDecoderModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-large-uncased") 

When I print all the layers of the model, the result is:

...
(attention): BertAttention(
  (self): BertSelfAttention(
    (query): Linear(in_features=1024, out_features=1024, bias=True)
    (key): Linear(in_features=1024, out_features=1024, bias=True)
    (value): Linear(in_features=1024, out_features=1024, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (output): BertSelfOutput(
    (dense): Linear(in_features=1024, out_features=1024, bias=True)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
(crossattention): BertAttention(
  (self): BertSelfAttention(
    (query): Linear(in_features=1024, out_features=1024, bias=True)
    (key): Linear(in_features=1024, out_features=1024, bias=True)
    (value): Linear(in_features=1024, out_features=1024, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (output): BertSelfOutput(
    (dense): Linear(in_features=1024, out_features=1024, bias=True)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
...
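
The hidden-size mismatch between the two sub-models can also be checked directly from their configs (continuing from the snippet above; the printed values are what I would expect for these two checkpoints):

# Encoder and decoder keep their own hidden sizes after from_encoder_decoder_pretrained
print(model.encoder.config.hidden_size)  # 768  (bert-base-uncased)
print(model.decoder.config.hidden_size)  # 1024 (bert-large-uncased)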

For the cross-attention part, I think the shape for key and value should be (768, 1024) instead of (1024, 1024), since they project the bert-base output dimension (768) to the bert-large dimension (1024). Can anyone tell me where I went wrong and give me an explanation? Many thanks!

I found the code in the EncoderDecoderModel class that maps the encoder hidden state size to the decoder hidden state size, in the link here. Problem solved.
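
Concretely, when the encoder and decoder hidden sizes differ, EncoderDecoderModel inserts a linear projection that maps the encoder outputs from 768 to 1024 before they reach the decoder's cross-attention, so the cross-attention weights can stay at the decoder's own size (1024, 1024). A minimal check, assuming the attribute is named enc_to_dec_proj as in recent transformers versions (the exact name may vary with your installed version):

from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-large-uncased")

# Projection applied to the encoder hidden states before cross-attention;
# attribute name enc_to_dec_proj assumed from recent transformers versions.
print(model.enc_to_dec_proj)
# Expected to be something like: Linear(in_features=768, out_features=1024, bias=True)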
