BigBirdPegasus with attention_type="original_full" vs T5

I've noticed that BigBirdPegasus with attention_type="original_full" has fewer parameters and trains faster than a T5 model with the same hyperparameters (d_model, number of layers, number of heads, FFN size). (When I say that it trains faster, I mean that it completes a training epoch in less wall-clock time. I don't know whether it actually converges faster; a rough sketch of the per-step timing I have in mind is at the end of this post.)

Example:

import transformers

conf = transformers.T5Config(vocab_size=30000,
                             max_length=768,
                             max_position_embeddings=768,
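                             # note: max_length only sets the default generation length, and
                             # max_position_embeddings is not a real T5Config field (T5 uses
                             # relative position biases); both just get stored as attributes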
                             d_model=768,
                             d_kv=64,
                             num_heads=768//64,
                             num_layers=8,
                             dropout_rate=0.0,
                             d_ff=3072,
                             feed_forward_proj='gated-gelu',
                             use_cache=False,
                             )
M = transformers.T5ForConditionalGeneration(conf)
print(M.num_parameters())  # 192942336

conf = transformers.BigBirdPegasusConfig(vocab_size=30000,
                                         max_position_embeddings=768,
                                         encoder_layers=8,
                                         encoder_ffn_dim=3072,
                                         encoder_attention_heads=768//64,
                                         decoder_layers=8,
                                         decoder_ffn_dim=3072,
                                         decoder_attention_heads=768//64,
                                         use_cache=False,
                                         d_model=768,
                                         attention_type='original_full'
                                         )

M2 = transformers.BigBirdPegasusForConditionalGeneration(conf)
print(M2.num_parameters())  # 156466176

The T5 model has about 23% more parameters than the BigBirdPegasus model.
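To help localize where the extra parameters sit, here is a rough sketch that tallies parameter counts per submodule name prefix (params_by_prefix is just a throwaway helper I wrote for this post; grouping by the first two name components is an arbitrary choice):

from collections import Counter

def params_by_prefix(model, depth=2):
    # Tally parameter counts by the first `depth` components of each
    # parameter name, e.g. 'encoder.block' for T5 or 'model.encoder'
    # for BigBirdPegasus.
    counts = Counter()
    for name, param in model.named_parameters():
        counts['.'.join(name.split('.')[:depth])] += param.numel()
    return dict(counts)

print(params_by_prefix(M))   # T5
print(params_by_prefix(M2))  # BigBirdPegasus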

Why is that?
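
For reference, the speed claim refers to wall-clock time per optimizer step, roughly in the spirit of the sketch below (batch size, sequence length, and optimizer are arbitrary choices here, not my actual training setup):

import time
import torch

def time_per_step(model, steps=5, batch_size=4, seq_len=768, vocab_size=30000):
    # Very rough wall-clock time per forward/backward/optimizer step
    # on random token IDs; this says nothing about convergence.
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
    labels = torch.randint(0, vocab_size, (batch_size, seq_len))
    start = time.time()
    for _ in range(steps):
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return (time.time() - start) / steps

print(time_per_step(M))   # T5
print(time_per_step(M2))  # BigBirdPegasus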