I've noticed that BigBirdPegasus with attention_type='original_full' has fewer parameters and trains faster than a T5 model configured with matching hyperparameters. (By "trains faster" I mean it completes a training epoch in less time; I don't know whether it actually converges faster.)
Example:
import transformers

conf = transformers.T5Config(
    vocab_size=30000,
    max_length=768,
    max_position_embeddings=768,
    d_model=768,
    d_kv=64,
    num_heads=768 // 64,
    num_layers=8,
    dropout_rate=0.0,
    d_ff=3072,
    feed_forward_proj='gated-gelu',
    use_cache=False,
)
M = transformers.T5ForConditionalGeneration(conf)
print(M.num_parameters())  # 192942336
conf = transformers.BigBirdPegasusConfig(
    vocab_size=30000,
    max_position_embeddings=768,
    encoder_layers=8,
    encoder_ffn_dim=3072,
    encoder_attention_heads=768 // 64,
    decoder_layers=8,
    decoder_ffn_dim=3072,
    decoder_attention_heads=768 // 64,
    use_cache=False,
    d_model=768,
    attention_type='original_full',
)
M2 = transformers.BigBirdPegasusForConditionalGeneration(conf)
print(M2.num_parameters())  # 156466176
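To see roughly where the difference sits, a per-submodule breakdown of the counts can be printed. The helper params_by_prefix below is my own, not part of the transformers API:

from collections import Counter

def params_by_prefix(model, depth=2):
    # Sum parameter counts, grouped by the first `depth` components of each parameter name.
    counts = Counter()
    for name, p in model.named_parameters():
        counts['.'.join(name.split('.')[:depth])] += p.numel()
    return counts

print(params_by_prefix(M))   # prefix -> parameter count for the T5 model
print(params_by_prefix(M2))  # same for the BigBirdPegasus model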
The T5 model has about 23% more parameters than the BigBirdPegasus model.
Why is that?
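In case the layer structure matters, printing one encoder block from each model shows how they are laid out (just for inspection; I'm not claiming these particular blocks account for the whole gap):

print(M.encoder.block[0])          # first T5 encoder block
print(M2.model.encoder.layers[0])  # first BigBirdPegasus encoder layer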