BertGeneration trains over 2x faster than T5ForConditionalGeneration. Am I doing something wrong?

I am working on training some sequence-to-sequence models from scratch, using a custom tokenizer and a custom dataset. Since I am starting from scratch, I would expect a BertGeneration encoder-decoder model (using relative position attention) and a T5ForConditionalGeneration model to be roughly equivalent choices to train. However…

import transformers

# tokenizer is my custom tokenizer (defined elsewhere); it exposes
# pad_id(), bos_id(), and eos_id().
VOCAB_SIZE = 20000
in_length = 768  # maximum sequence length
d_model = 512    # hidden size
conf = transformers.BertGenerationConfig(vocab_size=VOCAB_SIZE,
                                         pad_token_id=tokenizer.pad_id(),
                                         bos_token_id=tokenizer.bos_id(),
                                         eos_token_id=tokenizer.eos_id(),
                                         max_position_embeddings=in_length,
                                         hidden_size=d_model,
                                         num_attention_heads=d_model//64,
                                         num_hidden_layers=6,
                                         intermediate_size=d_model*4,
                                         hidden_dropout_prob=0.0,
                                         attention_probs_dropout_prob=0.0,
                                         hidden_act='gelu_new',
                                         position_embedding_type='relative_key',
                                         use_cache=False,
                                        )
encoder = transformers.BertGenerationEncoder(conf)
decoder = transformers.BertGenerationDecoder(conf)

M = transformers.EncoderDecoderModel(encoder=encoder, decoder=decoder)
M.config.decoder_start_token_id = tokenizer.bos_id()
M.config.pad_token_id = tokenizer.pad_id()

The above creates a Bert encoder-decoder model with about 60 million parameters. With a batch size of 12 and the Hugging Face Trainer, I can train it on 50k examples in about 16 minutes on a P100.
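(The parameter counts in this post come from a quick helper along these lines — num_params is just a local name, not anything from the library:)

def num_params(model):
    # Count trainable parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(num_params(M))  # roughly 60M for the model above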

If I create a T5ForConditionalGeneration with an analogous configuration (using “gated-gelu” instead of “gelu_new”, for instance), I get a model with about 66 million parameters, and with the same batch size it takes me over 38 minutes to train the same number of examples. GPU memory usage is also higher when training the T5 model than when training the Bert model.
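For reference, the T5 model is created roughly like this (a sketch of the analogous configuration: d_model, layer count, head count, and dropout match the Bert config above, and the token-id wiring follows my tokenizer):

t5_conf = transformers.T5Config(vocab_size=VOCAB_SIZE,
                                d_model=d_model,
                                d_kv=64,
                                d_ff=d_model*4,
                                num_layers=6,
                                num_heads=d_model//64,
                                dropout_rate=0.0,
                                feed_forward_proj='gated-gelu',
                                pad_token_id=tokenizer.pad_id(),
                                eos_token_id=tokenizer.eos_id(),
                                decoder_start_token_id=tokenizer.bos_id(),
                                use_cache=False,
                               )
T = transformers.T5ForConditionalGeneration(t5_conf)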

Why is this?

I think the code for creating the model above is wrong: the decoder is built from a config that never sets is_decoder=True or add_cross_attention=True, so (as far as I can tell) it ends up with no cross-attention layers and no causal masking. That would explain why it is both smaller and faster than a real encoder-decoder.

If I do

# from_encoder_decoder_configs() sets is_decoder=True and
# add_cross_attention=True on the decoder config, so the decoder is
# built with cross-attention layers and causal masking.
config = transformers.EncoderDecoderConfig.from_encoder_decoder_configs(conf, conf)
M = transformers.EncoderDecoderModel(config=config)

then I get a model with ~72 million parameters that takes a similar amount of time and memory to train as the T5 model. I’m still not sure why the T5 model has ~6 million fewer parameters (perhaps because T5 shares a single embedding matrix between the encoder and decoder), but this all feels generally correct now.
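Equivalently (if I understand the library correctly), the fix can be made explicit by giving the decoder its own config with the decoder flags set:

import copy

dec_conf = copy.deepcopy(conf)
dec_conf.is_decoder = True            # causal masking in self-attention
dec_conf.add_cross_attention = True   # attend over the encoder output

encoder = transformers.BertGenerationEncoder(conf)
decoder = transformers.BertGenerationDecoder(dec_conf)
M = transformers.EncoderDecoderModel(encoder=encoder, decoder=decoder)
M.config.decoder_start_token_id = tokenizer.bos_id()
M.config.pad_token_id = tokenizer.pad_id()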