BertGeneration trains over 2x faster than T5ForConditionalGeneration. Am I doing something wrong?

I am working on training some sequence-to-sequence models from scratch, using a custom tokenizer and a custom dataset. Since I am starting from scratch, I would expect a BertGeneration encoder-decoder model (using relative position attention) and a T5ForConditionalGeneration model to be roughly equivalent choices to train. However…

import transformers

# tokenizer is my custom tokenizer (defined elsewhere); it exposes
# pad_id(), bos_id(), and eos_id().
VOCAB_SIZE = 20000
in_length = 768  # maximum sequence length
d_model = 512    # hidden size
conf = transformers.BertGenerationConfig(vocab_size=VOCAB_SIZE,
                                         pad_token_id=tokenizer.pad_id(),
                                         bos_token_id=tokenizer.bos_id(),
                                         eos_token_id=tokenizer.eos_id(),
                                         max_position_embeddings=in_length,
                                         hidden_size=d_model,
                                         num_attention_heads=d_model//64,
                                         num_hidden_layers=6,
                                         intermediate_size=d_model*4,
                                         hidden_dropout_prob=0.0,
                                         attention_probs_dropout_prob=0.0,
                                         hidden_act='gelu_new',
                                         position_embedding_type='relative_key',
                                         use_cache=False,
                                        )
encoder = transformers.BertGenerationEncoder(conf)
decoder = transformers.BertGenerationDecoder(conf)

M = transformers.EncoderDecoderModel(encoder=encoder, decoder=decoder)
M.config.decoder_start_token_id = tokenizer.bos_id()
M.config.pad_token_id = tokenizer.pad_id()

The above creates a Bert encoder-decoder model with about 60 million parameters. With a batch size of 12 and the Hugging Face Trainer, I can train it on 50k examples in about 16 minutes on a P100.
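(The parameter counts in this post come from a quick helper along these lines — num_params is just a local name, not anything from the library:)

def num_params(model):
    # Count trainable parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(num_params(M))  # roughly 60M for the model above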

If I create a T5ForConditionalGeneration with an analogous configuration (using “gated-gelu” instead of “gelu_new”, for instance), I get a model with about 66 million parameters, and with the same batch size it takes me over 38 minutes to train the same number of examples. GPU memory usage is also higher when training the T5 model than when training the Bert model.
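For reference, the T5 model is created roughly like this (a sketch of the analogous configuration: d_model, layer count, head count, and dropout match the Bert config above, and the token-id wiring follows my tokenizer):

t5_conf = transformers.T5Config(vocab_size=VOCAB_SIZE,
                                d_model=d_model,
                                d_kv=64,
                                d_ff=d_model*4,
                                num_layers=6,
                                num_heads=d_model//64,
                                dropout_rate=0.0,
                                feed_forward_proj='gated-gelu',
                                pad_token_id=tokenizer.pad_id(),
                                eos_token_id=tokenizer.eos_id(),
                                decoder_start_token_id=tokenizer.bos_id(),
                                use_cache=False,
                               )
T = transformers.T5ForConditionalGeneration(t5_conf)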

Why is this?

I think the code for creating the model above is wrong: the decoder is built from a config that never sets is_decoder=True or add_cross_attention=True, so (as far as I can tell) it ends up with no cross-attention layers and no causal masking. That would explain why it is both smaller and faster than a real encoder-decoder.

If I do

# from_encoder_decoder_configs() sets is_decoder=True and
# add_cross_attention=True on the decoder config, so the decoder is
# built with cross-attention layers and causal masking.
config = transformers.EncoderDecoderConfig.from_encoder_decoder_configs(conf, conf)
M = transformers.EncoderDecoderModel(config=config)

then I get a model with ~72 million parameters that takes a similar amount of time and memory to train as the T5 model. I’m still not sure why the T5 model has ~6 million fewer parameters (perhaps because T5 shares a single embedding matrix between the encoder and decoder), but this all feels generally correct now.
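Equivalently (if I understand the library correctly), the fix can be made explicit by giving the decoder its own config with the decoder flags set:

import copy

dec_conf = copy.deepcopy(conf)
dec_conf.is_decoder = True            # causal masking in self-attention
dec_conf.add_cross_attention = True   # attend over the encoder output

encoder = transformers.BertGenerationEncoder(conf)
decoder = transformers.BertGenerationDecoder(dec_conf)
M = transformers.EncoderDecoderModel(encoder=encoder, decoder=decoder)
M.config.decoder_start_token_id = tokenizer.bos_id()
M.config.pad_token_id = tokenizer.pad_id()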