Huge difference in speed when finetuning summarization with different scripts

From transformers 3.x, I have been using examples/seq2seq/ (huggingface-transformers/ at v3.5.1 · microsoft/huggingface-transformers · GitHub) to finetune the pegasus model and it's been working fine.

After upgrading to transformers 4.x, that script has been moved to legacy, so I'm thinking of using examples/seq2seq/ (transformers/ at v4.4.2 · huggingface/transformers · GitHub) for the same training.

I moved pieces around to make the old script work under transformers 4.x as well, to have a fair comparison. Other than the dataset format difference between the two, the main thing I noticed was the huge difference in training speed.

For the same training dataset (~6 million data points):

  1. old script (transformers 3.x): with 4 V100 GPUs, taking ~6.5h/epoch.
  2. new script (transformers 4.x): with 4 V100 GPUs, taking ~15h/epoch.
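The gap between the two runs works out to roughly a 2.3× slowdown; a quick sanity check of the ratio:

```python
# Epoch times reported above for the same ~6M-example dataset on 4 V100s.
old_epoch_h = 6.5   # old (transformers 3.x) script
new_epoch_h = 15.0  # new (transformers 4.x) script

slowdown = new_epoch_h / old_epoch_h
print(f"slowdown: {slowdown:.1f}x")  # -> slowdown: 2.3x
```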

Upon reading the code, the script from transformers 3.x uses pytorch-lightning, but I believe the new one just uses plain pytorch. Is this the main cause of the difference in speed? Or are there any significant implementation differences between the two?

I can provide more details if needed.


The two scripts don't have the same defaults at all, so there could be plenty of reasons for the differences in speed. Could you tell us what command lines you use to run them in both cases?

Of course. Thank you for looking into it.

For the tfmr3 (old) script:

python \
    --learning_rate=1e-4 \
    --do_train \
    --do_predict \
    --n_val 1000 \
    --num_train_epochs 1 \
    --val_check_interval 0.25 \
    --max_source_length 512 --max_target_length 56 \
    --freeze_embeds --label_smoothing 0.1 --adafactor --task summarization_xsum \
    --model_name_or_path "tuner007/pegasus_paraphrase" \
    --data_dir {data_dir} \
    --output_dir {output_dir} \
    --gpus 4

For the new script:

python tfmr4/ \
    --model_name_or_path "tuner007/pegasus_paraphrase" \
    --cache_dir $CACHE_DIR \
    --train_file $TRAIN_FILE \
    --validation_file $VAL_FILE \
    --test_file $TEST_FILE \
    --output_dir $MODEL_OUTPUT_DIR \
    --learning_rate=1e-4 \
    --num_train_epochs=1 \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=32 \
    --do_train \
    --do_predict \
    --max_source_length 512 \
    --max_target_length 56 \
    --label_smoothing 0.1 \
    --adafactor

There were a couple of configs that exist in the old script but no longer in the new one. I will also look into all the possible configurations for the two scripts and spot any differences.

At first glance, freeze_embeds could explain some of the speed difference, as the embedding is a huge matrix for which gradient updates and optimizer steps would otherwise be computed.
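To get a feel for how big that matrix is, a back-of-the-envelope count (assuming pegasus-large's vocab size of 96,103 and hidden size 1,024; check `model.config.vocab_size` and `model.config.d_model` for the actual checkpoint):

```python
# Rough size of the embedding matrix that freeze_embeds takes out of
# the backward pass and optimizer step.
vocab_size = 96_103   # assumed pegasus-large value
hidden_size = 1_024   # assumed pegasus-large value

embed_params = vocab_size * hidden_size
print(f"embedding parameters: {embed_params / 1e6:.1f}M")
# Adam-style optimizers also keep extra per-parameter state (momentum,
# variance), so freezing saves more than just the gradient computation.
```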

I agree, and I have always wanted to specify that, but I don't see an option to freeze it in the new script. Are there any hacks that would allow me to do that?

Update: I added the following code from the legacy script to freeze the embedding layers in the new one:

    from torch import nn

    def freeze_params(model: nn.Module):
        """Set requires_grad=False for each of model.parameters()"""
        for par in model.parameters():
            par.requires_grad = False

    def freeze_embeds(model):
        """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5."""
        model_type = model.config.model_type

        if model_type == "t5":
            freeze_params(model.shared)
            for d in [model.encoder, model.decoder]:
                freeze_params(d.embed_tokens)
        elif model_type == "fsmt":
            for d in [model.model.encoder, model.model.decoder]:
                freeze_params(d.embed_positions)
                freeze_params(d.embed_tokens)
        else:
            freeze_params(model.model.shared)
            for d in [model.model.encoder, model.model.decoder]:
                d.embed_positions.requires_grad = False
                freeze_params(d.embed_tokens)

However, there's no significant change in the training speed :frowning:.
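Before ruling the freeze out, it's worth checking that it actually took effect by counting trainable parameters before and after. A minimal sketch with a toy module (on the real model, compare `sum(p.numel() for p in model.parameters() if p.requires_grad)` before and after calling `freeze_embeds`):

```python
import torch.nn as nn

def freeze_params(model: nn.Module):
    """Set requires_grad=False for each of model.parameters()"""
    for par in model.parameters():
        par.requires_grad = False

def n_trainable(model: nn.Module) -> int:
    """Count parameters that still receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy stand-in: an "embedding" plus a small head.
toy = nn.Sequential(nn.Embedding(100, 8), nn.Linear(8, 8))
before = n_trainable(toy)   # 100*8 + 8*8 + 8 = 872
freeze_params(toy[0])       # freeze only the embedding
after = n_trainable(toy)    # 8*8 + 8 = 72
print(before, "->", after)
```

If the trainable count doesn't drop on the real model, the freezing code may be running before the Trainer re-instantiates or re-wraps the model, in which case the optimizer is still updating the embeddings.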