Huge difference in speed when finetuning summarization with different scripts

Since transformers 3.x, I have been using examples/seq2seq/finetune.py (huggingface-transformers/finetune.py at v3.5.1 · microsoft/huggingface-transformers · GitHub) to fine-tune the pegasus model, and it's been working fine.

After upgrading to transformers 4.x, that script has been moved to legacy, so I'm thinking of using examples/seq2seq/run_summarization.py (transformers/run_summarization.py at v4.4.2 · huggingface/transformers · GitHub) for the same training.

I moved some pieces around to make the old finetune.py work under transformers 4.x as well, to have a fair comparison. Other than the difference in dataset format, the main thing I noticed was a huge difference in training speed between the two.

For the same training dataset (~6 million data points):

  1. finetune.py: with 4 V100 GPUs, ~6.5 h/epoch.
  2. run_summarization.py: with 4 V100 GPUs, ~15 h/epoch.

From reading the code, finetune.py from transformers 3.x uses pytorch-lightning, while run_summarization.py, I believe, just uses plain PyTorch. Is this the main cause of the speed difference, or are there other significant implementation differences between the two?

I can provide more details if needed.

Thanks!

The two scripts don't have the same defaults at all, so there could be plenty of reasons for the differences in speed. Could you tell us what command lines you use to run them in both cases?
Thanks!

Of course. Thank you for looking into it.

For the tfmr3 finetune.py:

python finetune.py \
    --learning_rate=1e-4 \
    --do_train \
    --do_predict \
    --n_val 1000 \
    --num_train_epochs 1 \
    --val_check_interval 0.25 \
    --max_source_length 512 --max_target_length 56 \
    --freeze_embeds --label_smoothing 0.1 --adafactor --task summarization_xsum \
    --model_name_or_path "tuner007/pegasus_paraphrase" \
    --data_dir {data_dir} \
    --output_dir {output_dir} \
    --gpus 4 \
    --overwrite_output_dir

For the new run_summarization.py:

python tfmr4/run_summarization.py \
    --model_name_or_path "tuner007/pegasus_paraphrase" \
    --cache_dir $CACHE_DIR \
    --train_file $TRAIN_FILE \
    --validation_file $VAL_FILE \
    --test_file $TEST_FILE \
    --output_dir $MODEL_OUTPUT_DIR \
    --learning_rate=1e-4 \
    --num_train_epochs=1 \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=32 \
    --do_train \
    --do_predict \
    --max_source_length 512 \
    --max_target_length 56 \
    --label_smoothing 0.1 \
    --adafactor \
    --overwrite_output_dir

There are a couple of configs that exist in finetune.py but are no longer in run_summarization.py. I will also go through all the possible configurations for the two scripts and spot any differences.
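In the meantime, one way to list the Trainer-side defaults that run_summarization.py picks up (I believe it builds them with HfArgumentParser and Seq2SeqTrainingArguments) is a quick sketch like the one below, so they can be compared against the argparse defaults in finetune.py:

    from dataclasses import fields

    from transformers import Seq2SeqTrainingArguments

    # Instantiate with only the required output_dir to see the effective defaults.
    args = Seq2SeqTrainingArguments(output_dir="tmp")
    for f in fields(args):
        print(f"{f.name} = {getattr(args, f.name)!r}")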

At first glance, freeze_embeds could explain some of the speed difference, as the embedding matrix is huge and gradient updates and optimizer steps are computed for it.
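For a rough sense of how much that matrix contributes, you can count the (tied) embedding parameters of your checkpoint with a quick sketch like:

    from transformers import AutoModelForSeq2SeqLM

    model = AutoModelForSeq2SeqLM.from_pretrained("tuner007/pegasus_paraphrase")
    embed_params = model.get_input_embeddings().weight.numel()
    total_params = sum(p.numel() for p in model.parameters())
    print(f"embedding params: {embed_params:,} of {total_params:,} total")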

I agree, and I have always wanted to specify that, but I don't see an option in the new run_summarization.py to freeze it. Are there any hacks that would allow me to do that?

Update: I added the following code from finetune.py to freeze the embedding layers in run_summarization.py:

    from torch import nn


    def freeze_params(model: nn.Module):
        """Set requires_grad=False for each of model.parameters()."""
        for par in model.parameters():
            par.requires_grad = False


    def freeze_embeds(model):
        """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5."""
        model_type = model.config.model_type

        if model_type == "t5":
            freeze_params(model.shared)
            for d in [model.encoder, model.decoder]:
                freeze_params(d.embed_tokens)
        elif model_type == "fsmt":
            for d in [model.model.encoder, model.model.decoder]:
                freeze_params(d.embed_positions)
                freeze_params(d.embed_tokens)
        else:
            # pegasus (and other bart-like models) take this branch
            freeze_params(model.model.shared)
            for d in [model.model.encoder, model.model.decoder]:
                freeze_params(d.embed_positions)
                freeze_params(d.embed_tokens)


    freeze_embeds(model)

However, there was no significant change in the training speed :frowning:.
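For reference, a quick sanity check to confirm the embeddings are really excluded from gradient updates (a minimal sketch; `model` is the one loaded in run_summarization.py):

    # Run right after freeze_embeds(model), before the Trainer is created.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")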