Huge difference in speed when finetuning summarization with different scripts

Since transformers 3.x, I have been using examples/seq2seq/finetune.py (huggingface-transformers/finetune.py at v3.5.1 · microsoft/huggingface-transformers · GitHub) to fine-tune the pegasus model, and it's been working fine.

After upgrading to transformers 4.x, that script has been moved to legacy, so I'm thinking of using examples/seq2seq/run_summarization.py (transformers/run_summarization.py at v4.4.2 · huggingface/transformers · GitHub) for the same training.

I moved some pieces around to make the old finetune.py work under transformers 4.x as well, to have a fair comparison. Other than the difference in dataset format, the main thing I noticed was a huge difference in training speed between the two.

For the same training dataset (~6 million data points):

  1. finetune.py: with 4 V100 GPUs, ~6.5 h/epoch.
  2. run_summarization.py: with 4 V100 GPUs, ~15 h/epoch.

From reading the code, finetune.py from transformers 3.x uses pytorch-lightning, while run_summarization.py, I believe, just uses plain PyTorch. Is this the main cause of the speed difference, or are there other significant implementation differences between the two?

I can provide more details if needed.

Thanks!

The two scripts don't have the same defaults at all, so there could be plenty of reasons for the differences in speed. Could you tell us what command lines you use to run them in both cases?
Thanks!

Of course. Thank you for looking into it.

For the tfmr3 finetune.py:

python finetune.py \
    --learning_rate=1e-4 \
    --do_train \
    --do_predict \
    --n_val 1000 \
    --num_train_epochs 1 \
    --val_check_interval 0.25 \
    --max_source_length 512 --max_target_length 56 \
    --freeze_embeds --label_smoothing 0.1 --adafactor --task summarization_xsum \
    --model_name_or_path "tuner007/pegasus_paraphrase" \
    --data_dir {data_dir} \
    --output_dir {output_dir} \
    --gpus 4 \
    --overwrite_output_dir

For the new run_summarization.py:

python tfmr4/run_summarization.py \
    --model_name_or_path "tuner007/pegasus_paraphrase" \
    --cache_dir $CACHE_DIR \
    --train_file $TRAIN_FILE \
    --validation_file $VAL_FILE \
    --test_file $TEST_FILE \
    --output_dir $MODEL_OUTPUT_DIR \
    --learning_rate=1e-4 \
    --num_train_epochs=1 \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=32 \
    --do_train \
    --do_predict \
    --max_source_length 512 \
    --max_target_length 56 \
    --label_smoothing 0.1 \
    --adafactor \
    --overwrite_output_dir

There are a couple of configs that exist in finetune.py but are no longer in run_summarization.py. I will also go through all the possible configurations for the two scripts and spot any differences.
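In the meantime, one way to list the Trainer-side defaults that run_summarization.py picks up (I believe it builds them with HfArgumentParser and Seq2SeqTrainingArguments) is a quick sketch like the one below, so they can be compared against the argparse defaults in finetune.py:

    from dataclasses import fields

    from transformers import Seq2SeqTrainingArguments

    # Instantiate with only the required output_dir to see the effective defaults.
    args = Seq2SeqTrainingArguments(output_dir="tmp")
    for f in fields(args):
        print(f"{f.name} = {getattr(args, f.name)!r}")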

At first glance, freeze_embeds could explain some of the speed difference, as the embedding matrix is huge and gradient updates and optimizer steps are computed for it.
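For a rough sense of how much that matrix contributes, you can count the (tied) embedding parameters of your checkpoint with a quick sketch like:

    from transformers import AutoModelForSeq2SeqLM

    model = AutoModelForSeq2SeqLM.from_pretrained("tuner007/pegasus_paraphrase")
    embed_params = model.get_input_embeddings().weight.numel()
    total_params = sum(p.numel() for p in model.parameters())
    print(f"embedding params: {embed_params:,} of {total_params:,} total")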

I agree, and I have always wanted to specify that, but I don't see an option in the new run_summarization.py to freeze it. Are there any hacks that would allow me to do that?

Update: I added the following code from finetune.py to freeze the embedding layers in run_summarization.py:

    from torch import nn


    def freeze_params(model: nn.Module):
        """Set requires_grad=False for each of model.parameters()."""
        for par in model.parameters():
            par.requires_grad = False


    def freeze_embeds(model):
        """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5."""
        model_type = model.config.model_type

        if model_type == "t5":
            freeze_params(model.shared)
            for d in [model.encoder, model.decoder]:
                freeze_params(d.embed_tokens)
        elif model_type == "fsmt":
            for d in [model.model.encoder, model.model.decoder]:
                freeze_params(d.embed_positions)
                freeze_params(d.embed_tokens)
        else:
            # pegasus (and other bart-like models) take this branch
            freeze_params(model.model.shared)
            for d in [model.model.encoder, model.model.decoder]:
                freeze_params(d.embed_positions)
                freeze_params(d.embed_tokens)


    freeze_embeds(model)

However, there was no significant change in the training speed :frowning:.
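For reference, a quick sanity check to confirm the embeddings are really excluded from gradient updates (a minimal sketch; `model` is the one loaded in run_summarization.py):

    # Run right after freeze_embeds(model), before the Trainer is created.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")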