Fine-tuning Pegasus

DeathTruck · October 8, 2020, 8:31pm

Hi, I’ve been using the Pegasus model over the past 2 weeks and have gotten some very good results. I would like to fine-tune the model further so that its output is better tailored to my use case.

I have some code up and running that uses Trainer. However, when I look at examples, the model does worse after training. In fact, the more the model is trained (i.e., the more epochs), the more repeated strings appear in its output. I’m wondering if my implementation is wrong, or if Trainer is not suitable for fine-tuning Pegasus (‘google/pegasus-xsum’). Am I running into catastrophic forgetting?

My code is not long, I’ve attached it below. I mostly used the tutorial(s) from:

Thanks!!!

import pandas as pd
in_df = pd.read_csv('/content/drive/My Drive/summaries_sample.csv')

# Train Test Split
train_pct = 0.6
test_pct = 0.2

in_df = in_df.sample(len(in_df), random_state=20)
train_sub = int(len(in_df) * train_pct)
test_sub = int(len(in_df) * test_pct) + train_sub

train_df = in_df[0:train_sub]
test_df = in_df[train_sub:test_sub]
val_df = in_df[test_sub:]

train_texts = list(train_df['allTextReprocess'])
test_texts = list(test_df['allTextReprocess'])
val_texts = list(val_df['allTextReprocess'])

train_decode = list(train_df['summaries'])
test_decode = list(test_df['summaries'])
val_decode = list(val_df['summaries'])

import transformers

import torch
min_length = 15
max_length = 40

# Setup model
model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = transformers.PegasusTokenizer.from_pretrained(model_name)

model = transformers.PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
in_text = [in_df['allTextReprocess'].iloc[3]]
batch = tokenizer.prepare_seq2seq_batch(in_text, truncation=True, padding='longest').to(torch_device) 

translated = model.generate(min_length=min_length, max_length=max_length, **batch)
tgt_text0 = tokenizer.batch_decode(translated, skip_special_tokens=True)
print(tgt_text0)

# Tokenize
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

train_labels = tokenizer(train_decode, truncation=True, padding=True)
val_labels = tokenizer(val_decode, truncation=True, padding=True)
test_labels = tokenizer(test_decode, truncation=True, padding=True)

# Setup dataset objects
class Summary_dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])  # torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        # Count examples; len(self.encodings) would count the dict keys instead
        return len(self.labels['input_ids'])

train_dataset = Summary_dataset(train_encodings, train_labels)
val_dataset = Summary_dataset(val_encodings, val_labels)
test_dataset = Summary_dataset(test_encodings, test_labels)

# Training
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1000,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

# Check results
in_text = [in_df['allTextReprocess'].iloc[3]]
batch = tokenizer.prepare_seq2seq_batch(in_text, truncation=True, padding='longest').to(torch_device) 

translated = model.generate(min_length=min_length, max_length=max_length, **batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
print(tgt_text)

Any help would be awesome, thanks!
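
One thing that may be worth checking in the script above: the labels are padded with the tokenizer's pad token, and depending on the installed transformers version those padding positions may be counted in the loss. A common remedy in recent versions is to replace pad ids in the labels with -100, the default ignore_index of torch.nn.CrossEntropyLoss. The sketch below is only an illustration: mask_padding is a hypothetical helper and the pad id of 0 is a toy value, so substitute tokenizer.pad_token_id. Older releases may build decoder inputs from the labels in a way that does not accept -100, so treat this as an assumption to verify rather than a confirmed fix.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids, pad_token_id):
    """Return a copy of label_ids with every pad token replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Toy batch with a hypothetical pad id of 0. In the script above this would be
# train_labels['input_ids'] and tokenizer.pad_token_id.
padded = [[57, 12, 1, 0, 0], [8, 99, 42, 7, 1]]
masked = mask_padding(padded, pad_token_id=0)
print(masked)  # [[57, 12, 1, -100, -100], [8, 99, 42, 7, 1]]
```

The masked lists could then be stored back before building the dataset, for example train_labels['input_ids'] = mask_padding(train_labels['input_ids'], tokenizer.pad_token_id).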

cantino · October 13, 2020, 11:33pm

I also want to finetune Pegasus. Thank you for sharing your code! How similar is this to what happens in https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune_pegasus_xsum.sh ?

valhalla · October 14, 2020, 9:25am

Could you try this with the examples/seq2seq scripts?
We have also recently added Trainer support for seq2seq tasks.
See
https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune_trainer.py
and
https://github.com/huggingface/transformers/tree/master/examples/seq2seq/builtin_trainer

DeathTruck · October 27, 2020, 3:51pm

Thanks for the response, and sorry for my delayed reply. Using Trainer is the big difference; it abstracts a lot of the code away. It seems to be working now, though, so that’s good.

DeathTruck · October 27, 2020, 3:53pm

Thank you, and sorry for my delayed reply. My code seems to work; I think there were some bad examples in my input sequences for training, and removing those helped. After fine-tuning, I was able to get rid of a lot of cases where the model would give repeating text and randomly output something about the BBC.
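
For anyone who wants to put a number on the repeating text before and after fine-tuning, here is a rough sketch. It is not part of the original code: repeated_ngram_fraction is a hypothetical helper, the choice of word trigrams is arbitrary, and the two sample strings are made up for illustration.

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams in text that duplicate an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words cannot repeat an n-gram
    return 1.0 - len(set(ngrams)) / len(ngrams)


# Made-up strings: one normal summary, one that loops on the same phrase.
clean = "the council approved the new budget on monday"
looping = "the BBC has learned the BBC has learned the BBC has learned"

print(repeated_ngram_fraction(clean))    # no repeated trigrams
print(repeated_ngram_fraction(looping))  # most trigrams are repeats
```

Running this over the decoded outputs (tgt_text0 and tgt_text in the script above) for a handful of inputs gives a quick way to compare checkpoints.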

I’m not sure if it is necessary, but do you know if there is a way to freeze layers using Trainer?

cantino · October 27, 2020, 5:16pm

Thanks @DeathTruck. Would you be open to sharing your working Trainer code that I could use as a starting place, or is that the code you’ve already shared?

DeathTruck · October 27, 2020, 5:56pm

Yeah, so the code I pasted here should work. My problem initially was that I was feeding it some bad examples, which I believe was causing the problems. My best results have come with about 1000 training samples and 1000 epochs and lr=5E-5.

Let me know if you encounter any problems with the code.

valhalla · October 29, 2020, 5:19pm

The finetune_trainer script lets you freeze the embedding layer and the encoder using the --freeze_embeds and --freeze_encoder arguments.

DeathTruck · November 5, 2020, 8:52pm

Ok, thank you! I didn’t want to completely throw out my code that I posted here, but I wound up using the freezing code in examples/seq2seq/utils.py to freeze either the embedding or encoder layers, before passing the model to Trainer. Seems to work. Thanks!
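
A minimal sketch of that approach, assuming the freezing boils down to setting requires_grad to False on the relevant parameters. The helper names below are illustrative and are not a copy of examples/seq2seq/utils.py; the commented usage assumes `model` is the PegasusForConditionalGeneration instance from the first post.

```python
def freeze_params(module):
    """Turn off gradient updates for every parameter of a torch.nn.Module."""
    for param in module.parameters():
        param.requires_grad = False


def count_trainable(module):
    """Number of parameter tensors that will still receive gradients."""
    return sum(1 for param in module.parameters() if param.requires_grad)


# Intended use, before constructing the Trainer:
#
#   freeze_params(model.get_encoder())           # freeze the whole encoder
#   freeze_params(model.get_input_embeddings())  # or only the token embeddings
#   print(count_trainable(model))                # sanity check
```

Since Trainer takes the model object as-is, freezing it beforehand should carry through to training; checking count_trainable before and after is a cheap way to confirm.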

agenius5 · January 4, 2021, 1:29pm

@DeathTruck Hi, can you please share your code and tell me how you froze the encoder layers? I am trying to do the same but can’t figure it out.

valhalla · January 6, 2021, 8:48am

Hi @agenius5

You can pass the --freeze_encoder flag to the finetune_trainer.py script to freeze all encoder layers.

pikaduck · March 23, 2021, 8:18am

Hi @DeathTruck
I’m trying to finetune PEGASUS on big_patent and I could really use some help. Could you share your code so that I can get an idea of how I could go about doing that?

pikaduck · April 7, 2021, 12:57pm

@valhalla @DeathTruck I tried using the code pasted in the question:

tokenizer = transformers.PegasusTokenizer.from_pretrained(model_name)
train_desc = list(train_df['description'])
train_encodings = tokenizer(train_desc, truncation = True, padding = True)

But the last line, where I call the tokenizer, gives me the error "'NoneType' object is not callable" when I run it in a Colab notebook. On my local Jupyter notebook it doesn’t throw any error.

Please help me out here

valhalla · April 13, 2021, 11:14am

Hi @pikaduck

PegasusTokenizer needs sentencepiece to be installed. So make sure to pip install sentencepiece and restart the notebook/colab (if you are using that) and then call the tokenizer.
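
A quick way to check this from inside the notebook, sketched with the standard library only. missing_packages is a hypothetical helper, and the package list is just the three names relevant to this thread.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["sentencepiece", "transformers", "torch"])
if missing:
    print("Not importable here:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If sentencepiece was installed after transformers had already been imported, the runtime still needs a restart before the tokenizer class becomes usable.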

pikaduck · April 14, 2021, 10:15am

I do have sentencepiece installed

valhalla · April 14, 2021, 2:01pm

Hmm, since it’s working locally but not on Colab, I guess it was missing there, and Colab needs to be restarted after installing sentencepiece.

DeathTruck · April 14, 2021, 3:25pm

@pikaduck Sorry for my late replies.

Lately, I’ve been installing specific versions of the packages:
!pip install torch=="1.7.1"
and
!pip install transformers=="3.4.0"
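
To compare what is actually installed on Colab versus a local machine, something like the following may help. It is a sketch using importlib.metadata (Python 3.8+); installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the packages discussed in this thread.
for name in ("torch", "transformers", "sentencepiece"):
    print(name, installed_version(name) or "not installed")
```

Running it in both environments makes a version mismatch easy to spot.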

DeathTruck · April 14, 2021, 3:29pm

@pikaduck Also, for freezing the layers with Trainer, I used torch. Mostly, copying and pasting certain parts of the finetune_trainer.py code that @valhalla mentioned. Let me know, and I can show you my code.

pikaduck · April 20, 2021, 4:13am

Yeah, I did realize it was a version problem. Fixed it!

pikaduck · April 20, 2021, 4:16am

Hey, if you could show the code you used to freeze the layers with Trainer, it’d be super awesome. Also, I’m running into memory problems on Google Colab while training without freezing any layers. Do you think freezing some layers could fix that?
