How to train GPT-2 for text summarization?

I have scraped some data in which text paragraphs are followed by a one-line summary, and I am trying to fine-tune GPT-2 on this dataset for text summarization. I followed the text summarization demo available at this link. It works perfectly fine, but it uses a T5 model. So I replaced the T5 model and its tokenizer with the 'gpt2-medium' model and the GPT-2 tokenizer. The data preprocessing code I used is exactly the same as the one given for T5 in the tutorial referenced above, but it does not work and throws errors.
Below is the code I am using:

from transformers import (AutoTokenizer, GPT2LMHeadModel,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)
import evaluate

# define tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium", model_max_length = 512)

# define model
# model = AutoModelForSeq2SeqLM.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained('gpt2-medium', pad_token_id=tokenizer.eos_token_id)

# preprocess input
prefix = "summarize: "
def preprocess_function(examples):
  inputs = [ prefix + doc for doc in examples['abstract'] ]
  model_inputs = tokenizer(inputs, max_length=512)

  labels = tokenizer(text_target=examples['title'], max_length=128)
  model_inputs['labels'] = labels['input_ids']
  
  return model_inputs

tokenized_train = train_ds.map(preprocess_function, batched = True )
tokenized_test = test_ds.map(preprocess_function, batched=True)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
rouge = evaluate.load("rouge")

This works fine. But when I proceed to run the training with the code below:

training_args = Seq2SeqTrainingArguments(
    output_dir = "textSummary",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True, 
    fp16=True, 
    #report_to="wandb",
    #run_name="text_summary_gpt2-medium" 
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

I get the following error:

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

How do I resolve this error? Also, if there is any resource where I can study this in detail, please provide a link to it.

I'm not completely sure, but it looks like you need to add a pad token to the tokenizer, something like:

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium", model_max_length=1024)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

There are probably other changes needed as well to adapt from T5 to GPT-2 (see the sketch below); I'll see if I can take a look at this.
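A minimal sketch of that fix, assuming you keep the gpt2-medium checkpoint from the question. Reusing the EOS token as padding needs no new embeddings; adding a fresh [PAD] token also requires resizing the model's embedding matrix:

from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium", model_max_length=1024)
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Option 1: reuse the EOS token as the pad token (no new embeddings needed)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

# Option 2: add a dedicated [PAD] token, then resize the embedding matrix
# tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# model.resize_token_embeddings(len(tokenizer))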

There's a mention that you could simply add "TL;DR" at the end of the input text to get summaries. I didn't try it, but check it out:

Although trained as an auto-regressive language model, you can make GPT-2 generate summaries by appending “TL;DR” at the end of the input text.
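A rough sketch of that zero-shot approach; the prompt format, generation settings, and example text here are assumptions rather than a fixed recipe:

from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

text = "Your paragraph to summarize goes here."
prompt = text + "\nTL;DR:"

inputs = tokenizer(prompt, return_tensors="pt")
# Generate a short continuation after the TL;DR cue
outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
# Keep only the newly generated tokens (everything after the prompt)
summary = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)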

Please note that GPT-2 is not an encoder-decoder model, so the architecture is possibly not the best one for generating summaries. For the same reason, I think you cannot use the Seq2Seq trainer or data collator for fine-tuning.
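If you do want to fine-tune GPT-2 itself, here is one sketch of a causal-LM setup, assuming the same train_ds dataset with 'abstract' and 'title' columns as in the question; the "TL;DR:" separator and the hyperparameters are arbitrary choices, not requirements. The idea is to concatenate each document and its summary into one sequence and train with the plain Trainer and a language-modeling collator:

from transformers import (AutoTokenizer, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium", model_max_length=512)
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.config.pad_token_id = tokenizer.eos_token_id

def preprocess_function(examples):
    # Join document and summary into one causal-LM training sequence
    texts = [
        doc + "\nTL;DR: " + summ + tokenizer.eos_token
        for doc, summ in zip(examples["abstract"], examples["title"])
    ]
    return tokenizer(texts, max_length=512, truncation=True)

tokenized_train = train_ds.map(preprocess_function, batched=True,
                               remove_columns=train_ds.column_names)

# mlm=False -> the collator builds causal-LM labels from the input ids
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-summarizer",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized_train,
    data_collator=data_collator,
)
trainer.train()

At inference time you would feed only the document plus "TL;DR:" and let the model generate the rest, as in the zero-shot example above.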

Here's one article on fine-tuning the model:

…but for summarization you are probably better off fine-tuning, for example, “Flan-T5” (such as “google/flan-t5-small”) by Google. Good luck :slight_smile:
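To illustrate that suggestion, a minimal sketch with the pipeline API (the input text and generation length are just placeholders):

from transformers import pipeline

# flan-t5 is instruction-tuned, so a plain "summarize:" prompt works directly
summarizer = pipeline("text2text-generation", model="google/flan-t5-small")

text = "Your paragraph to summarize goes here."
result = summarizer("summarize: " + text, max_new_tokens=40)
print(result[0]["generated_text"])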

Hi @brewmaster321, have you taken a look at it? I am facing the same issue: I have added the pad_token and am now getting a CUDA out-of-memory error, even after shrinking my dataset, and my model is not that big either.