How to train GPT-2 for text summarization?

I have scraped some data in which text paragraphs are followed by a one-line summary, and I am trying to fine-tune GPT-2 on this dataset for text summarization. I followed the text summarization demo available at this link. It works perfectly fine, but it uses a T5 model. So I replaced the T5 model and its tokenizer with the 'gpt2-medium' model and the GPT-2 tokenizer. The data preprocessing code I used is exactly the same as the one given for T5 in the tutorial referenced above, but it does not work and throws errors.
Below is the code I am using:

from transformers import (AutoTokenizer, GPT2LMHeadModel,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)
import evaluate

# define tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium", model_max_length = 512)

# define model
# model = AutoModelForSeq2SeqLM.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained('gpt2-medium', pad_token_id=tokenizer.eos_token_id)

# preprocess input
prefix = "summarize: "
def preprocess_function(examples):
  inputs = [ prefix + doc for doc in examples['abstract'] ]
  model_inputs = tokenizer(inputs, max_length=512)

  labels = tokenizer(text_target=examples['title'], max_length=128)
  model_inputs['labels'] = labels['input_ids']
  
  return model_inputs

tokenized_train = train_ds.map(preprocess_function, batched = True )
tokenized_test = test_ds.map(preprocess_function, batched=True)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
rouge = evaluate.load("rouge")

This works fine. But when I proceed to run the training with the code below:

training_args = Seq2SeqTrainingArguments(
    output_dir = "textSummary",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True, 
    fp16=True, 
    #report_to="wandb",
    #run_name="text_summary_gpt2-medium" 
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

I get the following error:

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

How do I resolve this error? Also, if there is any resource where I can study this in detail, please provide a link to it.

I'm not completely sure, but it looks like you need to add a pad token to the tokenizer, something like:

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium", model_max_length=1024)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

There are probably other changes needed as well to adapt from T5 to GPT-2 (see the sketch below); I'll see if I can take a look at this.
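A minimal sketch of that fix, assuming you keep the gpt2-medium checkpoint from the question. Reusing the EOS token as padding needs no new embeddings; adding a fresh [PAD] token also requires resizing the model's embedding matrix:

from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium", model_max_length=1024)
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Option 1: reuse the EOS token as the pad token (no new embeddings needed)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

# Option 2: add a dedicated [PAD] token, then resize the embedding matrix
# tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# model.resize_token_embeddings(len(tokenizer))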

There's a mention that you could simply add "TL;DR" at the end of the input text to get summaries. I didn't try it, but check it out:

Although trained as an auto-regressive language model, you can make GPT-2 generate summaries by appending “TL;DR” at the end of the input text.
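A rough sketch of that zero-shot approach; the prompt format, generation settings, and example text here are assumptions rather than a fixed recipe:

from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

text = "Your paragraph to summarize goes here."
prompt = text + "\nTL;DR:"

inputs = tokenizer(prompt, return_tensors="pt")
# Generate a short continuation after the TL;DR cue
outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
# Keep only the newly generated tokens (everything after the prompt)
summary = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)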

Please note that GPT-2 is not an encoder-decoder model, so the architecture is possibly not the best one for generating summaries. For the same reason, I think you cannot use the Seq2Seq trainer or data collator for fine-tuning.
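If you do want to fine-tune GPT-2 itself, here is one sketch of a causal-LM setup, assuming the same train_ds dataset with 'abstract' and 'title' columns as in the question; the "TL;DR:" separator and the hyperparameters are arbitrary choices, not requirements. The idea is to concatenate each document and its summary into one sequence and train with the plain Trainer and a language-modeling collator:

from transformers import (AutoTokenizer, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium", model_max_length=512)
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.config.pad_token_id = tokenizer.eos_token_id

def preprocess_function(examples):
    # Join document and summary into one causal-LM training sequence
    texts = [
        doc + "\nTL;DR: " + summ + tokenizer.eos_token
        for doc, summ in zip(examples["abstract"], examples["title"])
    ]
    return tokenizer(texts, max_length=512, truncation=True)

tokenized_train = train_ds.map(preprocess_function, batched=True,
                               remove_columns=train_ds.column_names)

# mlm=False -> the collator builds causal-LM labels from the input ids
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-summarizer",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized_train,
    data_collator=data_collator,
)
trainer.train()

At inference time you would feed only the document plus "TL;DR:" and let the model generate the rest, as in the zero-shot example above.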

Here's one article on fine-tuning the model:

…but for summarization you are probably better off fine-tuning, for example, “Flan-T5” (such as “google/flan-t5-small”) by Google. Good luck :slight_smile:
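To illustrate that suggestion, a minimal sketch with the pipeline API (the input text and generation length are just placeholders):

from transformers import pipeline

# flan-t5 is instruction-tuned, so a plain "summarize:" prompt works directly
summarizer = pipeline("text2text-generation", model="google/flan-t5-small")

text = "Your paragraph to summarize goes here."
result = summarizer("summarize: " + text, max_new_tokens=40)
print(result[0]["generated_text"])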

Hi @brewmaster321, have you taken a look at it? I am facing the same issue: I have added the pad_token and am now getting a CUDA out-of-memory error, even after shrinking my dataset, and my model is not that big either.