I have scraped some data in which each example is a text paragraph followed by a one-line summary. I am trying to fine-tune GPT-2 on this dataset for text summarization. I followed the text summarization demo available at link - It works perfectly fine, but it uses a T5 model. So I replaced the T5 model and its tokenizer with the 'GPT-2 medium' model and the GPT-2 tokenizer. The data preprocessing code I used is exactly the same as the one given for T5 in the tutorial referenced above. However, it does not work and throws errors.
Below is the code I am using:
# imports
from transformers import AutoTokenizer, GPT2LMHeadModel, DataCollatorForSeq2Seq
import evaluate

# define tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium", model_max_length=512)

# define model
# model = AutoModelForSeq2SeqLM.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium", pad_token_id=tokenizer.eos_token_id)

# preprocess input: prepend the task prefix, tokenize abstracts (inputs) and titles (labels)
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["abstract"]]
    model_inputs = tokenizer(inputs, max_length=512)
    labels = tokenizer(text_target=examples["title"], max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = train_ds.map(preprocess_function, batched=True)
tokenized_test = test_ds.map(preprocess_function, batched=True)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
rouge = evaluate.load("rouge")
This works fine. But when I proceed to run the training with the code below:
training_args = Seq2SeqTrainingArguments(
    output_dir="textSummary",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    # report_to="wandb",
    # run_name="text_summary_gpt2-medium",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
I get the following error:
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
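From the error text, it looks like I am expected to give the GPT-2 tokenizer a pad token before any padding happens. The sketch below shows the two options the message itself suggests; the `model.resize_token_embeddings` call is my own guess for the case where a brand-new [PAD] token is added, and I have not verified that either option actually trains correctly in my setup:

# What the error message seems to suggest (untested for my case):
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium", model_max_length=512)

# Option 1 (from the error message): reuse the EOS token as the pad token
tokenizer.pad_token = tokenizer.eos_token

# Option 2 (from the error message): add a dedicated [PAD] token instead
# tokenizer.add_special_tokens({"pad_token": "[PAD]"})
# model.resize_token_embeddings(len(tokenizer))  # my guess: needed if a new token is added

model = GPT2LMHeadModel.from_pretrained("gpt2-medium", pad_token_id=tokenizer.pad_token_id)

I am not sure whether simply reusing the EOS token as padding is the right fix here, or whether something more is needed in the preprocessing.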
How do I resolve this error? Also, if there is any resource where I can study this in detail, please share a link to it.