Train Bart for Conditional Generation (e.g. Summarization)

Hi everybody

I ran into some issues when trying to fine-tune BART for summarization using the BartForConditionalGeneration model. The issue revolved around properly masking and ignoring the padding tokens when training. Without the following fix the loss went down, but the model produced bad summaries. I'm posting the solution here in case anyone else runs into similar problems.

from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import Trainer, TrainingArguments
from transformers.modeling_bart import shift_tokens_right

dataset = ... # some Datasets object with train/validation split and columns 'text' and 'summary'
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')

def convert_to_features(example_batch):
    # tokenize the source documents and the target summaries
    input_encodings = tokenizer.batch_encode_plus(example_batch['text'], pad_to_max_length=True, max_length=1024, truncation=True)
    target_encodings = tokenizer.batch_encode_plus(example_batch['summary'], pad_to_max_length=True, max_length=1024, truncation=True)
    
    labels = target_encodings['input_ids']
    # the decoder inputs are the labels shifted one position to the right, with the padding tokens kept
    decoder_input_ids = shift_tokens_right(labels, model.config.pad_token_id)
    # replace padding token ids in the labels by -100 so they are ignored by the loss
    labels[labels[:, :] == model.config.pad_token_id] = -100
    
    encodings = {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'decoder_input_ids': decoder_input_ids,
        'labels': labels,
    }

    return encodings

dataset = dataset.map(convert_to_features, batched=True)
columns = ['input_ids', 'labels', 'decoder_input_ids', 'attention_mask']
dataset.set_format(type='torch', columns=columns)

training_args = TrainingArguments(
    output_dir='./models/bart-summarizer',  # output directory for checkpoints
    num_train_epochs=1,                     # total number of training epochs
    per_device_train_batch_size=1,          # batch size per device during training
    per_device_eval_batch_size=1,           # batch size per device for evaluation
    warmup_steps=500,                       # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,                      # strength of weight decay
    logging_dir='./logs',                   # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset['train'],      # training dataset
    eval_dataset=dataset['validation'],  # evaluation dataset
)

trainer.train()

The convert_to_features function makes sure that the decoder inputs are correctly shifted and still contain the padding tokens, while in the labels the padding tokens are replaced by -100 so that they are ignored in the model loss.
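As a side note, the reason the -100 value works is that the model computes its loss with a standard cross-entropy whose ignore_index defaults to -100, so the masked positions contribute nothing. A toy check (values chosen arbitrarily) illustrates this:

import torch
from torch.nn import CrossEntropyLoss

# fake logits for 1 sequence of length 3 over a vocabulary of 5 tokens
logits = torch.randn(1, 3, 5)
labels = torch.tensor([[2, 4, -100]])  # last position is padding, masked with -100

# CrossEntropyLoss ignores the -100 labels by default (ignore_index=-100)
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))

# same value as computing the loss only over the two non-masked positions
loss_unmasked = loss_fct(logits[:, :2, :].reshape(-1, 5), labels[:, :2].reshape(-1))
assert torch.isclose(loss, loss_unmasked)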


Hi,
I tried to use your fix, but I'm wondering whether it is still up-to-date:
transformers.modeling_bart doesn't exist anymore. There is only transformers.models.bart.modeling_bart, but its shift_tokens_right function requires a torch.Tensor object, while Hugging Face's datasets object only consists of lists (plus it needs an additional decoder_start_token_id argument). This also leads to an error in labels[labels[:, :] == model.config.pad_token_id] = -100, because that is NumPy syntax.
Also, batch_encode_plus is deprecated.
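For reference, a rough sketch of how the newer shift_tokens_right could be called (this assumes a transformers 4.x install; the exact signature may differ between versions, and the token ids below are just BART's usual defaults, not values taken from the original post):

import torch
from transformers.models.bart.modeling_bart import shift_tokens_right

# toy batch of already tokenized summaries, 1 = <pad> for facebook/bart-large
labels = torch.tensor([[0, 8471, 318, 2, 1, 1]])

# the newer function works on tensors and takes the decoder start token id explicitly
decoder_input_ids = shift_tokens_right(
    labels,
    pad_token_id=1,             # model.config.pad_token_id for facebook/bart-large
    decoder_start_token_id=2,   # model.config.decoder_start_token_id for facebook/bart-large
)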

It would be nice to have a statement from Hugging Face on whether this is still necessary, and if it isn't, could it please be removed from the BART page?

Hi @Jeremias

In recent versions all models live under their own directory, so bart is now under transformers.models.bart.

huggingface’s datasets object only consists of lists

datasets can return any type (list, NumPy array, torch tensor, tf tensor). By default it returns lists; you need to explicitly set the format for it to return tensors. This is explained in the datasets intro colab.
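For illustration, a small sketch of how the return type changes once the format is set (the column name and values here are just an example):

from datasets import Dataset

ds = Dataset.from_dict({'input_ids': [[0, 8, 9, 2]]})
print(type(ds[0]['input_ids']))   # <class 'list'> by default

ds.set_format(type='torch', columns=['input_ids'])
print(type(ds[0]['input_ids']))   # <class 'torch.Tensor'> after setting the format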

Also, you won't need to manually call shift_tokens_right to prepare the decoder_input_ids; if you just pass labels, the model will prepare the decoder_input_ids itself by correctly shifting them.
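To make that concrete, here's a rough sketch of how the preprocessing could look on a current transformers version, passing only labels and letting the model build the decoder_input_ids internally. The dataset object and its 'text'/'summary' columns are the ones assumed in the original post, the max_length values are just examples, and the -100 masking is done with a list comprehension because datasets hands the map function plain lists:

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')

def convert_to_features(example_batch):
    # tokenizer(...) replaces the deprecated batch_encode_plus
    input_encodings = tokenizer(example_batch['text'], padding='max_length', max_length=1024, truncation=True)
    target_encodings = tokenizer(example_batch['summary'], padding='max_length', max_length=256, truncation=True)

    # replace padding token ids in the labels by -100 so they are ignored by the loss;
    # no decoder_input_ids are needed, the model shifts the labels internally
    labels = [
        [(token if token != tokenizer.pad_token_id else -100) for token in label]
        for label in target_encodings['input_ids']
    ]

    return {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': labels,
    }

dataset = dataset.map(convert_to_features, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])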

We have BART training examples in examples/seq2seq here, which should help you fine-tune BART.

Hope this helps.