Train Bart for Conditional Generation (e.g. Summarization)

Hi everybody

I ran into some issues when trying to fine-tune BART for summarization using the BartForConditionalGeneration model. The problem revolved around properly masking and ignoring the padding tokens during training: without the fix below the loss went down, but the model produced bad summaries. I'm posting the solution here in case anyone else runs into similar problems.

from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import Trainer, TrainingArguments
from transformers.modeling_bart import shift_tokens_right

dataset = ... # some Datasets object with train/validation split and columns 'text' and 'summary'
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')

def convert_to_features(example_batch):
    input_encodings = tokenizer.batch_encode_plus(example_batch['text'], pad_to_max_length=True, max_length=1024, truncation=True)
    target_encodings = tokenizer.batch_encode_plus(example_batch['summary'], pad_to_max_length=True, max_length=1024, truncation=True)
    
    labels = target_encodings['input_ids']
    decoder_input_ids = shift_tokens_right(labels, model.config.pad_token_id)
    labels[labels[:, :] == model.config.pad_token_id] = -100
    
    encodings = {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'decoder_input_ids': decoder_input_ids,
        'labels': labels,
    }

    return encodings

dataset = dataset.map(convert_to_features, batched=True)
columns = ['input_ids', 'labels', 'decoder_input_ids','attention_mask',] 
dataset.set_format(type='torch', columns=columns)

training_args = TrainingArguments(
    output_dir='./models/bart-summarizer',          
    num_train_epochs=1,           
    per_device_train_batch_size=1, 
    per_device_eval_batch_size=1,   
    warmup_steps=500,               
    weight_decay=0.01,              
    logging_dir='./logs',          
)

trainer = Trainer(
    model=model,                       
    args=training_args,                  
    train_dataset=dataset['train'],        
    eval_dataset=dataset['validation']   
)

The convert_to_features function makes sure that the decoder inputs are correctly shifted and still include the padding tokens, while in the labels the padding tokens are replaced by -100 so that they are ignored in the model's loss.
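To make the shifting and masking concrete, here is a toy illustration in plain PyTorch (independent of the library helper; the token ids are made up, and the decoder start token is assumed to be BART's </s>, id 2):

import torch

pad_token_id = 1              # BART's <pad>
decoder_start_token_id = 2    # BART starts decoding from </s>

# a toy padded batch of target (summary) token ids
labels = torch.tensor([[0, 8774, 232, 2, 1, 1],
                       [0, 9178, 2, 1, 1, 1]])

# decoder inputs: the targets shifted one position to the right,
# keeping the padding tokens so the tensor stays rectangular
decoder_input_ids = torch.full_like(labels, pad_token_id)
decoder_input_ids[:, 0] = decoder_start_token_id
decoder_input_ids[:, 1:] = labels[:, :-1]

# labels: padding replaced by -100, the index the cross-entropy loss ignores
labels = labels.masked_fill(labels == pad_token_id, -100)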


Hi,
I tried to use your fix, but I'm wondering whether it is still up-to-date:
transformers.modeling_bart doesn't exist anymore. There is only transformers.models.bart.modeling_bart, but its shift_tokens_right function requires a torch.Tensor object, while Hugging Face's datasets objects consist of lists by default (plus it needs an additional decoder_start_token_id argument). This also leads to an error in labels[labels[:, :] == model.config.pad_token_id] = -100, because this is numpy syntax.
Also, batch_encode_plus is deprecated.
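For reference, here is roughly what an adapted version of the preprocessing might look like with the newer API (just a sketch based on my reading of the current source; tokenizer and model are the ones from the first post, and the exact signatures may differ between versions):

import torch
from transformers.models.bart.modeling_bart import shift_tokens_right

def convert_to_features(example_batch):
    input_encodings = tokenizer(example_batch['text'], padding='max_length',
                                max_length=1024, truncation=True)
    target_encodings = tokenizer(example_batch['summary'], padding='max_length',
                                 max_length=1024, truncation=True)

    # the newer shift_tokens_right expects a tensor plus the decoder start token id
    labels = torch.tensor(target_encodings['input_ids'])
    decoder_input_ids = shift_tokens_right(labels, model.config.pad_token_id,
                                           model.config.decoder_start_token_id)
    labels[labels == model.config.pad_token_id] = -100

    return {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'decoder_input_ids': decoder_input_ids,
        'labels': labels,
    }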

It would be nice if there was a statement from Hugging Face on whether this is still necessary and, if not, could it please be removed from the BART page?

Hi @Jeremias

In recent versions all models live under their own directory, so BART is now in transformers.models.bart.

huggingface’s datasets object only consists of lists

datasets can return any type (list, numpy array, torch tensor, tf tensor); by default it returns lists, and you need to explicitly set the format for it to return tensors. This is explained in the datasets intro colab.
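Something like this (a minimal sketch, reusing the column names from the first post):

# ask datasets to hand back PyTorch tensors for these columns
dataset.set_format(type='torch',
                   columns=['input_ids', 'attention_mask', 'labels'])
print(type(dataset['train'][0]['input_ids']))  # torch.Tensor instead of list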

also, you won’t need to manually call shift_tokens_right to prepare decoder_input_ids, if you just pass labels the model will prepare the decoder_input_ids by correctly shifting them.

We have BART training examples in examples/seq2seq here, which should help you fine-tune BART.

Hope this helps.

HI @valhalla ,

I am searching for examples to train BART and am unable to access the link mentioned here. Could you please share the latest one?

Thanks


As per the documentation, the labels field is only for "Labels for computing the masked language modeling loss." Not specifying labels means no loss values are returned, so how do I compute the loss if I want to train only for text summarization?

I am passing the paragraph as the encoder input_ids and the summary as the decoder_input_ids, and I am also adding the attention masks for both.

Hi,
ellipsis’ object has no attribute ‘map’,
what to do in this case?
best,

You can specify the index of the padding labels as ignore_index when initializing the loss function, so those positions are skipped.
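For example, something along these lines (a sketch: input_ids, attention_mask, decoder_input_ids and target_ids are placeholders for your own batch tensors):

import torch.nn as nn

# skip padding positions when averaging the loss
loss_fct = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

outputs = model(input_ids=input_ids,
                attention_mask=attention_mask,
                decoder_input_ids=decoder_input_ids)
logits = outputs.logits  # (batch, seq_len, vocab_size)

# target_ids: the (unshifted) summary token ids, still containing pad tokens
loss = loss_fct(logits.view(-1, model.config.vocab_size),
                target_ids.view(-1))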

CC’ing @lvwerra @Jeremias
Just checking if anyone can post an example of the dataset.
The above example says:
dataset = … # some Datasets object with train/validation split and columns ‘text’ and ‘summary’

so do we have to provide dataset = {'text': [list of all the texts, ...], 'summary': [list of all the corresponding summaries]}?

Thanks,
-V

You could, for example, use the CNN/DailyMail dataset: cnn_dailymail · Datasets at Hugging Face

from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")

Note that the columns in this dataset are called "article" and "highlights".
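If you want to reuse the preprocessing code from above unchanged, one option is to rename the columns first (a small sketch):

# map the CNN/DailyMail column names onto the ones used in the snippet above
dataset = dataset.rename_column("article", "text")
dataset = dataset.rename_column("highlights", "summary")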

Hi @lvwerra, can you remember how bad it was without the preprocessing you posted? For example, how many points of ROUGE-1 does it drop?

I cannot, sorry :slight_smile: Maybe 5-10% but I am just guessing.


Thanks for sharing :laughing:

The shared example doesn’t work anymore (not because of typos). I think this adapted code snippet can do the job (but review params before use carefully!)

from transformers import AutoTokenizer, BartForConditionalGeneration
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForSeq2Seq

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn", attention_dropout=0.1)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

def get_features(batch):
    input_encodings = tokenizer(batch["text"], max_length=1024, truncation=True)
    
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(batch["summary"], max_length=256, truncation=True)
        
    return {"input_ids": input_encodings["input_ids"], 
           "attention_mask": input_encodings["attention_mask"], 
           "labels": target_encodings["input_ids"]}

dataset_ftrs = dataset.map(get_features, batched=True)
columns = ['input_ids', 'attention_mask', 'labels']
dataset_ftrs.set_format(type='torch', columns=columns)

model.config.output_attentions = True
model.config.output_hidden_states = True

training_args = TrainingArguments(
    output_dir='./models/bart-summarizer',
    num_train_epochs=1,
    warmup_steps=500,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    push_to_hub=False,
    evaluation_strategy='steps',
    eval_steps=500,
    save_steps=1_000_000,
    gradient_accumulation_steps=16,
)

trainer = Trainer(
    model=model, 
    args=training_args, 
    tokenizer=tokenizer,                  
    data_collator=seq2seq_data_collator,                  
    train_dataset=dataset_ftrs["train"],                  
    eval_dataset=dataset_ftrs["test"],
)

trainer.train()
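After training, you could sanity-check the result with a quick generation call (a sketch; the text and generation parameters are just placeholders):

text = "some long article to summarize ..."
inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt").to(model.device)

summary_ids = model.generate(inputs["input_ids"],
                             attention_mask=inputs["attention_mask"],
                             num_beams=4,
                             max_length=142)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))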

Hi

I tried to use the above code but I am facing the error below. Can someone advise?

This part of the code is raising the error:

dataset_ftrs = dataset.map(get_features, batched=True)

Error details:

Map:   0% | 0/287113 [00:00<?, ? examples/s]

KeyError                                  Traceback (most recent call last)
in <cell line: 1>()
----> 1 dataset_ftrs = dataset.map(get_features, batched=True)

8 frames
/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in __getitem__(self, key)
    268
    269     def __getitem__(self, key):
--> 270         value = self.data[key]
    271         if key in self.keys_to_format:
    272             value = self.format(key)

KeyError: 'text'

Hello! In the code above, it seems that you did not apply padding to the source and target. What difference would it make compared to the results when padding is used?