Hi everybody,

I ran into some issues when trying to fine-tune BART for summarization using the BartForConditionalGeneration model. The issue revolved around properly masking and ignoring the padding tokens during training: without the fix below the loss went down, but the model produced bad summaries. I'm posting the solution here in case anyone else runs into similar problems.
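In the script below, `dataset` stands in for any Datasets object with a train/validation split and `'text'`/`'summary'` columns. As a purely hypothetical example (the file names here are placeholders), such an object could be built from CSV files like this:

```python
from datasets import load_dataset

# Hypothetical CSV files with 'text' and 'summary' columns; any DatasetDict
# with a train/validation split and these two columns works just as well.
dataset = load_dataset('csv', data_files={'train': 'train.csv',
                                          'validation': 'validation.csv'})
```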
```python
import torch

from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import Trainer, TrainingArguments
from transformers.modeling_bart import shift_tokens_right
dataset = ... # some Datasets object with train/validation split and columns 'text' and 'summary'
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
def convert_to_features(example_batch):
    input_encodings = tokenizer.batch_encode_plus(example_batch['text'], pad_to_max_length=True, max_length=1024, truncation=True)
    target_encodings = tokenizer.batch_encode_plus(example_batch['summary'], pad_to_max_length=True, max_length=1024, truncation=True)

    # shift_tokens_right and the boolean masking below expect tensors, not lists
    labels = torch.tensor(target_encodings['input_ids'])
    decoder_input_ids = shift_tokens_right(labels, model.config.pad_token_id)
    # replace the padding token ids in the labels by -100 so they are ignored by the loss
    labels[labels[:, :] == model.config.pad_token_id] = -100

    encodings = {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'decoder_input_ids': decoder_input_ids,
        'labels': labels,
    }

    return encodings
dataset = dataset.map(convert_to_features, batched=True)
columns = ['input_ids', 'labels', 'decoder_input_ids', 'attention_mask']
dataset.set_format(type='torch', columns=columns)
training_args = TrainingArguments(
output_dir='./models/bart-summarizer',
num_train_epochs=1,
per_device_train_batch_size=1,
per_device_eval_batch_size=1,
warmup_steps=500,
weight_decay=0.01,
logging_dir='./logs',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset['train'],
eval_dataset=dataset['validation']
)
```
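This only sets up the Trainer; to actually start fine-tuning, call:

```python
trainer.train()
```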
The `convert_to_features` function makes sure that the decoder inputs are correctly shifted and still include the padding tokens, while in the labels the padding tokens are replaced by -100 so that they are ignored in the model's loss.
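The -100 value works because BartForConditionalGeneration computes its language-modeling loss with PyTorch's CrossEntropyLoss, whose ignore_index defaults to -100. Here is a minimal, standalone sketch (independent of the training script above) showing that target positions set to -100 simply drop out of the loss:

```python
import torch
from torch.nn import CrossEntropyLoss

vocab_size = 10
logits = torch.randn(1, 4, vocab_size)        # (batch, seq_len, vocab_size)
labels = torch.tensor([[5, 2, -100, -100]])   # last two positions are padding

loss_fct = CrossEntropyLoss()                 # ignore_index defaults to -100
loss = loss_fct(logits.view(-1, vocab_size), labels.view(-1))
# Only the two non-masked positions contribute to the loss, so the model
# is never trained to predict padding tokens.
print(loss)
```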