Question regarding training of BartForConditionalGeneration

Hello Guys,

I am trying to fine-tune the BART summarization model, but because I don't have a large dataset, I am running into difficulties with the fine-tuning.

Thus, I decided to look at the training process of the BartForConditionalGeneration model in detail. I came across an article titled ‘Introducing BART’ (sorry, only 2 links allowed for new users :neutral_face:) by one of the engineers at HuggingFace, @sshleifer. It says that BartModel was directly fine-tuned for the summarisation task without any new randomly initialized heads.

My question is about this fine-tuning process, especially on the CNN-DailyMail dataset. Do you guys fine-tune the entire BART model, only the decoder, or something else?

I looked at the example fine-tuning script provided on GitHub, but I didn’t find anything related to freezing part of the model.


I also tried to look at the source code of the BartForConditionalGeneration model and observed the following -

It just adds a linear layer on top of the BartModel (copy-pasting the __init__ code here for quick reference).

self.model = BartModel(config)
self.register_buffer("final_logits_bias", torch.zeros((1, self.model.shared.num_embeddings)))
self.lm_head = nn.Linear(config.d_model, self.model.shared.num_embeddings, bias=False)

At first, I thought these were the new parameters being introduced and therefore the ones being trained. So I tried the following code to count the trainable parameters while keeping the encoder and decoder frozen -

from transformers import BartModel, BartForConditionalGeneration, BartTokenizer

def freeze_params(model):
    for par in model.parameters():
        par.requires_grad = False

model_sum = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
freeze_params(model_sum.get_encoder()) ## freeze the encoder
freeze_params(model_sum.get_decoder()) ## freeze the decoder 

model_sum.train() ## set the train mode
train_p = [p for p in model_sum.parameters() if p.requires_grad] ## get the trainable params
print(f'Length of train params in Summarization Model : {len(train_p)}')

But this code shows that the list is empty. One thing I can do is explicitly set requires_grad=True for the parameters in model_sum.lm_head and fine-tune only those. But I am curious to understand the original training/fine-tuning process. It would be a great help if you guys could answer my question.


I answered on GitHub: Question regarding training of BartForConditionalGeneration · Issue #10479 · huggingface/transformers · GitHub
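
Separately from the linked answer, one explanation worth checking for the empty list is weight tying. Assuming the default `tie_word_embeddings=True`, `lm_head.weight` is the same tensor as the shared embedding that the encoder and decoder also use, so freezing those two freezes the head as well, and `final_logits_bias` is a buffer rather than a parameter. The sketch below reproduces that effect in plain PyTorch. It is a toy stand-in, not the real BART code: `ToySeq2Seq`, its layer names, and its sizes are all hypothetical.

```python
import torch
from torch import nn


class ToySeq2Seq(nn.Module):
    """Toy stand-in (hypothetical, not the real BART code) with a tied output layer."""

    def __init__(self, vocab_size=10, d_model=4):
        super().__init__()
        self.shared = nn.Embedding(vocab_size, d_model)
        # Encoder and decoder both reuse the same shared embedding module.
        self.encoder = nn.ModuleDict(
            {"embed_tokens": self.shared, "proj": nn.Linear(d_model, d_model)}
        )
        self.decoder = nn.ModuleDict(
            {"embed_tokens": self.shared, "proj": nn.Linear(d_model, d_model)}
        )
        # A buffer is not a parameter, so it never appears in .parameters().
        self.register_buffer("final_logits_bias", torch.zeros((1, vocab_size)))
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection reuses the embedding matrix.
        self.lm_head.weight = self.shared.weight


def freeze_params(module):
    for par in module.parameters():
        par.requires_grad = False


toy = ToySeq2Seq()
freeze_params(toy.encoder)  # also freezes the shared embedding weight
freeze_params(toy.decoder)

tied = toy.lm_head.weight is toy.shared.weight
trainable = [name for name, p in toy.named_parameters() if p.requires_grad]
print(f"lm_head tied to shared embedding: {tied}")
print(f"trainable parameters: {trainable}")
```

If the same holds for the real model, comparing `model_sum.lm_head.weight` with `model_sum.model.shared.weight` should confirm it. It would also mean that setting requires_grad=True on the `lm_head` parameters unfreezes the shared embedding matrix too, since both names point at one tensor.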
