Using BART for a classification task (fine-tuning): input format

I have a custom dataset that is not natural-language text.

I already have a pretrained BART model. For pretraining, the data format for the BART model was like this:

For a single input:

"input_ids": [code1 code2 sep_token_id mask_token_id sep_token_id code5 sep_token_id code10 .... sep_token_id mask_token_id]
"attention_mask": [1 1 1 1 ... 1 1 1 ] # number of 1s = the length of "input_ids"
"decoder_input_ids": [EOS code1 code2 201 code3 201 code4 201 code5  201 code10 ... 201 code250]

Now I want to fine-tune that model for classification.

To use the BartForSequenceClassification model:
I am using BertTokenizer to load the vocab file and then a custom data collator to generate the inputs in the proper format: "input_ids", "attention_mask", "decoder_input_ids", and "labels", as sketched below.
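
Schematically, the collator currently returns batches shaped like this (placeholder field names and values; assuming the sequences in a batch are already the same length):

```python
import torch

# Schematic of what my custom data collator currently returns for classification.
# Field values are placeholders; the real IDs come from my vocab file.
def collate_for_classification(examples):
    return {
        "input_ids": torch.tensor([ex["codes"] for ex in examples]),             # same format as pretraining, no eos_token_id
        "attention_mask": torch.tensor([[1] * len(ex["codes"]) for ex in examples]),
        "decoder_input_ids": torch.tensor([ex["decoder_codes"] for ex in examples]),
        "labels": torch.tensor([ex["label"] for ex in examples]),                # one integer class label per example
    }
```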

I am getting an error on the second of these two lines (from BartForSequenceClassification):

eos_mask = input_ids.eq(self.config.eos_token_id)
sentence_representation = x[eos_mask, :].view(x.size(0), -1, x.size(-1))[:, -1, :]

because my input_ids do not contain any eos_token_id.
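
If it helps, here is a minimal repro of why those lines fail, as I understand it (placeholder IDs and a small hidden size):

```python
import torch

eos_token_id = 2                               # placeholder; my inputs never contain this id
input_ids = torch.tensor([[11, 12, 4, 15]])    # a sequence with no eos_token_id anywhere
x = torch.randn(1, 4, 8)                       # decoder hidden states: (batch, seq_len, hidden)

eos_mask = input_ids.eq(eos_token_id)          # tensor([[False, False, False, False]])
selected = x[eos_mask, :]                      # shape (0, 8): no EOS position to pool from
print(eos_mask.any().item(), selected.shape)   # False torch.Size([0, 8])
# so the subsequent .view(...)[:, -1, :] that builds the sentence representation has nothing to index
```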

Could you please tell me what the proper format of input_ids, decoder_input_ids, and the other inputs is for BartForSequenceClassification?

  • Is the format the same as above?