Using BART for a classification task (fine-tuning): input format

I have a custom dataset that does not consist of text documents.

I already have a pretrained BART model. For that pretraining, the data fed to the BART model was formatted like this:

For a single input:

```
"input_ids": [code1 code2 sep_token_id mask_token_id sep_token_id code5 sep_token_id code10 ... sep_token_id mask_token_id]
"attention_mask": [1 1 1 1 ... 1 1 1]  # number of 1s = the length of "input_ids"
"decoder_input_ids": [EOS code1 code2 201 code3 201 code4 201 code5 201 code10 ... 201 code250]
```

Now I want to fine-tune that model for classification.

For using the BartForSequenceClassification model:
I am using BertTokenizer to load the vocab file and a custom data collator to generate the inputs in the proper format: "input_ids", "attention_mask", "decoder_input_ids", and "labels".
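
Roughly, the collator does something like this (a simplified sketch, not my exact code; the padding details and the `label` key are just how I happen to store my examples, and the decoder side is omitted):

```python
import torch

class ClassificationDataCollator:
    """Simplified sketch of my custom data collator: pads each example and
    builds the tensors handed to BartForSequenceClassification."""

    def __init__(self, pad_token_id, max_length=512):
        self.pad_token_id = pad_token_id
        self.max_length = max_length

    def __call__(self, features):
        input_ids, attention_mask, labels = [], [], []
        for f in features:
            ids = f["input_ids"][: self.max_length]
            mask = [1] * len(ids)
            pad = self.max_length - len(ids)
            input_ids.append(ids + [self.pad_token_id] * pad)
            attention_mask.append(mask + [0] * pad)
            labels.append(f["label"])
        # (My real collator also builds "decoder_input_ids" following the
        # pretraining format above; omitted here for brevity.)
        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }
```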

I am getting an error on the 2nd line of this snippet (from BartForSequenceClassification.forward):

```python
eos_mask = input_ids.eq(self.config.eos_token_id)
sentence_representation = x[eos_mask, :].view(x.size(0), -1, x.size(-1))[:, -1, :]
```

because my input_ids do not contain any eos_token_id, so eos_mask is all False and the view() call fails.
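
A tiny repro of what I think is happening (the 2 here just stands in for eos_token_id; none of the generated ids match it):

```python
import torch

x = torch.randn(2, 6, 8)                    # (batch, seq_len, hidden_size)
input_ids = torch.randint(10, 100, (2, 6))  # no eos_token_id anywhere
eos_mask = input_ids.eq(2)                  # all False, since 2 never occurs
# RuntimeError: cannot reshape a tensor of 0 elements, the -1 is ambiguous
x[eos_mask, :].view(x.size(0), -1, x.size(-1))[:, -1, :]
```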

Could you please tell me the proper format of input_ids, decoder_input_ids, and the other inputs in the case of BartForSequenceClassification?

  • Is the format the same as above?

@ArthurZ could you please tell me if this is the correct format?

Hey! The eos token can be replaced with anything as long as it serves the purpose. For example, you can use the sep_token or the mask_token to solve the error.
I am not entirely sure what you are doing, and I am not a pro at training BART either, so I would rather point you to the courses we have about training encoder-decoders!
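
For example, something along these lines should get past that error (a minimal sketch, not tested on your setup; the paths and num_labels are placeholders, and it assumes the token you point eos_token_id to actually appears in every sequence):

```python
from transformers import BartForSequenceClassification, BertTokenizer

# Placeholders: point these at your own vocab file and pretrained checkpoint.
tokenizer = BertTokenizer("path/to/vocab.txt")
model = BartForSequenceClassification.from_pretrained("path/to/pretrained-bart", num_labels=2)

# Pool the sentence representation on the sep token (or mask token) instead of
# an eos token that never occurs in your input_ids.
model.config.eos_token_id = tokenizer.sep_token_id
```

Keep in mind that the classification head gathers the hidden state at every position equal to config.eos_token_id and keeps the last one per example, so it is safest to have the same number of those tokens in every sequence, ideally exactly one at the very end.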