I have a custom dataset that does not consist of text documents.
I already have a pretrained BART model. During pretraining, the data format for the BART model looked like this:
For a single input:
"input_ids": [code1 code2 sep_token_id mask_token_id sep_token_id code5 sep_token_id code10 .... sep_token_id mask_token_id]
"attention_mask": [1 1 1 1 ... 1 1 1 ] # number of 1s = the length of "input_ids"
"decoder_input_ids": [EOS code1 code2 201 code3 201 code4 201 code5 201 code10 ... 201 code250]
Now I want to fine-tune that model for classification.
To use the BartForSequenceClassification model:
I am using the BertTokenizer to load the vocab file and a custom data collator to generate the inputs in the proper format: "input_ids", "attention_mask", "decoder_input_ids", and "labels".
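For reference, this is roughly what my collator does (a simplified sketch; PAD_ID and the "codes"/"label" keys are placeholders for my actual pipeline, and I build "decoder_input_ids" the same way as during pretraining, omitted here for brevity):

```python
import torch

PAD_ID = 0  # placeholder pad id taken from my vocab file

class CodeClassificationCollator:
    """Pads variable-length code sequences and batches them as tensors."""

    def __call__(self, batch):
        max_len = max(len(ex["codes"]) for ex in batch)
        input_ids, attention_mask, labels = [], [], []
        for ex in batch:
            ids = list(ex["codes"])
            pad_len = max_len - len(ids)
            input_ids.append(ids + [PAD_ID] * pad_len)
            attention_mask.append([1] * len(ids) + [0] * pad_len)
            labels.append(ex["label"])
        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }
```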
I am getting an error on the second of these two lines:
eos_mask = input_ids.eq(self.config.eos_token_id)
sentence_representation = x[eos_mask, :].view(x.size(0), -1, x.size(-1))[:, -1, :]
because my input_ids don't contain any eos_token_id.
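My understanding is that this line pools the decoder hidden state at the position of the eos token, so presumably every sequence would need to end with eos_token_id. Something like the sketch below is what I assume would be needed (the checkpoint path is a placeholder, and I'm not sure this is the intended approach):

```python
from transformers import BartForSequenceClassification

model = BartForSequenceClassification.from_pretrained("my-pretrained-bart")  # placeholder path
eos_id = model.config.eos_token_id  # 2 in a standard BART config

def add_eos(example):
    # Append eos (and keep attention on it) so the classification head
    # can pool at an eos position, assuming that is actually the fix.
    example["input_ids"] = example["input_ids"] + [eos_id]
    example["attention_mask"] = example["attention_mask"] + [1]
    return example
```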
Could you please tell me the proper format of input_ids, decoder_input_ids, and the other inputs for BartForSequenceClassification?
- Is the format the same as above?