Hello friends! I’m running into a curious issue. I trained a tokenizer on my own corpus, and the model I’m training with it isn’t doing particularly well – I suspect because I’m fine-tuning a pre-trained model, so my custom tokenizer doesn’t match the vocabulary the model was pre-trained with.
So I’ve moved back to a pretrained tokenizer, and I get the following error:
AssertionError: There should be exactly three separator tokens: 2 in every sample for questions answering. You might also consider to set 'global_attention_mask' manually in the forward function to avoid this error.
The model trains fine with the tokenizer I generated, but when I use LongformerTokenizerFast.from_pretrained, I get the above error.
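As a sanity check, the pretrained tokenizer on its own does seem to produce exactly three separators when I encode a plain question/context pair (a minimal sketch with made-up strings, using the standard allenai checkpoint as a stand-in for mine):

from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

# Plain pair encoding, letting the tokenizer insert its own special tokens:
# <s> question </s></s> context </s>
enc = tokenizer("Who wrote the book?", "The book was written by Jane.")
print(enc["input_ids"].count(tokenizer.sep_token_id))  # 3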
I’m encoding with:
def convert_to_features(self, item):
    if self.print_text:
        print(f"Input Text: {item['context']}")

    encodings = self.tokenizer("<s>" + item["question"] + "</s>",
                               "</s>" + item["context"],
                               truncation="only_second",
                               max_length=self.max_len,
                               padding="max_length",
                               return_offsets_mapping=True,
                               return_tensors="pt")
    return encodings
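Since add_special_tokens defaults to True, the tokenizer also inserts its own <s>/</s> around the ones I’m splicing into the strings above, so I’ve been decoding a dummy call to see where the separators actually land (a sketch with placeholder strings and the standard allenai checkpoint):

from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

# Same construction as convert_to_features, but with dummy strings and a small max_length.
enc = tokenizer("<s>" + "dummy question" + "</s>",
                "</s>" + "dummy context",
                truncation="only_second",
                max_length=32,
                padding="max_length")

# Decode to see the full layout of <s>/</s>, then count the separator tokens.
print(tokenizer.decode(enc["input_ids"]))
print(enc["input_ids"].count(tokenizer.sep_token_id))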
My dataset’s __getitem__ is:
def __getitem__(self, idx):
    if torch.is_tensor(idx):
        idx = idx.tolist()

    encodings = self.convert_to_features(self.qa_data.iloc[idx])
    offset_mapping = encodings["offset_mapping"]
    if self.qa_data['id'].iloc[idx] not in self.offset_mapping.keys():
        self.offset_mapping[self.qa_data['id'].iloc[idx]] = offset_mapping

    input_ids = encodings["input_ids"]
    seps = self.tokenizer.sep_token_id * torch.ones_like(input_ids[:, -1])
    input_ids[:, -1] = seps
    attention_mask = encodings["attention_mask"]

    sample = {"input_ids": input_ids.squeeze(),
              "attention_mask": attention_mask.squeeze(),
              "start_positions": self.qa_data['start_idx'].iloc[idx],
              "end_positions": self.qa_data['end_idx'].iloc[idx],
              "question": self.qa_data['question'].iloc[idx],
              "context": self.qa_data['context'].iloc[idx],
              "answer": self.qa_data['answer'].iloc[idx],
              }
    return sample
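To see what the input_ids[:, -1] overwrite does on a padded sample, I’ve also been checking that step in isolation (again a sketch with dummy strings; the allenai checkpoint is just a stand-in for whichever pretrained tokenizer I load):

from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

enc = tokenizer("a question?", "a context.",
                max_length=16,
                padding="max_length",
                return_tensors="pt")
input_ids = enc["input_ids"]

# Mirror the step above: force the final (padded) position to be a sep token.
input_ids[:, -1] = tokenizer.sep_token_id

print(tokenizer.decode(input_ids[0, -5:]))                 # tail of the sequence
print((input_ids == tokenizer.sep_token_id).sum().item())  # total sep tokens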
When I loop over input_ids.squeeze() and count the sep tokens, I get three.
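And this is the per-row check I run on a collated batch right before the forward pass, since (as I read the error) the model expects exactly three separators in every sample of the batch; just a small torch helper:

import torch

def count_seps_per_row(input_ids: torch.Tensor, sep_token_id: int) -> torch.Tensor:
    """Return the number of separator tokens in each row of a (batch, seq_len) tensor."""
    return (input_ids == sep_token_id).sum(dim=1)

# e.g. on a batch coming out of the DataLoader:
# counts = count_seps_per_row(batch["input_ids"], tokenizer.sep_token_id)
# assert (counts == 3).all(), counts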
Any help would be greatly appreciated.