Tokenization error only with the pretrained tokenizer

Hello friends! I’m running into a curious issue. I have a tokenizer that I trained on my own corpus, and the model I’m training with it isn’t doing particularly well. I suspect that’s because I’m fine-tuning a pre-trained model, whose embeddings don’t match my custom vocabulary.

So I’ve moved back to a pretrained tokenizer, and I get the following error:

    AssertionError: There should be exactly three separator tokens: 2 in every sample for questions answering. You might also consider to set 'global_attention_mask' manually in the forward function to avoid this error.

The model trains fine with the tokenizer I generated, but when I use LongformerTokenizerFast.from_pretrained, I get the above error.
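
For reference, I’m loading the pretrained tokenizer roughly like this (a minimal sketch; the allenai/longformer-base-4096 checkpoint name is just the stock one, substitute whichever checkpoint you actually use):

    from transformers import LongformerTokenizerFast

    # Minimal sketch: load a stock pretrained Longformer tokenizer from the Hub.
    # The checkpoint name is a placeholder for whichever Longformer checkpoint is in use.
    tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")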

I’m encoding with:

    def convert_to_features(self, item):
        if self.print_text:
            print(f"Input Text: {item['context']}")

        encodings = self.tokenizer("<s>" + item["question"] + "</s>",
                                   "</s>" + item["context"],
                                   truncation="only_second",
                                   max_length=self.max_len,
                                   padding="max_length",
                                   return_offsets_mapping=True,
                                   return_tensors="pt")

        return encodings
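
For comparison, letting the pretrained tokenizer insert its own special tokens (no manual "<s>"/"</s>" strings) would look roughly like this; this is only a sketch of the alternative, not what I currently run:

    # Sketch: pass the pair directly and let the tokenizer add the special tokens.
    # A RoBERTa-style tokenizer (which Longformer uses) encodes a pair as
    # <s> question </s></s> context </s>, i.e. three </s> separators.
    encodings = self.tokenizer(item["question"],
                               item["context"],
                               truncation="only_second",
                               max_length=self.max_len,
                               padding="max_length",
                               return_offsets_mapping=True,
                               return_tensors="pt")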

My dataset’s __getitem__ is:

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        encodings = self.convert_to_features(self.qa_data.iloc[idx])
        offset_mapping = encodings["offset_mapping"]
        # Cache the offset mapping per example id so answer spans can be mapped back later.
        if self.qa_data['id'].iloc[idx] not in self.offset_mapping.keys():
            self.offset_mapping[self.qa_data['id'].iloc[idx]] = offset_mapping
        input_ids = encodings["input_ids"]
        # Force the last token of every sequence to be the sep token.
        seps = self.tokenizer.sep_token_id * torch.ones_like(input_ids[:, -1])
        input_ids[:, -1] = seps
        attention_mask = encodings["attention_mask"]

        sample = {"input_ids": input_ids.squeeze(),
                  "attention_mask": attention_mask.squeeze(),
                  "start_positions": self.qa_data['start_idx'].iloc[idx],
                  "end_positions": self.qa_data['end_idx'].iloc[idx],
                  "question": self.qa_data['question'].iloc[idx],
                  "context": self.qa_data['context'].iloc[idx],
                  "answer": self.qa_data['answer'].iloc[idx],
                  }

        return sample

and when I loop over input_ids.squeeze() and count the sep tokens, I get exactly three.
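
Concretely, the check looks roughly like this (a minimal sketch; dataset and tokenizer stand for my Dataset instance and its tokenizer):

    # Minimal sketch of the check: count the sep tokens in one sample.
    sample = dataset[0]  # placeholder index
    sep_count = (sample["input_ids"] == tokenizer.sep_token_id).sum().item()
    print(sep_count)  # I get 3 here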

Any help would be greatly appreciated.

@egalinkin did you find any solution?