Hello friends! I’m running into a curious issue. I trained a tokenizer on my own corpus, and the model I’m training with it isn’t doing particularly well – I suspect because I’m fine-tuning a pre-trained model, so my custom tokenizer doesn’t match the vocabulary the model was pre-trained with.
So I’ve moved back to a pretrained tokenizer, and I get the following error:
AssertionError: There should be exactly three separator tokens: 2 in every sample for questions answering. You might also consider to set 'global_attention_mask' manually in the forward function to avoid this error.
The model trains fine with the tokenizer I generated, but when I use LongformerTokenizerFast.from_pretrained, I get the above error.
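As a sanity check, the pretrained tokenizer on its own does seem to produce exactly three separators when I encode a plain question/context pair (a minimal sketch with made-up strings, using the standard allenai checkpoint as a stand-in for mine):

from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

# Plain pair encoding, letting the tokenizer insert its own special tokens:
# <s> question </s></s> context </s>
enc = tokenizer("Who wrote the book?", "The book was written by Jane.")
print(enc["input_ids"].count(tokenizer.sep_token_id))  # 3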
I’m encoding with:
def convert_to_features(self, item):
    if self.print_text:
        print(f"Input Text: {item['context']}")

    encodings = self.tokenizer("<s>" + item["question"] + "</s>",
                               "</s>" + item["context"],
                               truncation="only_second",
                               max_length=self.max_len,
                               padding="max_length",
                               return_offsets_mapping=True,
                               return_tensors="pt")
    return encodings
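Since add_special_tokens defaults to True, the tokenizer also inserts its own <s>/</s> around the ones I’m splicing into the strings above, so I’ve been decoding a dummy call to see where the separators actually land (a sketch with placeholder strings and the standard allenai checkpoint):

from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

# Same construction as convert_to_features, but with dummy strings and a small max_length.
enc = tokenizer("<s>" + "dummy question" + "</s>",
                "</s>" + "dummy context",
                truncation="only_second",
                max_length=32,
                padding="max_length")

# Decode to see the full layout of <s>/</s>, then count the separator tokens.
print(tokenizer.decode(enc["input_ids"]))
print(enc["input_ids"].count(tokenizer.sep_token_id))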
My dataset’s __getitem__ is:
def __getitem__(self, idx):
    if torch.is_tensor(idx):
        idx = idx.tolist()

    encodings = self.convert_to_features(self.qa_data.iloc[idx])
    offset_mapping = encodings["offset_mapping"]
    if self.qa_data['id'].iloc[idx] not in self.offset_mapping.keys():
        self.offset_mapping[self.qa_data['id'].iloc[idx]] = offset_mapping

    input_ids = encodings["input_ids"]
    seps = self.tokenizer.sep_token_id * torch.ones_like(input_ids[:, -1])
    input_ids[:, -1] = seps
    attention_mask = encodings["attention_mask"]

    sample = {"input_ids": input_ids.squeeze(),
              "attention_mask": attention_mask.squeeze(),
              "start_positions": self.qa_data['start_idx'].iloc[idx],
              "end_positions": self.qa_data['end_idx'].iloc[idx],
              "question": self.qa_data['question'].iloc[idx],
              "context": self.qa_data['context'].iloc[idx],
              "answer": self.qa_data['answer'].iloc[idx],
              }
    return sample
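To see what the input_ids[:, -1] overwrite does on a padded sample, I’ve also been checking that step in isolation (again a sketch with dummy strings; the allenai checkpoint is just a stand-in for whichever pretrained tokenizer I load):

from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

enc = tokenizer("a question?", "a context.",
                max_length=16,
                padding="max_length",
                return_tensors="pt")
input_ids = enc["input_ids"]

# Mirror the step above: force the final (padded) position to be a sep token.
input_ids[:, -1] = tokenizer.sep_token_id

print(tokenizer.decode(input_ids[0, -5:]))                 # tail of the sequence
print((input_ids == tokenizer.sep_token_id).sum().item())  # total sep tokens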
When I loop over input_ids.squeeze() and count the sep tokens, I get three.
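And this is the per-row check I run on a collated batch right before the forward pass, since (as I read the error) the model expects exactly three separators in every sample of the batch; just a small torch helper:

import torch

def count_seps_per_row(input_ids: torch.Tensor, sep_token_id: int) -> torch.Tensor:
    """Return the number of separator tokens in each row of a (batch, seq_len) tensor."""
    return (input_ids == sep_token_id).sum(dim=1)

# e.g. on a batch coming out of the DataLoader:
# counts = count_seps_per_row(batch["input_ids"], tokenizer.sep_token_id)
# assert (counts == 3).all(), counts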
Any help would be greatly appreciated.