Period ID in RobertaTokenizer with is_split_into_words

Hello,

I have an issue when using the is_split_into_words flag: for the two-sentence text below, the period gets a different token ID depending on whether the flag is set.

import spacy
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', use_fast=True, add_prefix_space=True)
nlp = spacy.load("en_core_web_sm", exclude=["tagger", "parser", "lemmatizer", "ner", "textcat"])

text = "This is a two sentences text. The period ID is different."

spacy_tokenized = [tok.text for tok in nlp(text)]

print(tokenizer(text, add_special_tokens=False)['input_ids'])
> [713, 16, 10, 80, 11305, 2788, 4, 20, 675, 4576, 16, 430, 4]

print(tokenizer(spacy_tokenized, add_special_tokens=False, is_split_into_words=True)['input_ids'])
> [152, 16, 10, 80, 11305, 2788, 479, 20, 675, 4576, 16, 430, 479]

As you can see, the period's ID differs: with is_split_into_words=True it is 479, and 4 otherwise.

I know this is expected behavior: the tokenizer adds a space before each word when is_split_into_words=True, so technically these are two different entries in the vocabulary.
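
Mapping the two IDs back to tokens makes this visible (just a quick check; the strings below are what convert_ids_to_tokens returns for the roberta-base vocab, where "Ġ" marks the leading space):

print(tokenizer.convert_ids_to_tokens([4, 479]))
> ['.', 'Ġ.']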

But I don't want this extra space when decoding. Is there a workaround or suggestion for avoiding it?
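
For now, the only workaround I can think of is cleaning the decoded string by hand, e.g. with a rough regex like the sketch below (clean_up_tokenization_spaces=False is only there to keep the raw spacing visible; the regex is a naive fix, not a real solution):

import re

ids = tokenizer(spacy_tokenized, add_special_tokens=False, is_split_into_words=True)['input_ids']
decoded = tokenizer.decode(ids, clean_up_tokenization_spaces=False)  # keep the raw spacing
cleaned = re.sub(r" (?=[.,!?;:])", "", decoded)  # drop the space inserted before punctuation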

Thanks in advance,
Shon

Hi @shon711, I think the problem may be with the spaCy tokenization. Please take a look at the following result:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True)

text = "This is a two sentences text. The period ID is different."
_text = text.split()  # replace spaCy with a plain whitespace split for testing

print(tokenizer(text).input_ids)
print(tokenizer(_text, is_split_into_words=True).input_ids)
# > the results are the same!

So I think you should modify spacy_tokenized to be in the same format as _text; then you will get the same result :hugs:
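
One way to get that format (just a sketch, reusing the nlp and text objects from your snippet): spaCy tokens carry their trailing whitespace in whitespace_, so you can merge tokens that are not followed by a space back into whitespace-delimited words:

words, current = [], ""
for tok in nlp(text):
    current += tok.text
    if tok.whitespace_:  # token is followed by a space -> word boundary
        words.append(current)
        current = ""
if current:
    words.append(current)

print(words)
# > ['This', 'is', 'a', 'two', 'sentences', 'text.', 'The', 'period', 'ID', 'is', 'different.']
print(tokenizer(words, is_split_into_words=True).input_ids)
# should now match tokenizer(text).input_ids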