Hello,
I have an issue when using the is_split_into_words flag. Here is a minimal example:
import spacy
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', use_fast=True, add_prefix_space=True)
# only spaCy's rule-based tokenizer is needed here
nlp = spacy.load("en_core_web_sm", exclude=["tagger", "parser", "lemmatizer", "ner", "textcat"])

text = "This is a two sentences text. The period ID is different."
spacy_tokenized = [tok.text for tok in nlp(text)]

# tokenizing the raw text
print(tokenizer(text, add_special_tokens=False)['input_ids'])
> [713, 16, 10, 80, 11305, 2788, 4, 20, 675, 4576, 16, 430, 4]
# tokenizing the pre-split words
print(tokenizer(spacy_tokenized, add_special_tokens=False, is_split_into_words=True)['input_ids'])
> [152, 16, 10, 80, 11305, 2788, 479, 20, 675, 4576, 16, 430, 479]
As you can see, the period's ID is different: with is_split_into_words=True it is 479, but 4 otherwise.
Now, I know this is expected behavior: because of add_prefix_space=True, the tokenizer adds a space before each word when is_split_into_words=True, so "." becomes " ." and these are technically two different entries in the vocabulary.
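Checking the vocabulary directly confirms they really are two distinct entries (the 'Ġ' prefix marks a leading space in RoBERTa's byte-level BPE):
print(tokenizer.convert_ids_to_tokens([4, 479]))
> ['.', 'Ġ.']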
But I don't need this extra space when decoding. Is there any workaround or suggestion for solving this?
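One workaround I have been experimenting with (not sure it is the intended approach) is to tokenize the raw text instead of the pre-split words, and align the subwords back to the spaCy tokens through character offsets:
# sketch: keep the raw-text IDs, map each subword to the spaCy token it overlaps
enc = tokenizer(text, add_special_tokens=False, return_offsets_mapping=True)
spans = [(tok.idx, tok.idx + len(tok.text)) for tok in nlp(text)]
word_ids = [next(i for i, (s, e) in enumerate(spans) if start < e and end > s)
            for start, end in enc["offset_mapping"]]
print(enc['input_ids'])  # same IDs as the raw-text call above, the period stays 4
This keeps the original IDs while still giving a word-level alignment, though I assume it only works when the spaCy and BPE token boundaries are compatible (contractions like "don't" can split differently).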
Thanks in advance,
Shon