Period ID in RobertaTokenizer with is_split_into_words

Hello,

I have an issue when using the is_split_into_words flag: for the two-sentence text below, the period gets a different token ID depending on whether the flag is set.

import spacy
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', use_fast=True, add_prefix_space=True)
nlp = spacy.load("en_core_web_sm", exclude=["tagger", "parser", "lemmatizer", "ner", "textcat"])

text = "This is a two sentences text. The period ID is different."

spacy_tokenized = [tok.text for tok in nlp(text)]

print(tokenizer(text, add_special_tokens=False)['input_ids'])
> [713, 16, 10, 80, 11305, 2788, 4, 20, 675, 4576, 16, 430, 4]

print(tokenizer(spacy_tokenized, add_special_tokens=False, is_split_into_words=True)['input_ids'])
> [152, 16, 10, 80, 11305, 2788, 479, 20, 675, 4576, 16, 430, 479]

As you can see, the period's ID differs: with is_split_into_words=True it is 479, and 4 otherwise.

I know this is expected behavior: the tokenizer adds a space before each word when is_split_into_words=True, so technically these are two different entries in the vocabulary.
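
Mapping the two IDs back to tokens makes this visible (just a quick check; the strings below are what convert_ids_to_tokens returns for the roberta-base vocab, where "Ġ" marks the leading space):

print(tokenizer.convert_ids_to_tokens([4, 479]))
> ['.', 'Ġ.']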

But I don't want this extra space when decoding. Is there a workaround or suggestion for avoiding it?
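
For now, the only workaround I can think of is cleaning the decoded string by hand, e.g. with a rough regex like the sketch below (clean_up_tokenization_spaces=False is only there to keep the raw spacing visible; the regex is a naive fix, not a real solution):

import re

ids = tokenizer(spacy_tokenized, add_special_tokens=False, is_split_into_words=True)['input_ids']
decoded = tokenizer.decode(ids, clean_up_tokenization_spaces=False)  # keep the raw spacing
cleaned = re.sub(r" (?=[.,!?;:])", "", decoded)  # drop the space inserted before punctuation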

Thanks in advance,
Shon

Hi @shon711, I think the problem may be with the spaCy tokenization. Please take a look at the following result:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True)

text = "This is a two sentences text. The period ID is different."
_text = text.split()  # replace spaCy with a plain whitespace split for testing

print(tokenizer(text).input_ids)
print(tokenizer(_text, is_split_into_words=True).input_ids)
# > the results are the same!

So I think you should modify spacy_tokenized to be in the same format as _text; then you will get the same result :hugs:
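
One way to get that format (just a sketch, reusing the nlp and text objects from your snippet): spaCy tokens carry their trailing whitespace in whitespace_, so you can merge tokens that are not followed by a space back into whitespace-delimited words:

words, current = [], ""
for tok in nlp(text):
    current += tok.text
    if tok.whitespace_:  # token is followed by a space -> word boundary
        words.append(current)
        current = ""
if current:
    words.append(current)

print(words)
# > ['This', 'is', 'a', 'two', 'sentences', 'text.', 'The', 'period', 'ID', 'is', 'different.']
print(tokenizer(words, is_split_into_words=True).input_ids)
# should now match tokenizer(text).input_ids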