Unmasking adds an extra space with a custom BPE tokenizer

I created a custom BPE tokenizer for pre-training a RoBERTa model with the following parameters (I tried to align them with the default settings of RoBERTa's BPE tokenizer):

from tokenizers import ByteLevelBPETokenizer, normalizers
from tokenizers.processors import RobertaProcessing

tokenizer = ByteLevelBPETokenizer()
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=False)
# Special tokens are registered first, so they receive ids 0-3 in this order,
# matching the ids used for the post-processor and padding below.
tokenizer.train_from_iterator(Data_full, vocab_size=50264, min_frequency=2,
                              special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.add_special_tokens(["<mask>"])  # <mask> is appended after training
tokenizer.post_processor = RobertaProcessing(sep=("</s>", 2), cls=("<s>", 0),
                                             trim_offsets=False, add_prefix_space=False)
tokenizer.enable_padding(direction='right', pad_id=1, pad_type_id=1,
                         pad_token="<pad>", length=512)
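Between this snippet and the next one, the trained tokenizer is written to disk. A minimal sketch of that step, assuming the same 'tokenizer_file' path that the loading code below uses:

# Assumed serialization step (not shown above): save the trained tokenizer
# so it can be reloaded with Tokenizer.from_file('tokenizer_file') below.
tokenizer.save('tokenizer_file')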

When pre-training a RoBERTa model with this tokenizer, I observe unusual behavior during unmasking:

from tokenizers import Tokenizer
from transformers import RobertaTokenizerFast, pipeline

tokenizer_in = Tokenizer.from_file('tokenizer_file')
tokenizer_m = RobertaTokenizerFast(tokenizer_object=tokenizer_in, clean_up_tokenization_spaces=True)
# model_m is the RoBERTa model pre-trained with this tokenizer
unmasker = pipeline('fill-mask', model=model_m, tokenizer=tokenizer_m)
unmasker("Capital of France is <mask>.")

The unmasker output consistently comes back as "Capital of France is  Paris." with two spaces before "Paris". I'm curious where this persistent extra space comes from; I thought setting clean_up_tokenization_spaces=True would take care of it. Could there be an error in my code that causes this? It happens for every unmasking prompt. However, when I run a test without the space before the mask, e.g. unmasker("Capital of France is<mask>."), the predictions improve and the extra space disappears.
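For reference, a minimal sketch of how the two prompt variants can be compared side by side, reusing the unmasker defined above (repr makes any doubled space visible):

# Compare the rendered top prediction for a prompt with and without a space
# before <mask>.
for text in ["Capital of France is <mask>.", "Capital of France is<mask>."]:
    top = unmasker(text)[0]
    print(repr(text), "->", repr(top["sequence"]))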