Subword regularization with SentencePiece and the DeBERTaV2 tokenizers (not working in the fast tokenizer)

Hi everyone,

I noticed something when I trained a SentencePiece tokenizer with the sentencepiece library and then used it in DebertaV2Tokenizer and DebertaV2TokenizerFast: subword regularization does not seem to work in the 'fast' one, even though, according to the documentation, it should.

Maybe someone can confirm this or tell me what I am doing wrong.

Tested with sentencepiece 0.1.97 and transformers 4.26.

Here is how I train the SentencePiece model:

import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input=<FILENAME>,       # path to the training text file
    model_prefix='spTest',
    vocab_size=1000,
    # special-token IDs and pieces, matching what DeBERTaV2 expects
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece='[PAD]',
    unk_piece='[UNK]',
    bos_piece='[CLS]',
    eos_piece='[SEP]',
    user_defined_symbols='[MASK]',
    model_type='unigram'
)

I load the trained model so I can test it, also setting the parameters that control subword regularization:

sp_processor = spm.SentencePieceProcessor(
    model_file="spTest.model",
    enable_sampling=True,
    nbest_size=-1,
    alpha=0.1,
)

Testing whether it works:

str_example_text = "Let's see whether subword regularization works."

for _ in range(10):
    print(", ".join(str(x) for x in sp_processor.encode(str_example_text)))

It looks to me like it does; the IDs differ from run to run:

293, 107, 258, 7, 67, 12, 12, 187, 242, 126, 51, 452, 133, 21, 47, 18, 217, 12, 28, 134, 52, 684, 187, 46, 99, 7, 8
6, 219, 107, 258, 7, 230, 12, 6, 133, 242, 126, 12, 47, 67, 45, 49, 133, 21, 47, 18, 6, 73, 28, 134, 22, 201, 178, 22, 215, 187, 46, 99, 7, 8
293, 12, 23, 258, 7, 6, 7, 12, 12, 6, 133, 242, 23, 242, 47, 452, 133, 21, 47, 18, 6, 73, 28, 134, 52, 31, 178, 22, 215, 6, 133, 46, 99, 7, 8
6, 219, 12, 23, 258, 7, 230, 12, 187, 115, 107, 115, 12, 47, 452, 133, 46, 18, 6, 47, 12, 28, 441, 31, 178, 93, 31, 84, 187, 46, 99, 7, 8
293, 12, 23, 258, 7, 230, 12, 187, 242, 354, 452, 133, 21, 47, 18, 85, 28, 441, 31, 178, 22, 215, 187, 21, 47, 99, 7, 8
293, 107, 258, 7, 6, 245, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 45, 20, 22, 47, 684, 6, 133, 21, 47, 99, 7, 8
293, 12, 23, 258, 7, 6, 7, 12, 12, 187, 115, 12, 354, 452, 133, 46, 18, 6, 73, 28, 441, 31, 178, 93, 31, 21, 42, 6, 133, 21, 47, 99, 7, 8
293, 12, 23, 258, 7, 6, 245, 12, 187, 115, 107, 115, 12, 47, 6, 7, 45, 49, 133, 46, 18, 6, 73, 28, 134, 52, 684, 187, 46, 99, 7, 8
293, 107, 258, 7, 230, 12, 187, 242, 126, 51, 6, 7, 45, 49, 133, 46, 18, 6, 47, 12, 28, 45, 20, 52, 31, 178, 22, 23, 105, 187, 21, 47, 99, 7, 8
293, 107, 258, 7, 6, 245, 12, 187, 115, 107, 115, 12, 47, 67, 45, 49, 133, 21, 47, 18, 6, 73, 28, 45, 20, 22, 47, 31, 178, 22, 23, 31, 84, 187, 21, 47, 99, 7, 8
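
For a quick sanity check (this is not part of my original run), one can count the distinct segmentations over a few hundred samples instead of eyeballing the printed lines; with sampling enabled I would expect the count to be well above 1:

# Hypothetical check: count distinct ID sequences produced for the same input.
# With enable_sampling=True the set should contain many different segmentations.
distinct_segmentations = {tuple(sp_processor.encode(str_example_text)) for _ in range(200)}
print(len(distinct_segmentations))  # should be well above 1 if sampling is active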

Now I load the same model into a transformers DebertaV2Tokenizer (same settings for subword regularization) and do the same thing:

from transformers import DebertaV2Tokenizer

tokenizer_deberta = DebertaV2Tokenizer(
    vocab_file="spTest.model",
    sp_model_kwargs={
        'enable_sampling': True,
        'nbest_size': -1,
        'alpha': 0.1
    }
)

for _ in range(10):
    print(", ".join(str(x) for x in tokenizer_deberta.encode(str_example_text)))

This also seems to work (and it is nice to see the beginning- and end-of-sequence tokens pop up here):

2, 293, 107, 258, 7, 230, 12, 187, 242, 23, 115, 12, 47, 452, 133, 46, 18, 85, 28, 134, 22, 201, 178, 93, 105, 6, 133, 46, 99, 7, 8, 3
2, 6, 219, 107, 258, 7, 6, 7, 12, 12, 187, 115, 12, 23, 115, 12, 47, 6, 7, 45, 49, 133, 46, 18, 6, 73, 28, 45, 20, 52, 684, 6, 133, 46, 99, 7, 8, 3
2, 6, 219, 107, 258, 7, 230, 12, 6, 133, 115, 107, 115, 51, 6, 7, 45, 49, 133, 46, 18, 6, 47, 12, 28, 441, 31, 178, 22, 23, 31, 84, 187, 21, 47, 99, 7, 8, 3
2, 6, 219, 12, 23, 258, 7, 230, 12, 6, 133, 115, 12, 126, 12, 47, 452, 133, 46, 18, 217, 12, 28, 134, 22, 47, 31, 178, 93, 31, 84, 6, 133, 46, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 6, 7, 12, 12, 6, 133, 242, 354, 452, 133, 21, 47, 18, 85, 28, 134, 52, 684, 6, 133, 46, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 67, 12, 12, 187, 115, 12, 354, 452, 133, 46, 18, 6, 73, 28, 45, 20, 52, 31, 178, 93, 31, 21, 42, 187, 21, 47, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 230, 12, 187, 242, 354, 452, 133, 21, 47, 18, 6, 73, 28, 134, 22, 201, 178, 91, 187, 21, 47, 99, 7, 8, 3
2, 6, 219, 12, 23, 258, 7, 6, 7, 12, 12, 187, 115, 12, 126, 51, 452, 133, 46, 18, 217, 12, 28, 45, 20, 52, 31, 178, 22, 23, 31, 84, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 115, 12, 354, 6, 7, 45, 49, 133, 46, 18, 6, 73, 28, 134, 22, 201, 178, 22, 23, 105, 6, 133, 21, 47, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 6, 245, 12, 187, 115, 12, 354, 452, 133, 21, 47, 18, 6, 47, 12, 28, 45, 112, 47, 31, 178, 22, 215, 6, 133, 21, 47, 99, 7, 8, 3
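
As a small side check (again, not part of my original run), the IDs 2 and 3 framing every sequence can be mapped back to their pieces with the standard convert_ids_to_tokens method, which should confirm they are the [CLS]/[SEP] pieces defined at training time:

# Hypothetical check: map the boundary IDs back to their pieces.
print(tokenizer_deberta.convert_ids_to_tokens([2, 3]))  # expected: ['[CLS]', '[SEP]']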

Of course I want things to be faster, so I also try DebertaV2TokenizerFast from transformers:

from transformers import DebertaV2TokenizerFast

tokenizer_deberta_fast = DebertaV2TokenizerFast(
    vocab_file="spTest.model",
    sp_model_kwargs={
        'enable_sampling': True,
        'nbest_size': -1,
        'alpha': 0.1
    }
)

for _ in range(10):
    print(", ".join(str(x) for x in tokenizer_deberta_fast.encode(str_example_text)))

And here it is no longer working; every run produces exactly the same IDs:

2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
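
The same distinct-count idea from above makes the difference explicit. A minimal sketch, assuming the tokenizer_deberta and tokenizer_deberta_fast objects defined earlier:

# Hypothetical comparison: how many distinct segmentations does each tokenizer
# produce for the same sentence under the same sp_model_kwargs?
for name, tok in [("slow", tokenizer_deberta), ("fast", tokenizer_deberta_fast)]:
    distinct = {tuple(tok.encode(str_example_text)) for _ in range(200)}
    print(name, len(distinct))  # I would expect >1 for both, but "fast" seems stuck at 1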

Does anyone know whether this is intentional? Here is the snippet from the documentation that makes me think it should work:

  • sp_model_kwargs (dict, optional) – Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:
    • enable_sampling: Enable subword regularization.
    • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.
      • nbest_size = {0,1}: No sampling is performed.
      • nbest_size > 1: samples from the nbest_size results.
      • nbest_size < 0: assuming that nbest_size is infinite and samples from the all hypothesis (lattice) using forward-filtering-and-backward-sampling algorithm.
    • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.

Thank you all for your help.
