Subword regularization with SentencePiece and the DeBERTaV2 tokenizers (not working in the fast tokenizer)

Hi everyone,

I noticed something when I trained a SentencePiece tokenizer with the sentencepiece library and then used it in DebertaV2Tokenizer and DebertaV2TokenizerFast: subword regularization does not seem to work in the 'fast' one, even though, according to the documentation, it should.

Maybe someone can confirm this or tell me what I am doing wrong.

Tested with sentencepiece 0.1.97 and transformers 4.26.

Here is how I train the SentencePiece model:

import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input=<FILENAME>,       # path to the training text file
    model_prefix='spTest',
    vocab_size=1000,
    # special-token IDs and pieces, matching what DeBERTaV2 expects
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece='[PAD]',
    unk_piece='[UNK]',
    bos_piece='[CLS]',
    eos_piece='[SEP]',
    user_defined_symbols='[MASK]',
    model_type='unigram'
)

I load the trained model so I can test it, also setting the parameters that control subword regularization:

sp_processor = spm.SentencePieceProcessor(
    model_file="spTest.model",
    enable_sampling=True,
    nbest_size=-1,
    alpha=0.1,
)

Testing whether it works:

str_example_text = "Let's see whether subword regularization works."

for _ in range(10):
    print(", ".join(str(x) for x in sp_processor.encode(str_example_text)))

It looks to me like it does; the IDs differ from run to run:

293, 107, 258, 7, 67, 12, 12, 187, 242, 126, 51, 452, 133, 21, 47, 18, 217, 12, 28, 134, 52, 684, 187, 46, 99, 7, 8
6, 219, 107, 258, 7, 230, 12, 6, 133, 242, 126, 12, 47, 67, 45, 49, 133, 21, 47, 18, 6, 73, 28, 134, 22, 201, 178, 22, 215, 187, 46, 99, 7, 8
293, 12, 23, 258, 7, 6, 7, 12, 12, 6, 133, 242, 23, 242, 47, 452, 133, 21, 47, 18, 6, 73, 28, 134, 52, 31, 178, 22, 215, 6, 133, 46, 99, 7, 8
6, 219, 12, 23, 258, 7, 230, 12, 187, 115, 107, 115, 12, 47, 452, 133, 46, 18, 6, 47, 12, 28, 441, 31, 178, 93, 31, 84, 187, 46, 99, 7, 8
293, 12, 23, 258, 7, 230, 12, 187, 242, 354, 452, 133, 21, 47, 18, 85, 28, 441, 31, 178, 22, 215, 187, 21, 47, 99, 7, 8
293, 107, 258, 7, 6, 245, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 45, 20, 22, 47, 684, 6, 133, 21, 47, 99, 7, 8
293, 12, 23, 258, 7, 6, 7, 12, 12, 187, 115, 12, 354, 452, 133, 46, 18, 6, 73, 28, 441, 31, 178, 93, 31, 21, 42, 6, 133, 21, 47, 99, 7, 8
293, 12, 23, 258, 7, 6, 245, 12, 187, 115, 107, 115, 12, 47, 6, 7, 45, 49, 133, 46, 18, 6, 73, 28, 134, 52, 684, 187, 46, 99, 7, 8
293, 107, 258, 7, 230, 12, 187, 242, 126, 51, 6, 7, 45, 49, 133, 46, 18, 6, 47, 12, 28, 45, 20, 52, 31, 178, 22, 23, 105, 187, 21, 47, 99, 7, 8
293, 107, 258, 7, 6, 245, 12, 187, 115, 107, 115, 12, 47, 67, 45, 49, 133, 21, 47, 18, 6, 73, 28, 45, 20, 22, 47, 31, 178, 22, 23, 31, 84, 187, 21, 47, 99, 7, 8
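
For a quick sanity check (this is not part of my original run), one can count the distinct segmentations over a few hundred samples instead of eyeballing the printed lines; with sampling enabled I would expect the count to be well above 1:

# Hypothetical check: count distinct ID sequences produced for the same input.
# With enable_sampling=True the set should contain many different segmentations.
distinct_segmentations = {tuple(sp_processor.encode(str_example_text)) for _ in range(200)}
print(len(distinct_segmentations))  # should be well above 1 if sampling is active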

Now I load the same model into a transformers DebertaV2Tokenizer (same settings for subword regularization) and do the same thing:

from transformers import DebertaV2Tokenizer

tokenizer_deberta = DebertaV2Tokenizer(
    vocab_file="spTest.model",
    sp_model_kwargs={
        'enable_sampling': True,
        'nbest_size': -1,
        'alpha': 0.1
    }
)

for _ in range(10):
    print(", ".join(str(x) for x in tokenizer_deberta.encode(str_example_text)))

This also seems to work (and it is nice to see the beginning- and end-of-sequence tokens pop up here):

2, 293, 107, 258, 7, 230, 12, 187, 242, 23, 115, 12, 47, 452, 133, 46, 18, 85, 28, 134, 22, 201, 178, 93, 105, 6, 133, 46, 99, 7, 8, 3
2, 6, 219, 107, 258, 7, 6, 7, 12, 12, 187, 115, 12, 23, 115, 12, 47, 6, 7, 45, 49, 133, 46, 18, 6, 73, 28, 45, 20, 52, 684, 6, 133, 46, 99, 7, 8, 3
2, 6, 219, 107, 258, 7, 230, 12, 6, 133, 115, 107, 115, 51, 6, 7, 45, 49, 133, 46, 18, 6, 47, 12, 28, 441, 31, 178, 22, 23, 31, 84, 187, 21, 47, 99, 7, 8, 3
2, 6, 219, 12, 23, 258, 7, 230, 12, 6, 133, 115, 12, 126, 12, 47, 452, 133, 46, 18, 217, 12, 28, 134, 22, 47, 31, 178, 93, 31, 84, 6, 133, 46, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 6, 7, 12, 12, 6, 133, 242, 354, 452, 133, 21, 47, 18, 85, 28, 134, 52, 684, 6, 133, 46, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 67, 12, 12, 187, 115, 12, 354, 452, 133, 46, 18, 6, 73, 28, 45, 20, 52, 31, 178, 93, 31, 21, 42, 187, 21, 47, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 230, 12, 187, 242, 354, 452, 133, 21, 47, 18, 6, 73, 28, 134, 22, 201, 178, 91, 187, 21, 47, 99, 7, 8, 3
2, 6, 219, 12, 23, 258, 7, 6, 7, 12, 12, 187, 115, 12, 126, 51, 452, 133, 46, 18, 217, 12, 28, 45, 20, 52, 31, 178, 22, 23, 31, 84, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 115, 12, 354, 6, 7, 45, 49, 133, 46, 18, 6, 73, 28, 134, 22, 201, 178, 22, 23, 105, 6, 133, 21, 47, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 6, 245, 12, 187, 115, 12, 354, 452, 133, 21, 47, 18, 6, 47, 12, 28, 45, 112, 47, 31, 178, 22, 215, 6, 133, 21, 47, 99, 7, 8, 3
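
As a small side check (again, not part of my original run), the IDs 2 and 3 framing every sequence can be mapped back to their pieces with the standard convert_ids_to_tokens method, which should confirm they are the [CLS]/[SEP] pieces defined at training time:

# Hypothetical check: map the boundary IDs back to their pieces.
print(tokenizer_deberta.convert_ids_to_tokens([2, 3]))  # expected: ['[CLS]', '[SEP]']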

Of course I want things to be faster, so I also try DebertaV2TokenizerFast from transformers:

from transformers import DebertaV2TokenizerFast

tokenizer_deberta_fast = DebertaV2TokenizerFast(
    vocab_file="spTest.model",
    sp_model_kwargs={
        'enable_sampling': True,
        'nbest_size': -1,
        'alpha': 0.1
    }
)

for _ in range(10):
    print(", ".join(str(x) for x in tokenizer_deberta_fast.encode(str_example_text)))

And here it is no longer working; every run produces exactly the same IDs:

2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
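
The same distinct-count idea from above makes the difference explicit. A minimal sketch, assuming the tokenizer_deberta and tokenizer_deberta_fast objects defined earlier:

# Hypothetical comparison: how many distinct segmentations does each tokenizer
# produce for the same sentence under the same sp_model_kwargs?
for name, tok in [("slow", tokenizer_deberta), ("fast", tokenizer_deberta_fast)]:
    distinct = {tuple(tok.encode(str_example_text)) for _ in range(200)}
    print(name, len(distinct))  # I would expect >1 for both, but "fast" seems stuck at 1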

Does anyone know whether this is intentional? Here is the snippet from the documentation that makes me think it should work:

  • sp_model_kwargs (dict, optional) – Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:
    • enable_sampling: Enable subword regularization.
    • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.
      • nbest_size = {0,1}: No sampling is performed.
      • nbest_size > 1: samples from the nbest_size results.
      • nbest_size < 0: assuming that nbest_size is infinite and samples from the all hypothesis (lattice) using forward-filtering-and-backward-sampling algorithm.
    • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.

Thank you all for your help.
