Hi everyone,
I noticed something when I trained a SentencePiece tokenizer with the sentencepiece library and then used it in DebertaV2Tokenizer and DebertaV2TokenizerFast: subword regularization does not seem to work in the "fast" one, even though, according to the documentation, it should.
Maybe someone can confirm this or tell me what I am doing wrong.
Tested with sentencepiece 0.1.97 and transformers 4.26.
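If you want to reproduce this, the installed versions can be checked like this (a simple sketch for reproducibility, not part of the original experiment):

import sentencepiece
import transformers

# confirm the versions match the ones mentioned above
print(sentencepiece.__version__, transformers.__version__)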
Here is how I train the Sentencepiece model:
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input=<FILENAME>,  # path to the training text file
    model_prefix='spTest',
    vocab_size=1000,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece='[PAD]',
    unk_piece='[UNK]',
    bos_piece='[CLS]',
    eos_piece='[SEP]',
    user_defined_symbols='[MASK]',
    model_type='unigram'
)
I load the trained model so I can test it, also setting the parameters that control the subword regularization:
sp_processor = spm.SentencePieceProcessor(
    model_file="spTest.model",
    enable_sampling=True,
    nbest_size=-1,
    alpha=0.1,
)
Testing whether it works:
str_example_text = "Let's see whether subword regularization works."
for _ in range(10):
    print(", ".join([str(x) for x in sp_processor.encode(str_example_text)]))
Looks to me like it does:
293, 107, 258, 7, 67, 12, 12, 187, 242, 126, 51, 452, 133, 21, 47, 18, 217, 12, 28, 134, 52, 684, 187, 46, 99, 7, 8
6, 219, 107, 258, 7, 230, 12, 6, 133, 242, 126, 12, 47, 67, 45, 49, 133, 21, 47, 18, 6, 73, 28, 134, 22, 201, 178, 22, 215, 187, 46, 99, 7, 8
293, 12, 23, 258, 7, 6, 7, 12, 12, 6, 133, 242, 23, 242, 47, 452, 133, 21, 47, 18, 6, 73, 28, 134, 52, 31, 178, 22, 215, 6, 133, 46, 99, 7, 8
6, 219, 12, 23, 258, 7, 230, 12, 187, 115, 107, 115, 12, 47, 452, 133, 46, 18, 6, 47, 12, 28, 441, 31, 178, 93, 31, 84, 187, 46, 99, 7, 8
293, 12, 23, 258, 7, 230, 12, 187, 242, 354, 452, 133, 21, 47, 18, 85, 28, 441, 31, 178, 22, 215, 187, 21, 47, 99, 7, 8
293, 107, 258, 7, 6, 245, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 45, 20, 22, 47, 684, 6, 133, 21, 47, 99, 7, 8
293, 12, 23, 258, 7, 6, 7, 12, 12, 187, 115, 12, 354, 452, 133, 46, 18, 6, 73, 28, 441, 31, 178, 93, 31, 21, 42, 6, 133, 21, 47, 99, 7, 8
293, 12, 23, 258, 7, 6, 245, 12, 187, 115, 107, 115, 12, 47, 6, 7, 45, 49, 133, 46, 18, 6, 73, 28, 134, 52, 684, 187, 46, 99, 7, 8
293, 107, 258, 7, 230, 12, 187, 242, 126, 51, 6, 7, 45, 49, 133, 46, 18, 6, 47, 12, 28, 45, 20, 52, 31, 178, 22, 23, 105, 187, 21, 47, 99, 7, 8
293, 107, 258, 7, 6, 245, 12, 187, 115, 107, 115, 12, 47, 67, 45, 49, 133, 21, 47, 18, 6, 73, 28, 45, 20, 22, 47, 31, 178, 22, 23, 31, 84, 187, 21, 47, 99, 7, 8
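A quick way to quantify this (just a small sketch for checking, not part of the original run): encode the same sentence many times and count how many distinct tokenizations come out. With sampling enabled, this should be well above 1.

# encode the same sentence repeatedly and count distinct tokenizations
unique_encodings = {
    tuple(sp_processor.encode(str_example_text)) for _ in range(100)
}
print(len(unique_encodings))  # expect a number well above 1 when sampling is on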
Now I load the same model into a transformers DebertaV2Tokenizer (same settings for subword regularization) and do the same thing:
from transformers import DebertaV2Tokenizer

tokenizer_deberta = DebertaV2Tokenizer(
    vocab_file="spTest.model",
    sp_model_kwargs={
        'enable_sampling': True,
        'nbest_size': -1,
        'alpha': 0.1
    }
)
for _ in range(10):
    print(", ".join([str(x) for x in tokenizer_deberta.encode(str_example_text)]))
This also seems to work (and it is nice to see the beginning/end-of-sequence tokens show up here):
2, 293, 107, 258, 7, 230, 12, 187, 242, 23, 115, 12, 47, 452, 133, 46, 18, 85, 28, 134, 22, 201, 178, 93, 105, 6, 133, 46, 99, 7, 8, 3
2, 6, 219, 107, 258, 7, 6, 7, 12, 12, 187, 115, 12, 23, 115, 12, 47, 6, 7, 45, 49, 133, 46, 18, 6, 73, 28, 45, 20, 52, 684, 6, 133, 46, 99, 7, 8, 3
2, 6, 219, 107, 258, 7, 230, 12, 6, 133, 115, 107, 115, 51, 6, 7, 45, 49, 133, 46, 18, 6, 47, 12, 28, 441, 31, 178, 22, 23, 31, 84, 187, 21, 47, 99, 7, 8, 3
2, 6, 219, 12, 23, 258, 7, 230, 12, 6, 133, 115, 12, 126, 12, 47, 452, 133, 46, 18, 217, 12, 28, 134, 22, 47, 31, 178, 93, 31, 84, 6, 133, 46, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 6, 7, 12, 12, 6, 133, 242, 354, 452, 133, 21, 47, 18, 85, 28, 134, 52, 684, 6, 133, 46, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 67, 12, 12, 187, 115, 12, 354, 452, 133, 46, 18, 6, 73, 28, 45, 20, 52, 31, 178, 93, 31, 21, 42, 187, 21, 47, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 230, 12, 187, 242, 354, 452, 133, 21, 47, 18, 6, 73, 28, 134, 22, 201, 178, 91, 187, 21, 47, 99, 7, 8, 3
2, 6, 219, 12, 23, 258, 7, 6, 7, 12, 12, 187, 115, 12, 126, 51, 452, 133, 46, 18, 217, 12, 28, 45, 20, 52, 31, 178, 22, 23, 31, 84, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 115, 12, 354, 6, 7, 45, 49, 133, 46, 18, 6, 73, 28, 134, 22, 201, 178, 22, 23, 105, 6, 133, 21, 47, 99, 7, 8, 3
2, 293, 12, 23, 258, 7, 6, 245, 12, 187, 115, 12, 354, 452, 133, 21, 47, 18, 6, 47, 12, 28, 45, 112, 47, 31, 178, 22, 215, 6, 133, 21, 47, 99, 7, 8, 3
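As a side check (just a sketch, using the usual tokenizer API), the framing ids can be mapped back to their pieces to confirm they are the [CLS]/[SEP] pieces defined during training:

# map the framing ids back to their pieces
print(tokenizer_deberta.convert_ids_to_tokens([2, 3]))  # expected: ['[CLS]', '[SEP]']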
Of course I want this to be faster, so I also try the transformers DebertaV2TokenizerFast:
from transformers import DebertaV2TokenizerFast

tokenizer_deberta_fast = DebertaV2TokenizerFast(
    vocab_file="spTest.model",
    sp_model_kwargs={
        'enable_sampling': True,
        'nbest_size': -1,
        'alpha': 0.1
    }
)
for _ in range(10):
    print(", ".join([str(x) for x in tokenizer_deberta_fast.encode(str_example_text)]))
And here it is no longer working; every run produces exactly the same token ids:
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
2, 293, 107, 258, 7, 230, 12, 187, 242, 354, 452, 133, 46, 18, 85, 28, 441, 684, 187, 46, 99, 7, 8, 3
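The same distinct-tokenizations check as above (again only a sketch) makes the contrast explicit: the slow tokenizer should yield many different encodings for the same input, while the fast one should always return the single encoding shown above.

# compare how many distinct encodings each tokenizer produces for the same input
for tok in (tokenizer_deberta, tokenizer_deberta_fast):
    unique_encodings = {tuple(tok.encode(str_example_text)) for _ in range(100)}
    print(type(tok).__name__, len(unique_encodings))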
Does anyone know whether this is intentional? Here is the snippet from the documentation that makes me think it should work:
- sp_model_kwargs (dict, optional): Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:
  - enable_sampling: Enable subword regularization.
  - nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.
    - nbest_size = {0,1}: No sampling is performed.
    - nbest_size > 1: samples from the nbest_size results.
    - nbest_size < 0: assuming that nbest_size is infinite and samples from the all hypothesis (lattice) using forward-filtering-and-backward-sampling algorithm.
  - alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.
Thank you all for your help.