From the Hugging Face docs:

> Constructs a DeBERTa tokenizer, which runs end-to-end tokenization: punctuation splitting + wordpiece
So the answer seems to be yes. However, when I checked the results produced by other models' tokenizers, they were confusing. I checked four models in total: deberta, bert, roberta, and albert. Their tokenizations of the string "Hugging face is of great help to implement models!" are shown below.
```python
from transformers import AutoTokenizer

tokenizer1 = AutoTokenizer.from_pretrained("microsoft/deberta-base")
tokenizer2 = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer3 = AutoTokenizer.from_pretrained("roberta-base")
tokenizer4 = AutoTokenizer.from_pretrained("albert-xxlarge-v2")

test_str = "Hugging face is of great help to implement models!"
print(tokenizer1.tokenize(test_str))
print(tokenizer2.tokenize(test_str))
print(tokenizer3.tokenize(test_str))
print(tokenizer4.tokenize(test_str))
```
```
['Hug', 'ging', 'Ġface', 'Ġis', 'Ġof', 'Ġgreat', 'Ġhelp', 'Ġto', 'Ġimplement', 'Ġmodels', '!']
['hugging', 'face', 'is', 'of', 'great', 'help', 'to', 'implement', 'models', '!']
['Hug', 'ging', 'Ġface', 'Ġis', 'Ġof', 'Ġgreat', 'Ġhelp', 'Ġto', 'Ġimplement', 'Ġmodels', '!']
['▁hugging', '▁face', '▁is', '▁of', '▁great', '▁help', '▁to', '▁implement', '▁models', '!']
```
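To make the markers in these outputs concrete, here is a toy sketch (my own simplification, not the libraries' actual detokenizers) of how the two conventions reassemble tokens: WordPiece marks word continuations with a `##` prefix, while byte-level BPE marks a leading space with `Ġ`.

```python
# Toy illustration only -- not the real detokenization code.

def join_wordpiece(tokens):
    # WordPiece (BERT-style): '##' marks a continuation of the previous word.
    out = ""
    for tok in tokens:
        if tok.startswith("##"):
            out += tok[2:]
        else:
            out += (" " if out else "") + tok
    return out

def join_byte_bpe(tokens):
    # Byte-level BPE (GPT-2/RoBERTa-style): 'Ġ' encodes a leading space.
    return "".join(tokens).replace("Ġ", " ")

print(join_byte_bpe(["Hug", "ging", "Ġface"]))    # -> 'Hugging face'
print(join_wordpiece(["hug", "##ging", "face"]))  # -> 'hugging face'
```

So the `Ġ` tokens from deberta and roberta are characteristic of byte-level BPE, while bert's marker-free word pieces are characteristic of WordPiece.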
As you can see, the result for deberta is identical to the one for roberta, which, as described in the Hugging Face docs, is derived from the GPT-2 tokenizer and uses byte-level Byte-Pair Encoding. So does deberta use byte-level Byte-Pair Encoding rather than WordPiece like bert (the deberta and bert results differ from each other)? Besides, I also checked the code of the deberta tokenizer and found that its `_tokenize()` method is derived from the GPT-2 tokenizer as well.
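The `Ġ` character itself comes from GPT-2's byte-to-unicode table, which the deberta tokenizer inherits. Below is a simplified re-implementation (a sketch from memory of the `bytes_to_unicode` helper in the GPT-2 tokenizer, not the exact library code): printable bytes keep their own character, and every remaining byte, including the space byte `0x20`, is shifted into an unused Unicode range so that all 256 bytes stay visible in token strings.

```python
# Simplified sketch of GPT-2's byte-to-unicode mapping (byte-level BPE).
def bytes_to_unicode():
    # Bytes whose characters are kept as-is: printable ASCII plus two
    # Latin-1 ranges.
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("\u00a1"), ord("\u00ac") + 1))
            + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    mapping = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes are remapped to code points 256 and up.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

enc = bytes_to_unicode()
print(enc[ord(" ")])  # the space byte 0x20 maps to chr(256 + 32) = 'Ġ'
```

This is why a space-prefixed word like " face" shows up as `Ġface` in the deberta and roberta outputs above.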
So, is there a problem with the docs, or could anyone help explain this? Thank you for your reply!