From the Hugging Face docs:

> Constructs a DeBERTa tokenizer, which runs end-to-end tokenization: punctuation splitting + wordpiece
So the answer seems to be yes. However, when I checked the results produced by other models' tokenizers, they were confusing. I checked four models in total: deberta, bert, roberta, and albert. Their tokenizations of the string "Hugging face is of great help to implement models!" are shown below.
```python
from transformers import AutoTokenizer

tokenizer1 = AutoTokenizer.from_pretrained("microsoft/deberta-base")
tokenizer2 = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer3 = AutoTokenizer.from_pretrained("roberta-base")
tokenizer4 = AutoTokenizer.from_pretrained("albert-xxlarge-v2")

test_str = "Hugging face is of great help to implement models!"
print(tokenizer1.tokenize(test_str))
print(tokenizer2.tokenize(test_str))
print(tokenizer3.tokenize(test_str))
print(tokenizer4.tokenize(test_str))
```
```
['Hug', 'ging', 'Ġface', 'Ġis', 'Ġof', 'Ġgreat', 'Ġhelp', 'Ġto', 'Ġimplement', 'Ġmodels', '!']
['hugging', 'face', 'is', 'of', 'great', 'help', 'to', 'implement', 'models', '!']
['Hug', 'ging', 'Ġface', 'Ġis', 'Ġof', 'Ġgreat', 'Ġhelp', 'Ġto', 'Ġimplement', 'Ġmodels', '!']
['▁hugging', '▁face', '▁is', '▁of', '▁great', '▁help', '▁to', '▁implement', '▁models', '!']
```
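To make the markers in these outputs concrete, here is a toy sketch (my own simplification, not the libraries' actual detokenizers) of how the two conventions reassemble tokens: WordPiece marks word continuations with a `##` prefix, while byte-level BPE marks a leading space with `Ġ`.

```python
# Toy illustration only -- not the real detokenization code.

def join_wordpiece(tokens):
    # WordPiece (BERT-style): '##' marks a continuation of the previous word.
    out = ""
    for tok in tokens:
        if tok.startswith("##"):
            out += tok[2:]
        else:
            out += (" " if out else "") + tok
    return out

def join_byte_bpe(tokens):
    # Byte-level BPE (GPT-2/RoBERTa-style): 'Ġ' encodes a leading space.
    return "".join(tokens).replace("Ġ", " ")

print(join_byte_bpe(["Hug", "ging", "Ġface"]))    # -> 'Hugging face'
print(join_wordpiece(["hug", "##ging", "face"]))  # -> 'hugging face'
```

So the `Ġ` tokens from deberta and roberta are characteristic of byte-level BPE, while bert's marker-free word pieces are characteristic of WordPiece.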
As you can see, the result for deberta is identical to the one for roberta, which, as described in the Hugging Face docs, is derived from the GPT-2 tokenizer and uses byte-level Byte-Pair Encoding. So does deberta use byte-level Byte-Pair Encoding rather than WordPiece like bert (the deberta and bert results differ from each other)? Besides, I also checked the code of the deberta tokenizer and found that its `_tokenize()` method is derived from the GPT-2 tokenizer as well.
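The `Ġ` character itself comes from GPT-2's byte-to-unicode table, which the deberta tokenizer inherits. Below is a simplified re-implementation (a sketch from memory of the `bytes_to_unicode` helper in the GPT-2 tokenizer, not the exact library code): printable bytes keep their own character, and every remaining byte, including the space byte `0x20`, is shifted into an unused Unicode range so that all 256 bytes stay visible in token strings.

```python
# Simplified sketch of GPT-2's byte-to-unicode mapping (byte-level BPE).
def bytes_to_unicode():
    # Bytes whose characters are kept as-is: printable ASCII plus two
    # Latin-1 ranges.
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("\u00a1"), ord("\u00ac") + 1))
            + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    mapping = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes are remapped to code points 256 and up.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

enc = bytes_to_unicode()
print(enc[ord(" ")])  # the space byte 0x20 maps to chr(256 + 32) = 'Ġ'
```

This is why a space-prefixed word like " face" shows up as `Ġface` in the deberta and roberta outputs above.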
So, is there a problem with the docs, or could anyone help explain this? Thank you for your reply!