I’ve been using BERT and am fairly familiar with it at this point. I’m now trying out RoBERTa, XLNet, and GPT2. When I try to do basic tokenizer encoding and decoding, I’m getting unexpected output.
Here is an example of using BERT for tokenization and decoding:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Encode a sentence pair, then inspect the ids, the decoded string, and the raw tokens
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))
The output is as expected:
[101, 1996, 3791, 1997, 1996, 2116, 102, 2041, 27204, 2232, 1996, 3791, 1997, 1996, 2261, 102]
[CLS] the needs of the many [SEP] outweigh the needs of the few [SEP]
['[CLS]', 'the', 'needs', 'of', 'the', 'many', '[SEP]', 'out', '##weig', '##h', 'the', 'needs', 'of', 'the', 'few', '[SEP]']
I understand the special tokens like [CLS] and the WordPiece tokens like ##weig.
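To check my understanding of the ## continuation pieces, here is a small round-trip I would sketch (the expected values in the comments are only what I anticipate based on the output above, not verified output):
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# WordPiece: '##' marks a continuation of the previous token, so joining
# the pieces back together should reconstruct the original word.
print(tokenizer.tokenize('outweigh'))                                # expecting ['out', '##weig', '##h']
print(tokenizer.convert_tokens_to_string(['out', '##weig', '##h']))  # expecting 'outweigh'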
However, when I try other models, I get crazy output.
RoBERTa
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))
Output:
[0, 627, 782, 9, 5, 171, 2, 2, 995, 1694, 8774, 5, 782, 9, 5, 367, 2]
<s>the needs of the many</s></s>outweigh the needs of the few</s>
['<s>', 'the', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġmany', '</s>', '</s>', 'out', 'we', 'igh', 'Ġthe', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġfew', '</s>']
Where are those Ġ characters coming from?
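My guess is that Ġ has something to do with how spaces are encoded, so here is a small check I would sketch (again, the comments are only what I would expect, not verified output):
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
# If Ġ just marks a leading space, these two calls should mirror each other.
print(tokenizer.tokenize(' needs'))                    # expecting ['Ġneeds']
print(tokenizer.convert_tokens_to_string(['Ġneeds']))  # expecting ' needs' (with a leading space)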
XLNet
tokenizer = AutoTokenizer.from_pretrained('xlnet-base-cased')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))
Output:
/usr/local/lib/python3.6/dist-packages/transformers/configuration_xlnet.py:211: FutureWarning: This config doesn't use attention memories, a core feature of XLNet. Consider setting `mem_len` to a non-zero value, for example `xlnet = XLNetLMHeadModel.from_pretrained('xlnet-base-cased', mem_len=1024)`, for accurate training performance as well as an order of magnitude faster inference. Starting from version 3.5.0, the default parameter will be 1024, following the implementation in https://arxiv.org/abs/1906.08237
FutureWarning,
[18, 794, 20, 18, 142, 4, 23837, 18, 794, 20, 18, 274, 4, 3]
the needs of the many<sep> outweigh the needs of the few<sep><cls>
['▁the', '▁needs', '▁of', '▁the', '▁many', '<sep>', '▁outweigh', '▁the', '▁needs', '▁of', '▁the', '▁few', '<sep>', '<cls>']
Why are there those underscore-like ▁ characters?
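A similar sketch for XLNet, assuming its SentencePiece tokenizer treats ▁ symmetrically (the expected values in the comments are guesses, not verified output):
tokenizer = AutoTokenizer.from_pretrained('xlnet-base-cased')
# If ▁ marks the space before a word, tokenizing and re-joining should round-trip.
print(tokenizer.tokenize('the needs'))                         # expecting ['▁the', '▁needs']
print(tokenizer.convert_tokens_to_string(['▁the', '▁needs']))  # expecting 'the needs'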
GPT2
tokenizer = AutoTokenizer.from_pretrained('gpt2')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))
Output:
[1169, 2476, 286, 262, 867, 448, 732, 394, 262, 2476, 286, 262, 1178]
the needs of the manyoutweigh the needs of the few
['the', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġmany', 'out', 'we', 'igh', 'Ġthe', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġfew']
Again, where are those Ġ characters coming from?
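I also notice that the two segments run together in the decoded string ("manyoutweigh"), so here is a check I would sketch to see whether GPT2 simply has no separator tokens configured (the expected None values are my assumption):
tokenizer = AutoTokenizer.from_pretrained('gpt2')
# If no sep/cls tokens are defined, that would explain why text and text_pair
# are concatenated with nothing in between.
print(tokenizer.sep_token)  # expecting None
print(tokenizer.cls_token)  # expecting None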
I understand that each of these models uses a different subword tokenization scheme, and I have the original research papers. Can someone please explain how the Hugging Face Transformers implementation produces these different outputs?
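To make the comparison concrete, this is the kind of side-by-side summary I am trying to make sense of (the tokens in the comments are copied from my outputs above; only the loop itself is new and unverified):
from transformers import AutoTokenizer
for name in ['bert-base-uncased', 'roberta-base', 'xlnet-base-cased', 'gpt2']:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # From the outputs above I would expect roughly:
    #   bert-base-uncased : ['out', '##weig', '##h']
    #   roberta-base      : ['out', 'we', 'igh']
    #   xlnet-base-cased  : ['▁outweigh']
    #   gpt2              : ['out', 'we', 'igh']
    print(name, tokenizer.tokenize('outweigh'))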