Tokenizer decoding using BERT, RoBERTa, XLNet, GPT2

I’ve been using 🤗 BERT and am fairly familiar with it at this point. I’m now trying out RoBERTa, XLNet, and GPT2. When I try to do basic tokenizer encoding and decoding, I’m getting unexpected output.

Here is an example of using BERT for tokenization and decoding:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))

The output is expected:

[101, 1996, 3791, 1997, 1996, 2116, 102, 2041, 27204, 2232, 1996, 3791, 1997, 1996, 2261, 102]
[CLS] the needs of the many [SEP] outweigh the needs of the few [SEP]
['[CLS]', 'the', 'needs', 'of', 'the', 'many', '[SEP]', 'out', '##weig', '##h', 'the', 'needs', 'of', 'the', 'few', '[SEP]']

I understand the special tokens like [CLS] and the wordpiece tokens like ##weig.

However, when I try other models, I get crazy output.

RoBERTa

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))

Output:

[0, 627, 782, 9, 5, 171, 2, 2, 995, 1694, 8774, 5, 782, 9, 5, 367, 2]
<s>the needs of the many</s></s>outweigh the needs of the few</s>
['<s>', 'the', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġmany', '</s>', '</s>', 'out', 'we', 'igh', 'Ġthe', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġfew', '</s>']

Where are those Ġ characters coming from?

XLNet

tokenizer = AutoTokenizer.from_pretrained('xlnet-base-cased')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))

Output:

/usr/local/lib/python3.6/dist-packages/transformers/configuration_xlnet.py:211: FutureWarning: This config doesn't use attention memories, a core feature of XLNet. Consider setting `men_len` to a non-zero value, for example `xlnet = XLNetLMHeadModel.from_pretrained('xlnet-base-cased'', mem_len=1024)`, for accurate training performance as well as an order of magnitude faster inference. Starting from version 3.5.0, the default parameter will be 1024, following the implementation in https://arxiv.org/abs/1906.08237
  FutureWarning,
[18, 794, 20, 18, 142, 4, 23837, 18, 794, 20, 18, 274, 4, 3]
the needs of the many<sep> outweigh the needs of the few<sep><cls>
['▁the', '▁needs', '▁of', '▁the', '▁many', '<sep>', '▁outweigh', '▁the', '▁needs', '▁of', '▁the', '▁few', '<sep>', '<cls>']

Why are there underscore characters?

GPT2

tokenizer = AutoTokenizer.from_pretrained('gpt2')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))

Output:

[1169, 2476, 286, 262, 867, 448, 732, 394, 262, 2476, 286, 262, 1178]
the needs of the manyoutweigh the needs of the few
['the', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġmany', 'out', 'we', 'igh', 'Ġthe', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġfew']

Again, where are those Ġ characters coming from?

I understand there are different subword tokenization schemes used by each. I also have the original research papers. Can someone please explain how the Huggingface Transformers implementation is producing these different outputs?

Each tokenizer has its own way of representing pieces of the same word, because each model expects them in a different form. For instance, BERT expects WordPiece tokenization because it was pretrained that way. GPT-2 and RoBERTa expect byte-level BPE (again, because they were pretrained that way), which produces those Ġ characters (which basically represent a space), whereas XLNet expects SentencePiece-tokenized text, which uses ▁ (U+2581, not an ASCII underscore) to represent the space character.
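To make the Ġ and ▁ markers less mysterious, here is a minimal sketch that does not need transformers at all. The byte-to-character table follows the scheme published with the GPT-2 reference code; the two helper functions (byte_level_tokens_to_text and sentencepiece_tokens_to_text) are hypothetical names I made up for illustration, and they only approximate what the library's decode() does (they ignore special tokens, for example).

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character,
    following the scheme published with the GPT-2 reference code."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes (space, control characters, ...) are shifted
            # to code points starting at 256, so they become visible.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def byte_level_tokens_to_text(tokens):
    """Hypothetical helper: undo the byte-to-character mapping."""
    joined = "".join(tokens)
    return bytes(CHAR_TO_BYTE[ch] for ch in joined).decode("utf-8")


def sentencepiece_tokens_to_text(tokens):
    """Hypothetical helper: turn the U+2581 marker back into spaces."""
    return "".join(tokens).replace("\u2581", " ").lstrip(" ")


# The space byte (32) is the one that shows up as 'Ġ' (U+0120).
print(repr(BYTE_TO_CHAR[ord(" ")]))
print(byte_level_tokens_to_text(["the", "Ġneeds", "Ġof", "Ġthe", "Ġmany"]))
print(sentencepiece_tokens_to_text(["▁the", "▁needs", "▁of", "▁the", "▁many"]))
```

So the Ġ in the RoBERTa and GPT2 token lists is just the space that precedes the word, stored as part of the token, and the ▁ in the XLNet list plays the same role.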

You can find a high-level summary of the tokenizers and which one is used for each model in this doc page.

Thank you for the pointer to the docs.

@sgugger Is there a bug with the GPT2 Tokenizer when encoding sentence pairs?

tokenizer = AutoTokenizer.from_pretrained('gpt2')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))

Output:

the needs of the manyoutweigh the needs of the few
['the', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġmany', 'out', 'we', 'igh', 'Ġthe', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġfew']

There should be a separator token between the two sentences. For GPT, the token should be $. It’s stated in the original GPT paper, section 3.3:

Textual entailment: For entailment tasks, we concatenate the premise p and hypothesis h token sequences, with a delimiter token ($) in between.

The other tokenizers, e.g. BertTokenizer and XLNetTokenizer, properly introduce separation tokens between sentence pairs.

Not familiar with the GPT tokenizer myself so tagging @anthony who might have more insight.

Hi, the $ in the original paper is a symbol representing an additional token that you need to create and add when fine-tuning the model. Unlike BERT and more recent models, GPT/GPT2 don’t have a native separator token that you can use to separate pairs of sentences.

You have to manually add a token to the vocabulary (you can use tokenizer.add_tokens() or tokenizer.add_special_tokens() in transformers, for instance) and then train the model on a dataset of sentence pairs so that it learns an embedding for this new token. You also need to concatenate the sentences yourself, using the added token or whatever pattern you have chosen for joining sentence pairs.
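A rough sketch of those steps, in case it helps. The separator string "<|sep|>", the join_pair helper, and the load_gpt2_with_separator function are all assumptions made up for this example, not anything GPT2 ships with. The transformers import sits inside the function so the string-joining part runs without downloading anything.

```python
def join_pair(premise, hypothesis, sep_token):
    """Concatenate a sentence pair by hand around the separator.
    The spaces around the separator are an arbitrary choice."""
    return f"{premise} {sep_token} {hypothesis}"


def load_gpt2_with_separator(sep_token="<|sep|>"):
    """Load GPT-2, register a new separator token, and grow the embedding
    matrix so the new token id has a row to be trained."""
    # Imported lazily: needs the transformers package and a model download.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Returns the number of tokens actually added to the vocabulary.
    num_added = tokenizer.add_special_tokens({"sep_token": sep_token})
    if num_added > 0:
        # The new embedding row is freshly initialized, so it carries no
        # meaning until the model is fine-tuned on sentence pairs.
        model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model


text = join_pair(
    "the needs of the many", "outweigh the needs of the few", "<|sep|>"
)
print(text)

# With transformers installed, the joined string is encoded as one sequence:
# tokenizer, model = load_gpt2_with_separator()
# input_ids = tokenizer(text)["input_ids"]
```

Note that the joined string goes in as a single text argument; passing text_pair to the GPT2 tokenizer would still just concatenate the two sentences without the separator.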

You also need to concatenate the sentences yourself, using the added token or whatever pattern you have chosen for joining sentence pairs.

Aren’t these steps supposed to be part of the tokenizer(text='...', text_pair='...') function call? The other tokenizers (BERT, XLNet, etc.) will concatenate the sentence pair and add the separator symbol.

We only include these steps in the tokenizer when the model was pretrained with the corresponding separator tokens.

For GPT2, you need to add a new token to the model and fine-tune it, so we don’t do it in the tokenizer.