Tokenizer decoding using BERT, RoBERTa, XLNet, GPT2

facehugger2020 · September 14, 2020, 9:00pm

I’ve been using BERT and am fairly familiar with it at this point. I’m now trying out RoBERTa, XLNet, and GPT2. When I try to do basic tokenizer encoding and decoding, I’m getting unexpected output.

Here is an example of using BERT for tokenization and decoding:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))

The output is expected:

[101, 1996, 3791, 1997, 1996, 2116, 102, 2041, 27204, 2232, 1996, 3791, 1997, 1996, 2261, 102]
[CLS] the needs of the many [SEP] outweigh the needs of the few [SEP]
['[CLS]', 'the', 'needs', 'of', 'the', 'many', '[SEP]', 'out', '##weig', '##h', 'the', 'needs', 'of', 'the', 'few', '[SEP]']

I understand the special tokens like [CLS] and the wordpiece tokens like ##weig.

However, when I try other models, I get crazy output.

RoBERTa

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))

Output:

[0, 627, 782, 9, 5, 171, 2, 2, 995, 1694, 8774, 5, 782, 9, 5, 367, 2]
<s>the needs of the many</s></s>outweigh the needs of the few</s>
['<s>', 'the', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġmany', '</s>', '</s>', 'out', 'we', 'igh', 'Ġthe', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġfew', '</s>']

Where are those Ġ characters coming from?

XLNet

tokenizer = AutoTokenizer.from_pretrained('xlnet-base-cased')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))

Output:

/usr/local/lib/python3.6/dist-packages/transformers/configuration_xlnet.py:211: FutureWarning: This config doesn't use attention memories, a core feature of XLNet. Consider setting `men_len` to a non-zero value, for example `xlnet = XLNetLMHeadModel.from_pretrained('xlnet-base-cased'', mem_len=1024)`, for accurate training performance as well as an order of magnitude faster inference. Starting from version 3.5.0, the default parameter will be 1024, following the implementation in https://arxiv.org/abs/1906.08237
  FutureWarning,
[18, 794, 20, 18, 142, 4, 23837, 18, 794, 20, 18, 274, 4, 3]
the needs of the many<sep> outweigh the needs of the few<sep><cls>
['▁the', '▁needs', '▁of', '▁the', '▁many', '<sep>', '▁outweigh', '▁the', '▁needs', '▁of', '▁the', '▁few', '<sep>', '<cls>']

Why are there underscore ▁ characters?

GPT2

tokenizer = AutoTokenizer.from_pretrained('gpt2')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(input_ids)
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))

Output:

[1169, 2476, 286, 262, 867, 448, 732, 394, 262, 2476, 286, 262, 1178]
the needs of the manyoutweigh the needs of the few
['the', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġmany', 'out', 'we', 'igh', 'Ġthe', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġfew']

Again, where are those Ġ characters coming from?

I understand there are different subword tokenization schemes used by each. I also have the original research papers. Can someone please explain how the Huggingface Transformers implementation is producing these different outputs?

sgugger · September 15, 2020, 12:14pm

Each tokenizer has its own way of representing pieces of the same word, because the models expect them in different ways. For instance, BERT expects WordPiece tokenization because it was pretrained that way. GPT-2 and RoBERTa expect byte-level BPE (again, because they were pretrained that way), which produces tokens with those Ġ characters (which basically represent a space), whereas XLNet expects SentencePiece-tokenized text, which uses ▁ to represent the space character.

You can find a high-level summary of the tokenizers and which one is used for each model in this doc page.
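To make the Ġ and ▁ markers concrete, here is a minimal pure-Python sketch of how each marker maps back to plain text. It does not call transformers; the byte table is reimplemented along the lines of GPT-2's published byte-to-unicode mapping, and the helper names (decode_byte_level, decode_sentencepiece, decode_wordpiece) are made up for illustration. Special tokens such as [CLS] or <s> are not handled.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Every other byte (space, control characters, ...) is shifted past 255.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def decode_byte_level(tokens):
    """Undo the byte-to-character mapping seen in GPT-2 / RoBERTa tokens."""
    data = bytes(CHAR_TO_BYTE[c] for c in "".join(tokens))
    return data.decode("utf-8", errors="replace")


def decode_sentencepiece(tokens):
    """SentencePiece marks a preceding space with U+2581 (the ▁ character)."""
    return "".join(tokens).replace("\u2581", " ").lstrip()


def decode_wordpiece(tokens):
    """WordPiece marks word continuations with a leading ##."""
    return " ".join(tokens).replace(" ##", "")


# The space byte (0x20) is the 33rd non-printable byte, so it lands on U+0120.
print(repr(BYTE_TO_CHAR[ord(" ")]))
print(decode_byte_level(["the", "Ġneeds", "Ġof", "Ġthe", "Ġmany"]))
print(decode_sentencepiece(["▁outweigh", "▁the", "▁needs"]))
print(decode_wordpiece(["out", "##weig", "##h", "the", "needs"]))
```

In practice tokenizer.decode() or tokenizer.convert_tokens_to_string() does this for you; the sketch is only meant to show that Ġ and ▁ are stand-ins for a space, not extra characters in the text.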

facehugger2020 · September 15, 2020, 9:15pm

Thank you for the pointer to the docs.

facehugger2020 · September 20, 2020, 4:37am

@sgugger Is there a bug with the GPT2 Tokenizer when encoding sentence pairs?

tokenizer = AutoTokenizer.from_pretrained('gpt2')
result = tokenizer(text='the needs of the many', text_pair='outweigh the needs of the few')
input_ids = result['input_ids']
print(tokenizer.decode(input_ids))
print(tokenizer.convert_ids_to_tokens(input_ids))

Output:

the needs of the manyoutweigh the needs of the few
['the', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġmany', 'out', 'we', 'igh', 'Ġthe', 'Ġneeds', 'Ġof', 'Ġthe', 'Ġfew']

There should be a separator token between the two sentences. For GPT, the token should be $. It’s stated in the original GPT paper, section 3.3:

Textual entailment For entailment tasks, we concatenate the premise p and hypothesis h token sequences, with a delimiter token ($) in between.

The other tokenizers, e.g. BertTokenizer and XLNetTokenizer, properly introduce separation tokens between sentence pairs.

sgugger · September 20, 2020, 1:29pm

Not familiar with the GPT tokenizer myself so tagging @anthony who might have more insight.

thomwolf · September 20, 2020, 9:41pm

Hi, the $ in the original paper is a symbol standing for an additional token that you need to create and add when fine-tuning the model. Unlike BERT and more recent models, GPT/GPT2 don’t have a native separator token that you can use to separate pairs of sentences.

You have to manually add a token to the vocabulary (for instance with tokenizer.add_tokens() or tokenizer.add_special_tokens() in transformers) and then train the model on a dataset of sentence pairs so that it learns an embedding for this new token. You also need to concatenate the sentences yourself with the added token or the pattern you have selected to concatenate pairs of sentences.
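A rough sketch of those steps, under some assumptions: the delimiter string "<|delim|>" and the helper names are hypothetical, the join pattern (no spaces around the delimiter) is just one possible choice, and the fine-tuning loop itself is left out. The transformers calls used are add_special_tokens() and resize_token_embeddings().

```python
DELIM = "<|delim|>"  # hypothetical delimiter string; pick any string not already in use


def join_pair(premise, hypothesis, delim=DELIM):
    """Concatenate a sentence pair with an explicit delimiter (one possible pattern)."""
    return f"{premise}{delim}{hypothesis}"


def load_gpt2_with_delimiter(delim=DELIM):
    """Load GPT-2, register the delimiter, and grow the embedding matrix to match."""
    # Imported lazily so join_pair() can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    # Register the delimiter so it is kept as a single token instead of being split.
    tokenizer.add_special_tokens({"additional_special_tokens": [delim]})
    # The new embedding row is freshly initialized; it only becomes meaningful
    # after fine-tuning on sentence-pair data.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model


def encode_pair(tokenizer, premise, hypothesis, delim=DELIM):
    """Encode a pair by joining it manually, since GPT-2 adds no separator itself."""
    return tokenizer(join_pair(premise, hypothesis, delim))["input_ids"]


# Usage (downloads the gpt2 weights on first run):
# tokenizer, model = load_gpt2_with_delimiter()
# ids = encode_pair(tokenizer, "the needs of the many", "outweigh the needs of the few")
```

Until the model has been fine-tuned, the delimiter's embedding carries no learned meaning, so outputs that depend on it should not be trusted.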

facehugger2020 · September 20, 2020, 10:30pm

You also need to concatenate the sentences yourself with the added token or the pattern you have selected to concatenate pairs of sentences.

Aren’t these steps supposed to be part of the tokenizer(text='...', text_pair='...') function call? The other tokenizers (BERT, XLNet, etc.) will concatenate the sentence pair and add the separator symbol.

thomwolf · September 21, 2020, 10:30am

We only include these steps in the tokenizer when the model was pretrained with them.

For GPT2 you need to add a new token to the model and fine-tune it so we don’t do it in the tokenizer.
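For reference, the pair layouts visible in the outputs earlier in this thread can be written down as plain templates. This is only an illustrative sketch: the function names are made up, and the special-token IDs are read off the printed outputs above rather than taken from the library (in transformers, the corresponding method is build_inputs_with_special_tokens()).

```python
def bert_pair(a, b, cls_id=101, sep_id=102):
    """[CLS] A [SEP] B [SEP]"""
    return [cls_id] + a + [sep_id] + b + [sep_id]


def roberta_pair(a, b, bos_id=0, eos_id=2):
    """<s> A </s></s> B </s>"""
    return [bos_id] + a + [eos_id, eos_id] + b + [eos_id]


def xlnet_pair(a, b, sep_id=4, cls_id=3):
    """A <sep> B <sep> <cls>"""
    return a + [sep_id] + b + [sep_id, cls_id]


def gpt2_pair(a, b):
    """A B, with nothing in between: GPT-2 was not pretrained with a separator."""
    return a + b


# Token IDs for the two sentences, copied from the BERT output above.
first = [1996, 3791, 1997, 1996, 2116]
second = [2041, 27204, 2232, 1996, 3791, 1997, 1996, 2261]
print(bert_pair(first, second))
```

The first three templates exist because those models saw that exact layout during pretraining; GPT-2 has no such layout, which is why the two sentences are simply run together.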
