GPT2Tokenizer.decode maps unicode sequences to the same string '�'

Hello everyone,

I have a naive question about tokenizers, particularly the GPT2 tokenizer. I encoded a text sentence and obtained the token ID 29826, which in the GPT2Tokenizer vocabulary corresponds to the Unicode sequence “\u00e6\u0143”.

For some reason, I needed to convert 29826 back to its token, i.e., into text, so I used the following code snippet:

from transformers import GPT2Tokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F

tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token_id = tokenizer.eos_token_id

I found myself comparing the logits distribution for token_id 29826 using the code below:

# Version 1. No need to encode because we already have access to token_id
token_ids_1 = torch.tensor([[29826]])
logits_1 = model.forward(token_ids_1).logits.squeeze().detach().numpy()

# Version 2. Need to get input_ids
token = tokenizer.decode([29826])
token_ids_2 = tokenizer(token, return_tensors="pt", add_special_tokens=False).input_ids
logits_2 = model.forward(token_ids_2).logits.squeeze().detach().numpy()

## Visualize the logits distribution
import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(logits_1, label="Logits 1")
sns.histplot(logits_2, label="Logits 2")
plt.legend()

Surprisingly, the logit distributions were different (see attached picture). It appears that the tokenizer.decode conversion step actually produces the string ‘�’, which is then encoded to index 4210 (a different Unicode sequence, “\u00ef\u00bf\u00bd”).

[image: overlaid histograms of the two logit distributions]

In fact, both tokenizer.decode([4210]) and tokenizer.decode([29826]) return the same ‘�’ character, whereas I was hoping each would decode to its own unique Unicode string.
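
To make the collision concrete, here is the round trip as I see it (ids from my run of the snippet above):

# id -> text -> id: both ids decode to the replacement character, and
# re-encoding that character does not get back to 29826.
text = tokenizer.decode([29826])                             # '�'
print(text == tokenizer.decode([4210]))                      # True
print(tokenizer(text, add_special_tokens=False).input_ids)   # [4210] in my run, not [29826]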

Is there any way I can deal with this? Is this expected? I’ve tried tweaking the decoding to recover the actual Unicode string, but without success.
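
My current guess at why this happens (please correct me if I’m wrong): GPT-2’s byte-level BPE represents each raw byte with a printable Unicode character, so the vocab string “\u00e6\u0143” stands for the bytes 0xE6 0xAD. On their own these two bytes are only the start of a three-byte UTF-8 sequence, so decoding them as text falls back to the replacement character:

# An incomplete UTF-8 sequence decodes to the replacement character,
# which is what tokenizer.decode([29826]) appears to return.
print(b"\xe6\xad".decode("utf-8", errors="replace"))  # '�'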

Environment:

  • transformers version: 4.26.1
  • tokenizers version: 0.13.2
  • Platform: Linux-5.4.0-113-generic-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.12.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False

I executed the code snippets in a Jupyter notebook (JupyterLab 3.6.1).

I guess the more appropriate way would be to use: tokenizer.convert_ids_to_tokens([4210])
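
Something like this seems to keep the two ids distinguishable, since convert_ids_to_tokens returns the raw vocab strings and convert_tokens_to_ids maps them back:

# convert_ids_to_tokens returns the byte-level vocab strings, so the two ids
# stay distinct and the mapping is reversible.
tok_a = tokenizer.convert_ids_to_tokens(29826)   # 'æŃ', i.e. '\u00e6\u0143'
tok_b = tokenizer.convert_ids_to_tokens(4210)    # a different vocab string
print(tok_a == tok_b)                            # False
print(tokenizer.convert_tokens_to_ids(tok_a))    # 29826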

On the use of tokenizer.convert_ids_to_tokens, I’ve found that the two expressions below lead to two different representations:

ts = tokenizer.encode("Hello!\n I can't do this anymore")
# [15496, 0, 198, 314, 460, 470, 466, 428, 7471]

and

tokenizer.convert_ids_to_tokens(ts)
# ['Hello', '!', 'Ċ', 'ĠI', 'Ġcan', "'t", 'Ġdo', 'Ġthis', 'Ġanymore']
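
Going back the other way, convert_tokens_to_string appears to undo the byte-level mapping and recover the original text:

# 'Ġ' and 'Ċ' are the byte-level stand-ins for ' ' and '\n'; joining the tokens
# through convert_tokens_to_string restores the readable string.
tokens = tokenizer.convert_ids_to_tokens(ts)
print(repr(tokenizer.convert_tokens_to_string(tokens)))
# "Hello!\n I can't do this anymore"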

How should I interpret the difference between the two?

Based on https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475:

In GPT2 and Roberta tokenizers, the space before a word is part of the word, i.e. "Hello how are you puppetter" will be tokenized as ["Hello", "Ġhow", "Ġare", "Ġyou", "Ġpuppet", "ter"]. You can notice the spaces included in the words as a Ġ here. Spaces are converted into a special character (the Ġ) in the tokenizer prior to BPE splitting, mostly to avoid the BPE step digesting spaces, since the standard BPE algorithm used spaces in its process (this can seem a bit hacky but was in the original GPT2 tokenizer implementation by OpenAI).
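
To see this with the example from that thread (gpt-neo-125M uses the same GPT-2 vocabulary, so I’d expect the same splits):

# The leading space becomes part of the following token and shows up as 'Ġ'.
print(tokenizer.tokenize("Hello how are you puppetter"))
# ['Hello', 'Ġhow', 'Ġare', 'Ġyou', 'Ġpuppet', 'ter'] per the linked post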