Hello everyone,
I have a naive question about tokenizers, particularly the GPT2 tokenizer. I encoded a text sentence and obtained the token id 29826, which in the GPT2Tokenizer vocabulary corresponds to the Unicode sequence “\u00e6\u0143”.
For some reason, I needed to convert 29826 back to its token, i.e., into text, so I used the following code snippet:
from transformers import GPT2Tokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F
tokenizer = GPT2Tokenizer.from_pretrained( "EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained( "EleutherAI/gpt-neo-125M")
tokenizer.pad_token_id = tokenizer.eos_token_id
I found myself comparing the logits distribution for token_id 29826 using the code below:
# Version 1. No need to encode because we already have access to token_id
token_ids_1 = torch.tensor([[29826]])
logits_1 = model(token_ids_1).logits.squeeze().detach().numpy()
# Version 2. Need to get input_ids
token = tokenizer.decode([29826])
token_ids_2 = tokenizer(token, return_tensors="pt", add_special_tokens=False).input_ids
logits_2 = model(token_ids_2).logits.squeeze().detach().numpy()
## Visualize the logits distribution
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(logits_1, label="Logits 1")
sns.histplot(logits_2, label="Logits 2")
plt.legend()
Surprisingly, the logit distributions were different (see attached picture). It appears that the tokenizer.decode conversion step actually creates the string ‘�’, which is then encoded to the index 4210 (a different Unicode sequence, "\u00ef\u00bf\u00bd").
In fact, both expressions tokenizer.decode([4210]) and tokenizer.decode([29826]) are decoded to the same ‘�’ character instead of their own unique strings. I was hoping that each would decode to its actual Unicode string.
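To check where the ‘�’ might come from, here is a minimal sketch that does not need the model or the tokenizer. It assumes that the vocabulary strings use the byte-to-character table published with the original GPT-2 byte-level BPE (reimplemented below as `bytes_to_unicode`, which is my own copy, not the library's code) and that decoding falls back to `errors="replace"` for invalid UTF-8. Under those assumptions, “\u00e6\u0143” stands for two raw bytes that are only the beginning of a three-byte UTF-8 character, so decoding them alone can only produce the replacement character.

```python
def bytes_to_unicode():
    """Reimplementation (assumption) of the GPT-2 byte -> printable character table."""
    # Bytes that are kept as themselves: '!'..'~', '¡'..'¬', '®'..'ÿ'
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes are shifted to code points 256, 257, ...
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

# Vocabulary string reported above for token id 29826
vocab_entry = "\u00e6\u0143"
raw = bytes(byte_decoder[ch] for ch in vocab_entry)
print(raw)  # raw bytes behind the vocabulary entry

# An incomplete UTF-8 sequence decodes to U+FFFD when errors are replaced
decoded = raw.decode("utf-8", errors="replace")
print(repr(decoded))

# Re-encoding U+FFFD gives three different bytes, hence a different vocabulary string
reencoded = "".join(byte_encoder[b] for b in "\ufffd".encode("utf-8"))
print(reencoded.encode("unicode_escape"))
```

If these assumptions hold, the round trip id → string → id is lossy for any token whose bytes do not form complete UTF-8 on their own, which would explain why 29826 comes back as 4210.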
Is there any way I can deal with this? Is this expected? I’ve tried tweaking the string to decode to its actual unicode string but failed.
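One possible direction, sketched here with plain Python only: keep working with token ids (or the raw bytes behind them) and turn them into text only once a complete UTF-8 sequence is available. The two byte chunks below are hypothetical and chosen for illustration; the first is an incomplete prefix and the second supplies the missing continuation byte. This is not what the tokenizer does internally, just a sketch of the buffering idea using the standard library's incremental decoder.

```python
import codecs

# Hypothetical byte chunks, e.g. the bytes behind two consecutive tokens.
# The first chunk is an incomplete UTF-8 prefix, the second completes it.
chunks = [b"\xe6\xad", b"\xa3"]

# The incremental decoder buffers incomplete sequences instead of replacing them
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in chunks]

# final=True flushes the decoder; it would raise if bytes were still pending
text = "".join(pieces) + decoder.decode(b"", final=True)
print(pieces, text)
```

The first chunk yields an empty string rather than ‘�’, and the full character only appears once the last byte arrives.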
Environment:
- transformers version: 4.26.1
- tokenizers version: 0.13.2
- Platform: Linux-5.4.0-113-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.12.0
- PyTorch version (GPU?): 1.12.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: False
- Using distributed or parallel set-up in script?: False
I executed the code snippets on a Jupyter Notebook (jupyterlab 3.6.1).