Tokenizer not recognising words in vocabulary

I have an issue where a tokenizer doesn’t recognise tokens in its own vocabulary. A minimal example is:

from transformers import AutoTokenizer
model_checkpoint = 'DeepChem/ChemBERTa-77M-MTR'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
test_smiles = 'CCC1=[O+]'
print(tokenizer.vocab['[O+]'])
print(tokenizer.tokenize(test_smiles))

This outputs:

73
['C', 'C', 'C', '1', '=', 'O']

Notice that the '[O+]' expression is encoded simply as 'O', even though it is in the vocabulary. This loses important information.
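To see how widespread the problem is, one option is to test every vocabulary entry against the tokenizer and list the ones that do not come back as a single token. The sketch below is only an illustration: `find_split_tokens` is a hypothetical helper, and `toy_vocab` and `toy_tokenize` are stand-ins that merely imitate the behaviour described above, not the real ChemBERTa tokenizer.

```python
def find_split_tokens(vocab, tokenize):
    """Return vocabulary entries that `tokenize` does not emit as one token.

    `vocab` is any iterable of token strings (a dict of token -> id works);
    `tokenize` is any callable mapping a string to a list of token strings.
    """
    return sorted(tok for tok in vocab if tokenize(tok) != [tok])


# Stand-in data for demonstration only (assumption, not the real tokenizer):
# the toy tokenizer drops brackets and charges, like the output shown above.
toy_vocab = {"C": 16, "O": 19, "[O+]": 73}

def toy_tokenize(text):
    return [ch for ch in text if ch.isalnum()]

unreachable = find_split_tokens(toy_vocab, toy_tokenize)
print(unreachable)
```

With the real tokenizer the call would presumably be `find_split_tokens(tokenizer.vocab, tokenizer.tokenize)`. Special tokens may also show up in the result, depending on how the tokenizer handles them.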

(also posted here as I’m not sure where exactly the issue is)

Hi @Aron, the reason why [O+] is encoded as O may be the BPE encoding, but I haven’t found a way to correct that directly. However, I found an alternative way to solve this problem.

# Step1: Save the vocab.json from ChemBERTa pretrained model
from transformers import AutoTokenizer
model_checkpoint = 'DeepChem/ChemBERTa-77M-MTR'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.save_pretrained('/path/to/deepchem')

# Step2: Use WordLevel Model
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
new_tokenizer = Tokenizer(
    WordLevel.from_file(
        '/path/to/deepchem/vocab.json',
        unk_token='[UNK]'
))

# Step3: (important) Set pretokenizer to split the SMILES character
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split
pre_tokenizer = Split(
    pattern=Regex(r"\[(.*?)\]|.*?"),  # raw string so the backslashes reach the regex engine
    behavior='isolated'
)
pre_tokenizer.pre_tokenize_str('CCC1=[O+]') # You can test it with this line
new_tokenizer.pre_tokenizer = pre_tokenizer

# Step4: Check that the tokenizer works correctly!
test_smi = 'CCC1=[O+]'
for idx in new_tokenizer.encode(test_smi).ids:
    print(f"{idx} --> {new_tokenizer.id_to_token(idx)}")

# > 16 --> C
# 16 --> C
# 16 --> C
# 20 --> 1
# 22 --> =
# 73 --> [O+]

Thanks @lianghsun! Did you test the performance with this change? My worry is that if the model was trained with, for instance, [O+] encoded the same as O, then changing the tokenization now will only decrease performance, because the model will never have seen an [O+] token.

Totally agree with you @Aron. If this pre-trained transformer model hasn’t seen any [O+] tokens during training, it may be less likely to predict [O+] at the output. However, you can use the modified tokenizer to re-train the model to get better performance.
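Before re-training, a cheap first check is to measure how many inputs the two tokenizers actually disagree on. The sketch below is only an illustration under stated assumptions: `disagreement_rate` is a hypothetical helper, and the two stub tokenizers stand in for the original and the modified tokenizer so the example runs without downloading anything.

```python
import re


def disagreement_rate(smiles_list, old_tokenize, new_tokenize):
    """Return (fraction of inputs tokenized differently, list of those inputs)."""
    if not smiles_list:
        return 0.0, []
    changed = [s for s in smiles_list if old_tokenize(s) != new_tokenize(s)]
    return len(changed) / len(smiles_list), changed


# Stub tokenizers for demonstration only (assumptions, not the real ones).
def old_stub(smiles):
    # Imitates the reported behaviour: brackets and charges are dropped.
    return [ch for ch in smiles if ch not in "[]+-"]

def new_stub(smiles):
    # Keeps bracketed atoms such as [O+] together as one token.
    return re.findall(r"\[[^\]]*\]|.", smiles)

rate, changed = disagreement_rate(["CCO", "CCC1=[O+]"], old_stub, new_stub)
print(rate, changed)
```

With the real objects from this thread, the two callables would presumably be `tokenizer.tokenize` and `lambda s: new_tokenizer.encode(s).tokens`. A low rate on your own data suggests the change touches few molecules; it does not by itself say anything about model performance.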

@Aron @lianghsun that answer works for most cases but still has a few edge cases slipping through. Br and Cl still wouldn’t tokenize right, along with a few connector tokens. Here is the regex that worked for me in the end:

"Cl|Br|%[0-9]{2}|>>|\[(.*?)\]|."
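As a quick sanity check, the pattern can be tried on a few SMILES strings with Python's built-in `re` module. This is only a sketch: `split_smiles` is a hypothetical helper, and it is an assumption that the regex engine behind `tokenizers.Regex` splits these inputs the same way as `re` does.

```python
import re

# Pattern from the post above, with the square brackets escaped.
SMILES_PATTERN = re.compile(r"Cl|Br|%[0-9]{2}|>>|\[(.*?)\]|.")


def split_smiles(smiles):
    """Return the pieces matched by the pattern, in order."""
    # group(0) is the whole match. re.findall would instead return only the
    # contents of the capture group, because the pattern contains one group.
    return [m.group(0) for m in SMILES_PATTERN.finditer(smiles)]


print(split_smiles("CCC1=[O+]"))   # ['C', 'C', 'C', '1', '=', '[O+]']
print(split_smiles("ClCCBr"))      # ['Cl', 'C', 'C', 'Br']
print(split_smiles("C%12CC>>CC"))  # ['C', '%12', 'C', 'C', '>>', 'C', 'C']
```

The order of the alternatives matters: `Cl` and `Br` come before the catch-all `.`, so they are matched as two-character tokens rather than being split into single characters.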