Hi @Aron, the reason why [O+] is encoded as O is probably the BPE encoding, and I haven't found a way to correct that directly. However, I found an alternative way to solve this problem.
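To see why bracket atoms are fragile, note that any tokenizer that falls back to character-level pieces splits `[O+]` into `[`, `O`, `+`, `]`, so the `O` inside the brackets becomes indistinguishable from a plain oxygen token. A minimal stdlib illustration (not the actual ChemBERTa tokenizer, just the failure mode):

```python
# Naive character-level splitting of a SMILES string: the bracket
# atom [O+] falls apart, and its 'O' collides with a plain O token.
smi = 'CCC1=[O+]'
print(list(smi))  # ['C', 'C', 'C', '1', '=', '[', 'O', '+', ']']
```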
# Step1: Save the vocab.json from ChemBERTa pretrained model
from transformers import AutoTokenizer
model_checkpoint = 'DeepChem/ChemBERTa-77M-MTR'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.save_pretrained('/path/to/deepchem')
# Step2: Use a WordLevel model built from the saved vocab
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
new_tokenizer = Tokenizer(
    WordLevel.from_file(
        '/path/to/deepchem/vocab.json',
        unk_token='[UNK]'
    )
)
# Step3: (important) Set a pre-tokenizer that splits the SMILES into characters and bracket atoms
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split
pre_tokenizer = Split(
    pattern=Regex(r"\[(.*?)\]|.*?"),
    behavior='isolated'
)
pre_tokenizer.pre_tokenize_str('CCC1=[O+]') # You can test it with this line
new_tokenizer.pre_tokenizer = pre_tokenizer
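If you want to sanity-check the splitting idea without the `tokenizers` library, the same behavior can be reproduced with Python's stdlib `re` module. Note this uses a simplified variant of the pattern (`\[[^\]]*\]|.`, i.e. a bracket atom or any single character), not the exact `Regex` above:

```python
import re

# Simplified variant of the split pattern: match a whole bracket
# atom, or else any single character, left to right.
pattern = re.compile(r"\[[^\]]*\]|.")
print(pattern.findall('CCC1=[O+]'))  # ['C', 'C', 'C', '1', '=', '[O+]']
```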
# Step4: Check that the tokenizer works correctly!
test_smi = 'CCC1=[O+]'
for idx in new_tokenizer.encode(test_smi).ids:
    print(f"{idx} --> {new_tokenizer.id_to_token(idx)}")
# Output:
# 16 --> C
# 16 --> C
# 16 --> C
# 20 --> 1
# 22 --> =
# 73 --> [O+]
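For a self-contained check of steps 2–4 that doesn't need the saved vocab file, the same pipeline can be built from an in-memory toy vocabulary. The vocab dict and ids below are invented for illustration (they are not the real ChemBERTa vocab), and the split pattern is the simplified bracket-atom-or-character variant:

```python
from tokenizers import Tokenizer, Regex
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split

# Toy vocab for illustration only; the real ids come from the
# saved ChemBERTa vocab.json.
vocab = {'[UNK]': 0, 'C': 1, '1': 2, '=': 3, '[O+]': 4}
tok = Tokenizer(WordLevel(vocab, unk_token='[UNK]'))
# Simplified split pattern: a bracket atom, or any single character.
tok.pre_tokenizer = Split(pattern=Regex(r"\[[^\]]*\]|."), behavior='isolated')

enc = tok.encode('CCC1=[O+]')
print(enc.tokens)  # ['C', 'C', 'C', '1', '=', '[O+]']
print(enc.ids)     # [1, 1, 1, 2, 3, 4]
```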
Thanks @lianghsun! Did you test the performance after this change? What I’m worried about is that if the model was trained with, for instance, [O+] encoded the same as O, changing the tokenizer now will only decrease its performance, since it will never have seen an [O+] token.
Totally agree with you @Aron: if this pre-trained transformer model hasn’t seen any [O+] during training, the change may make the model less likely to predict [O+] at the output. However, you can use the modified tokenizer to re-train the model and get better performance.
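Before committing to retraining, it may be worth scanning your corpus for bracket atoms that the saved vocab doesn't cover. A stdlib sketch (the vocab excerpt and corpus here are invented for illustration):

```python
import re
from collections import Counter

# Invented vocab excerpt and corpus, purely for illustration.
vocab = {'C': 16, '1': 20, '=': 22, '[O+]': 73}
corpus = ['CCC1=[O+]', 'C[NH3+]Cl', 'O=C([O-])C']

# Count every bracket atom in the corpus, then keep the ones the
# vocab doesn't already contain.
bracket_atoms = Counter(
    tok for smi in corpus for tok in re.findall(r'\[[^\]]*\]', smi)
)
missing = {tok: n for tok, n in bracket_atoms.items() if tok not in vocab}
print(missing)  # {'[NH3+]': 1, '[O-]': 1}
```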
@Aron @lianghsun That answer works for most cases, but a few edge cases still slip through: Br and Cl still wouldn’t tokenize correctly, along with a few connector tokens. Here is the regex that worked for me in the end: