I have an issue where a tokenizer doesn’t recognise tokens in its own vocabulary. A minimal example is:
from transformers import AutoTokenizer
model_checkpoint = 'DeepChem/ChemBERTa-77M-MTR'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
test_smiles = 'CCC1=[O+]'
print(tokenizer.vocab['[O+]'])
print(tokenizer.tokenize(test_smiles))
This outputs:
73
['C', 'C', 'C', '1', '=', 'O']
Notice that the '[O+]' expression is encoded simply as 'O', even though it is in the vocabulary. This loses important information.
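For what it's worth, the loss also shows up in a round trip. A quick check, reusing tokenizer and test_smiles from above (the commented outputs are what I'd expect given the behaviour shown, not separately verified):

ids = tokenizer.encode(test_smiles, add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(ids))  # ['C', 'C', 'C', '1', '=', 'O']
print(tokenizer.decode(ids))  # decodes without '[', '+', ']' -- the original string is not recoverable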
(also posted here as I’m not sure where exactly the issue is)
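In case it helps narrow down where '[O+]' gets lost, here is a diagnostic sketch, assuming AutoTokenizer returned a fast (Rust-backed) tokenizer so that backend_tokenizer is available. It prints what the backend's normalizer and pre-tokenizer do to the string before the vocabulary is ever consulted:

backend = tokenizer.backend_tokenizer  # a tokenizers.Tokenizer instance

# what the normalizer does to the raw string, if one is configured
if backend.normalizer is not None:
    print(backend.normalizer.normalize_str(test_smiles))

# how the string is split before the model (e.g. BPE) runs; if '[', '+'
# and ']' are already gone at this stage, the issue is upstream of the vocabulary
if backend.pre_tokenizer is not None:
    print(backend.pre_tokenizer.pre_tokenize_str(test_smiles))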