I am working on molecule data in a representation called SMILES. An example molecule string looks like this: Cc1ccccc1N1C(=O)NC(=O)C(=Cc2cc(Br)c(N3CCOCC3)o2)C1=O
Now, I want a custom Tokenizer which can be used with the Hugging Face transformers APIs. I also do not want to use the existing tokenizer models like BPE etc. I want the SMILES string parsed through a regex to give individual characters (or atom symbols such as Br) as tokens, as follows:
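import re

# Regex that splits a SMILES string into atom, bond, ring and branch tokens.
# Keep the pattern on one line: a newline inside it would become part of the "#" alternative.
SMI_REGEX_PATTERN = r"""(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"""
regex = re.compile(SMI_REGEX_PATTERN)
molecule = 'Cc1ccccc1N1C(=O)NC(=O)C(=Cc2cc(Br)c(N3CCOCC3)o2)C1=O'
tokens = regex.findall(molecule)  # list of token strings, in order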
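As a quick sanity check on the pattern (just a sketch, not part of any library), joining the tokens should reproduce the input string, and a two-letter atom such as Br should come out as a single token. The pattern is repeated on a single line so the block runs on its own:

```python
import re

# Same SMILES pattern, written on a single line.
SMI_REGEX_PATTERN = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
regex = re.compile(SMI_REGEX_PATTERN)

molecule = "Cc1ccccc1N1C(=O)NC(=O)C(=Cc2cc(Br)c(N3CCOCC3)o2)C1=O"
tokens = regex.findall(molecule)

# Lossless check: if any character were skipped by the regex,
# the joined tokens would no longer equal the input.
is_lossless = "".join(tokens) == molecule

# Br must be one token, not "B" followed by an unmatched "r".
has_bromine_token = "Br" in tokens

print(is_lossless, has_bromine_token)
```

If the lossless check fails for some molecule, the pattern is silently dropping characters, which would be worth catching before any model training.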
It is fairly simple to do the above, but I need a tokenizer that works with, let’s say, the BERT API of Hugging Face. Also, I do not want lowercase conversion, but I still want to use BERT.
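To make the requirement concrete, here is a rough, library-free sketch of the behaviour I am after: a case-sensitive vocabulary built from the regex tokens plus BERT's usual special tokens, and an encode step that wraps a molecule as [CLS] ... [SEP]. The names build_vocab and encode are hypothetical helpers of my own, not Hugging Face functions, and the ID assignment is an arbitrary assumption. How to plug this into the actual transformers tokenizer classes is the part I am unsure about.

```python
import re

SMI_REGEX_PATTERN = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
regex = re.compile(SMI_REGEX_PATTERN)

# BERT's conventional special tokens; the IDs given to them here are arbitrary.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(smiles_list):
    """Hypothetical helper: map each distinct token to an integer ID, case-sensitively."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for smi in smiles_list:
        for tok in regex.findall(smi):
            # Only assign a new ID the first time a token is seen.
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(smi, vocab):
    """Hypothetical helper: [CLS] + regex tokens + [SEP], with unseen tokens mapped to [UNK]."""
    tokens = ["[CLS]"] + regex.findall(smi) + ["[SEP]"]
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


molecule = "Cc1ccccc1N1C(=O)NC(=O)C(=Cc2cc(Br)c(N3CCOCC3)o2)C1=O"
vocab = build_vocab([molecule])
ids = encode(molecule, vocab)
print(len(vocab), ids[:5])
```

Lowercasing has to stay off because in SMILES c (aromatic carbon) and C (aliphatic carbon) are different tokens, so they need different IDs.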
The quicktour documentation doesn’t talk about creating a custom tokenizer model, as far as I can see.
Any help is highly appreciated.