Custom huggingface Tokenizer with custom model for BERT

spiralarchitect · May 13, 2021, 4:17am

I am working on molecule data in a representation called SMILES. An example molecule string looks like Cc1ccccc1N1C(=O)NC(=O)C(=Cc2cc(Br)c(N3CCOCC3)o2)C1=O.

Now I want a custom tokenizer that can be used with the Hugging Face Transformers APIs. I also do not want to use existing tokenizer models such as BPE. Instead, I want the SMILES string parsed with a regex to give individual characters as tokens, as follows:

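import re

# Regex that splits a SMILES string into tokens (bracket atoms, Br, Cl, bonds, ring digits, ...).
# The pattern is built from adjacent raw strings so that no newline ends up inside it;
# a literal line break before "#" would stop the triple-bond symbol from matching.
SMI_REGEX_PATTERN = (
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|"
    r"#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)

regex = re.compile(SMI_REGEX_PATTERN)

molecule = 'Cc1ccccc1N1C(=O)NC(=O)C(=Cc2cc(Br)c(N3CCOCC3)o2)C1=O'
tokens = regex.findall(molecule)  # list of token strings, case preserved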

It is fairly simple to do the above, but I need a tokenizer that works with, say, the BERT API in Hugging Face. Also, I do not want lowercase conversion, but I still want to use BERT.
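To make the requirement concrete, here is a minimal sketch in plain Python of the behavior described above: regex tokens, no lowercasing, and BERT-style special tokens around each sequence. It is only an illustration, not a Hugging Face class. The name SmilesRegexTokenizer and its methods are hypothetical, and the vocabulary is simply collected from whatever SMILES strings are passed in.

```python
import re

# Same SMILES pattern as in the post, written without a line break inside it.
SMI_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|"
    r"#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)

# The special-token strings conventionally used by BERT.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


class SmilesRegexTokenizer:
    """Hypothetical helper: regex tokenization plus a word-level vocabulary."""

    def __init__(self, corpus):
        # Special tokens get the first ids; every regex token seen in the
        # corpus is appended in order of first appearance.
        vocab = list(SPECIAL_TOKENS)
        for smiles in corpus:
            for tok in SMI_REGEX.findall(smiles):
                if tok not in vocab:
                    vocab.append(tok)
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def tokenize(self, smiles):
        # No lowercasing: "C" and "c" stay distinct tokens.
        return SMI_REGEX.findall(smiles)

    def encode(self, smiles):
        # Wrap the sequence in [CLS] ... [SEP]; unseen tokens map to [UNK].
        tokens = ["[CLS]"] + self.tokenize(smiles) + ["[SEP]"]
        unk_id = self.token_to_id["[UNK]"]
        return [self.token_to_id.get(tok, unk_id) for tok in tokens]

    def decode(self, ids):
        # Drop special tokens and join the rest back into a SMILES string.
        tokens = (self.id_to_token[i] for i in ids)
        return "".join(tok for tok in tokens if tok not in SPECIAL_TOKENS)


molecule = "Cc1ccccc1N1C(=O)NC(=O)C(=Cc2cc(Br)c(N3CCOCC3)o2)C1=O"
tokenizer = SmilesRegexTokenizer([molecule])
ids = tokenizer.encode(molecule)
```

What is still missing is the same behavior behind an interface that the Hugging Face BERT classes accept.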

The documentation here, in the quicktour, does not cover creating a custom model, as far as I can see.

Any help is highly appreciated.
