Hi @amkorba, you could try retraining a BERT-based tokenizer with a customized Split() pre-tokenizer. Let me demonstrate it step by step.
Step 1: Suppose you have a dataset (fraction.txt) like the one below:
Hi 1-1/2 there
Lorem oweh 3/4
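If you want to follow along, you can write that sample to disk first (adjust the path to whatever you use in Step 4):
# toy dataset for the walkthrough; the path is just for illustration
with open('fraction.txt', 'w') as f:
    f.write('Hi 1-1/2 there\nLorem oweh 3/4\n')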
Step 2: Load a pre-trained BERT tokenizer into HF Tokenizers
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Split
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
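Before retraining, you can see the problem: the stock BERT pre-tokenizer shatters the fraction into single characters:
tokenizer.encode("Hi 1-1/2 there").tokens
# > expect something like ['[CLS]', 'hi', '1', '-', '1', '/', '2', 'there', '[SEP]'] <-- fraction is gone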
Step 3: Define the pattern that splits out fractions
from tokenizers import Regex
# note: the spaces inside the pattern are literal, so they end up inside the matched token
pre_tokenizer = Split(
    pattern=Regex(r"( ([1-9]\d*\-) | ([1-9]\d*\/[1-9]\d*) | ([1-9]\d*\-[1-9]\d*\/[1-9]\d*) )"),
    behavior='isolated'
)
pre_tokenizer.pre_tokenize_str('Hi 1-1/2 there')
# > [('Hi', (0, 2)), (' 1-1/2 ', (2, 9)), ('there', (9, 14))] <-- it works!

# Important: attach it, otherwise training below still runs BERT's default pre-tokenizer
tokenizer.pre_tokenizer = pre_tokenizer
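By the way, 'isolated' is just one of the Split behaviors. Here is a quick sketch comparing all of them on a simplified pattern (just the bare fraction, no surrounding spaces), so you can pick the one you prefer:
# compare the Split behaviors on a bare-fraction pattern (simplified for this demo)
for b in ['removed', 'isolated', 'merged_with_previous', 'merged_with_next', 'contiguous']:
    demo = Split(pattern=Regex(r"[1-9]\d*/[1-9]\d*"), behavior=b)
    print(b, demo.pre_tokenize_str('Lorem oweh 3/4'))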
Step 4: Re-train the BERT tokenizer so the fractions end up in your vocab!
from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(
    vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
files = ['/path/2/fraction.txt']
tokenizer.train(files, trainer)
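Training only changes the tokenizer in memory, so save it if you want to reuse it later (the file name is up to you):
tokenizer.save("fraction-tokenizer.json")
# load it back later with:
# tokenizer = Tokenizer.from_file("fraction-tokenizer.json")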
Step 5: Check the newly trained tokenizer!
tokenizer.encode("Hi 1-1/2 there").tokens
# > ['[CLS]', 'hi', ' 1-1/2 ', 'ther', '##e', '[SEP]']
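You can also confirm the fraction really landed in the vocab (the surrounding spaces are part of the token because the regex captured them):
tokenizer.token_to_id(' 1-1/2 ')
# > an integer id if it's in the vocab, None otherwise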