Dealing with Decimal and Fractions

Hi @amkorba, you could try retraining a BERT-based tokenizer with a customized Split() pre-tokenizer. Let's go through it step by step :hugs:.

Step 1: Suppose you have a dataset (fraction.txt) like the one below:

Hi 1-1/2 there
Lorem oweh 3/4
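
If you want to follow along, here is a minimal sketch that writes the sample file above. The temporary directory and the names `sample_lines`, `data_dir` and `data_file` are only my assumptions for illustration; put the file wherever suits you.

```python
from pathlib import Path
import tempfile

# The two sample lines from Step 1.
sample_lines = ["Hi 1-1/2 there", "Lorem oweh 3/4"]

# Hypothetical location: a fresh temporary directory. Any writable folder works.
data_dir = Path(tempfile.mkdtemp())
data_file = data_dir / "fraction.txt"

# One example per line, with a trailing newline at the end of the file.
data_file.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")

# A list of file paths, the form that tokenizer.train() takes in Step 4.
files = [str(data_file)]
print(files)
```

The resulting `files` list can stand in for the placeholder path used in Step 4.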

Step 2: Load the pre-trained BERT tokenizer with HF Tokenizers

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Split

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Step 3: Define the pattern that splits out fractions

from tokenizers import Regex

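# Raw string so the backslashes reach the regex engine untouched.
# The spaces inside the pattern are literal, so a match includes the spaces around the fraction.
pre_tokenizer = Split(
    pattern=Regex(r"( ([1-9]\d*\-) | ([1-9]\d*\/[1-9]\d*) | ([1-9]\d*\-[1-9]\d*\/[1-9]\d*) )"),
    behavior='isolated'
)
pre_tokenizer.pre_tokenize_str('Hi 1-1/2 there')
# > [('Hi', (0, 2)), (' 1-1/2 ', (2, 9)), ('there', (9, 14))] <-- it works!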
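
One thing worth checking is how the pattern above behaves on both sample lines. The sketch below uses Python's built-in `re` module instead of the `tokenizers` regex engine, so treat it as an approximation. `VARIANT` and `find_fractions` are hypothetical names of mine, not part of any library. The variant drops the literal spaces and the lone "digits-" alternative, and I have not verified that it behaves the same way inside `Regex(...)`.

```python
import re

# Same three alternatives as the Split pattern above, literal spaces included.
ORIGINAL = re.compile(
    r"( ([1-9]\d*\-) | ([1-9]\d*\/[1-9]\d*) | ([1-9]\d*\-[1-9]\d*\/[1-9]\d*) )"
)

# Hypothetical variant: optional whole part ("1-"), then numerator/denominator.
# The lookarounds keep the match from starting or ending inside a longer number.
VARIANT = re.compile(r"(?<![\d/-])[1-9]\d*(?:-[1-9]\d*)?/[1-9]\d*(?![\d/-])")


def find_fractions(pattern, text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]


for line in ["Hi 1-1/2 there", "Lorem oweh 3/4"]:
    print(repr(line), find_fractions(ORIGINAL, line), find_fractions(VARIANT, line))
# 'Hi 1-1/2 there' [' 1-1/2 '] ['1-1/2']
# 'Lorem oweh 3/4' [] ['3/4']
```

With the literal spaces, a fraction at the very end of a line (like "3/4" in the second sample) has no trailing space and is not matched, at least under `re`. If that matters for your data, a space-free pattern along these lines may be worth trying.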

Step 4: Re-train the BERT tokenizer so it learns fractions as part of your vocab!

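from tokenizers.trainers import WordPieceTrainer

# Attach the custom splitter from Step 3 before training;
# otherwise BERT's default pre-tokenizer still splits on punctuation such as "-" and "/".
tokenizer.pre_tokenizer = pre_tokenizer

trainer = WordPieceTrainer(
    vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
files = ['/path/2/fraction.txt']  # path to the dataset from Step 1
tokenizer.train(files, trainer)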

Step 5: Check the newly trained tokenizer!

tokenizer.encode("Hi 1-1/2 there").tokens
# > ['[CLS]', 'hi', ' 1-1/2 ', 'ther', '##e', '[SEP]']
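
If you want to check this automatically rather than by eye, here is a small sketch that scans a token list for tokens that are a complete fraction. It only uses the standard library and works on the token list shown above. `FRACTION` and `whole_fraction_tokens` are hypothetical helpers of mine, and the fraction shape (optional whole part, then numerator/denominator) is an assumption.

```python
import re

# Assumed shape of a fraction token: optional whole part ("1-"), then numerator/denominator.
FRACTION = re.compile(r"[1-9]\d*(?:-[1-9]\d*)?/[1-9]\d*")


def whole_fraction_tokens(tokens):
    """Return the tokens that, ignoring surrounding spaces, are one complete fraction."""
    return [t for t in tokens if FRACTION.fullmatch(t.strip())]


# Token list copied from the Step 5 output.
tokens = ['[CLS]', 'hi', ' 1-1/2 ', 'ther', '##e', '[SEP]']
print(whole_fraction_tokens(tokens))
# [' 1-1/2 ']

# A fraction broken into pieces is not reported.
print(whole_fraction_tokens(['1', '-', '1', '/', '2']))
# []
```

An empty result for a sentence that contains a fraction would suggest the fraction was split into several tokens.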