Dealing with Decimals and Fractions

Hi Team,

I’m currently working with a dataset that contains decimals and fractions, and these numbers have a big impact on my task.

When I try a BERT-based tokenizer, it tokenizes “1-1/2” as [1, -, 1, /, 2], but I want it as a single token, i.e. [“1-1/2”].
Something similar happens with decimals too.
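For illustration, the shattering comes from punctuation-based splitting. The sketch below uses only the stdlib to roughly mimic that behavior (it is an approximation, not the actual BertPreTokenizer code):

```python
import re

def punct_split(text):
    # Split on every non-alphanumeric character and keep the separators,
    # roughly mimicking BERT-style punctuation splitting.
    return [piece for piece in re.split(r"(\W)", text) if piece.strip()]

print(punct_split("1-1/2"))  # ['1', '-', '1', '/', '2']
```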

Please suggest possible ways to tokenize these properly.

Thanks
Ashish

Hi @amkorba, you can retrain a BERT-based tokenizer with a customized Split() pre-tokenizer. Let’s demonstrate it step by step :hugs:.

Step 1: Suppose you have a dataset (fraction.txt) like the one below:

Hi 1-1/2 there
Lorem oweh 3/4

Step 2: Load a pre-trained BERT tokenizer into HF Tokenizers

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Split

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Step 3: Define the pattern to split fractions, and attach it to the tokenizer

from tokenizers import Regex

pre_tokenizer = Split(
    pattern=Regex(r"( ([1-9]\d*-) | ([1-9]\d*/[1-9]\d*) | ([1-9]\d*-[1-9]\d*/[1-9]\d*) )"),
    behavior="isolated",
)
tokenizer.pre_tokenizer = pre_tokenizer  # attach it, so training below actually uses it
pre_tokenizer.pre_tokenize_str("Hi 1-1/2 there")
# > [('Hi', (0, 2)), (' 1-1/2 ', (2, 9)), ('there', (9, 14))] <-- it works!
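As a sanity check, the same pattern can be verified with Python’s stdlib `re` module, with no tokenizers dependency (the alternatives are reordered longest-first here for clarity; the literal spaces are part of the match, which is why the isolated piece keeps its surrounding spaces):

```python
import re

# Same alternatives as the tokenizers.Regex above, longest first so the
# full mixed number "1-1/2" wins. Literal spaces are part of each match.
FRACTION = re.compile(r"( (?:[1-9]\d*-[1-9]\d*/[1-9]\d*|[1-9]\d*/[1-9]\d*|[1-9]\d*-) )")

def split_fractions(text):
    # One capturing group, so re.split keeps the matched spans as their
    # own pieces, similar to behavior='isolated'.
    return [piece for piece in FRACTION.split(text) if piece]

print(split_fractions("Hi 1-1/2 there"))  # ['Hi', ' 1-1/2 ', 'there']
```

One caveat worth noting: the pattern requires a space on both sides, so a fraction at the very end of a line (like `3/4` in the sample dataset) will not be isolated; word-boundary anchors such as `\b` could be an alternative if that matters for your data.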

Step 4: Re-train the BERT tokenizer so fractions make it into your vocab!

from tokenizers.trainers import WordPieceTrainer

trainer = WordPieceTrainer(
    vocab_size=30522,  # same size as the original bert-base-uncased vocab
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
files = ["/path/2/fraction.txt"]
tokenizer.train(files, trainer)

Step 5: Check the newly trained tokenizer!

tokenizer.encode("Hi 1-1/2 there").tokens
# > ['[CLS]', 'hi', ' 1-1/2 ', 'ther', '##e', '[SEP]']
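Since the original question also mentioned decimals, the same approach extends naturally: add a decimal alternative such as `\d+\.\d+` to the Split pattern (this extension is my assumption, not something the steps above cover). Sketched here with stdlib `re` for easy checking; the same pattern string can be passed to `tokenizers.Regex`:

```python
import re

# Fraction alternatives from above, plus a hypothetical decimal
# alternative (\d+\.\d+) to keep numbers like "2.75" in one piece.
NUMERIC = re.compile(
    r"( (?:[1-9]\d*-[1-9]\d*/[1-9]\d*|[1-9]\d*/[1-9]\d*|\d+\.\d+) )"
)

def split_numbers(text):
    return [piece for piece in NUMERIC.split(text) if piece]

print(split_numbers("mix 1-1/2 cups with 2.75 liters"))
# ['mix', ' 1-1/2 ', 'cups with', ' 2.75 ', 'liters']
```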