Dealing with Decimals and Fractions

Hi Team,

I’m currently working with a dataset that contains decimals and fractions, and these numbers have a big impact on my task.

When I try a BERT-based tokenizer, it tokenizes “1-1/2” as [1, -, 1, /, 2], but I want it as a single token, i.e. [“1-1/2”].
Something similar happens with decimals too.
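For illustration, the shattering comes from punctuation-based splitting. The sketch below uses only the stdlib to roughly mimic that behavior (it is an approximation, not the actual BertPreTokenizer code):

```python
import re

def punct_split(text):
    # Split on every non-alphanumeric character and keep the separators,
    # roughly mimicking BERT-style punctuation splitting.
    return [piece for piece in re.split(r"(\W)", text) if piece.strip()]

print(punct_split("1-1/2"))  # ['1', '-', '1', '/', '2']
```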

Please suggest possible ways to tokenize these properly.

Thanks
Ashish

Hi @amkorba, you can retrain a BERT-based tokenizer with a customized Split() pre-tokenizer. Let’s demonstrate it step by step :hugs:.

Step 1: Suppose you have a dataset (fraction.txt) like the one below:

Hi 1-1/2 there
Lorem oweh 3/4

Step 2: Load a pre-trained BERT tokenizer into HF Tokenizers

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Split

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Step 3: Define the pattern to split fractions, and attach it to the tokenizer

from tokenizers import Regex

pre_tokenizer = Split(
    pattern=Regex(r"( ([1-9]\d*-) | ([1-9]\d*/[1-9]\d*) | ([1-9]\d*-[1-9]\d*/[1-9]\d*) )"),
    behavior="isolated",
)
tokenizer.pre_tokenizer = pre_tokenizer  # attach it, so training below actually uses it
pre_tokenizer.pre_tokenize_str("Hi 1-1/2 there")
# > [('Hi', (0, 2)), (' 1-1/2 ', (2, 9)), ('there', (9, 14))] <-- it works!
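As a sanity check, the same pattern can be verified with Python’s stdlib `re` module, with no tokenizers dependency (the alternatives are reordered longest-first here for clarity; the literal spaces are part of the match, which is why the isolated piece keeps its surrounding spaces):

```python
import re

# Same alternatives as the tokenizers.Regex above, longest first so the
# full mixed number "1-1/2" wins. Literal spaces are part of each match.
FRACTION = re.compile(r"( (?:[1-9]\d*-[1-9]\d*/[1-9]\d*|[1-9]\d*/[1-9]\d*|[1-9]\d*-) )")

def split_fractions(text):
    # One capturing group, so re.split keeps the matched spans as their
    # own pieces, similar to behavior='isolated'.
    return [piece for piece in FRACTION.split(text) if piece]

print(split_fractions("Hi 1-1/2 there"))  # ['Hi', ' 1-1/2 ', 'there']
```

One caveat worth noting: the pattern requires a space on both sides, so a fraction at the very end of a line (like `3/4` in the sample dataset) will not be isolated; word-boundary anchors such as `\b` could be an alternative if that matters for your data.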

Step 4: Re-train the BERT tokenizer so fractions make it into your vocab!

from tokenizers.trainers import WordPieceTrainer

trainer = WordPieceTrainer(
    vocab_size=30522,  # same size as the original bert-base-uncased vocab
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
files = ["/path/2/fraction.txt"]
tokenizer.train(files, trainer)

Step 5: Check the newly trained tokenizer!

tokenizer.encode("Hi 1-1/2 there").tokens
# > ['[CLS]', 'hi', ' 1-1/2 ', 'ther', '##e', '[SEP]']
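Since the original question also mentioned decimals, the same approach extends naturally: add a decimal alternative such as `\d+\.\d+` to the Split pattern (this extension is my assumption, not something the steps above cover). Sketched here with stdlib `re` for easy checking; the same pattern string can be passed to `tokenizers.Regex`:

```python
import re

# Fraction alternatives from above, plus a hypothetical decimal
# alternative (\d+\.\d+) to keep numbers like "2.75" in one piece.
NUMERIC = re.compile(
    r"( (?:[1-9]\d*-[1-9]\d*/[1-9]\d*|[1-9]\d*/[1-9]\d*|\d+\.\d+) )"
)

def split_numbers(text):
    return [piece for piece in NUMERIC.split(text) if piece]

print(split_numbers("mix 1-1/2 cups with 2.75 liters"))
# ['mix', ' 1-1/2 ', 'cups with', ' 2.75 ', 'liters']
```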