I’m currently working with a dataset that contains decimals and fractions, and these numbers have a big impact on my task.

When I try a BERT-based tokenizer, it tokenizes “1-1/2” as [1, -, 1, /, 2], but I want it kept as a single token, i.e. [“1-1/2”].
Something similar happens with decimals too.

Please suggest possible solutions to tokenize these properly.

Hi @amkorba, you should try retraining a BERT-based tokenizer with a customized `Split()` pre-tokenizer. Let’s go step by step to demonstrate it.
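For context, the behaviour in the question comes from BERT's pre-tokenization, which isolates every punctuation character before WordPiece ever sees the text. The sketch below is only a rough approximation of that rule using Python's built-in `re` module; `rough_bert_pre_tokenize` is a hypothetical helper, not part of any library.

```python
import re

def rough_bert_pre_tokenize(text):
    # Approximation only: runs of word characters stay together,
    # and every other non-space character becomes its own piece.
    return re.findall(r"\w+|[^\w\s]", text)

print(rough_bert_pre_tokenize("Hi 1-1/2 there"))
# ['Hi', '1', '-', '1', '/', '2', 'there']
print(rough_bert_pre_tokenize("2.5"))
# ['2', '.', '5']
```

Since the fraction is already broken apart at this stage, the fix has to happen in the pre-tokenizer, which is what the steps below do.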

Step 1: Suppose you have a dataset (fraction.txt) like the one below:

``````
Hi 1-1/2 there
Lorem oweh 3/4
``````
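To follow along, the sample file can be created from Python. This is a minimal sketch that assumes `fraction.txt` should live in the current working directory; the variable names are only for illustration.

```python
from pathlib import Path

# Write the two sample lines from Step 1 to fraction.txt.
sample_path = Path("fraction.txt")
sample_path.write_text("Hi 1-1/2 there\nLorem oweh 3/4\n", encoding="utf-8")

# Read the file back to confirm its contents.
lines = sample_path.read_text(encoding="utf-8").splitlines()
print(lines)
```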

Step 2: Load the pre-trained BERT tokenizer with HF Tokenizers

``````
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Split

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
``````

Step 3: Define the pattern to split out fractions

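``````
from tokenizers import Regex

# Raw string so the backslashes reach the regex engine unchanged.
# Note: the spaces inside the pattern are literal, so each match includes its surrounding spaces.
pre_tokenizer = Split(
    pattern=Regex(r"( ([1-9]\d*\-) | ([1-9]\d*\/[1-9]\d*) | ([1-9]\d*\-[1-9]\d*\/[1-9]\d*) )"),
    behavior='isolated'
)
pre_tokenizer.pre_tokenize_str('Hi 1-1/2 there')
# > [('Hi', (0, 2)), (' 1-1/2 ', (2, 9)), ('there', (9, 14))] <-- it works!
``````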
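One caveat worth checking: the spaces inside the Step 3 pattern are literal characters, so a match has to be surrounded by spaces, and a fraction at the end of a line (such as `3/4` in fraction.txt) may not match at all. The sketch below uses Python's built-in `re` module as a stand-in for the tokenizers `Regex` (an assumption; the two engines should agree on a pattern this simple) and compares the original pattern with a hypothetical whitespace-free variant, `FRACTION_OR_DECIMAL`, that also covers plain decimals such as `2.5`.

```python
import re

# The Step 3 pattern, with its literal spaces kept as-is.
ORIGINAL = re.compile(
    r"( ([1-9]\d*\-) | ([1-9]\d*\/[1-9]\d*) | ([1-9]\d*\-[1-9]\d*\/[1-9]\d*) )"
)

# Hypothetical tighter variant (not from the original answer):
# an optional whole part ("1-"), then a fraction ("1/2"), or a decimal ("2.5").
FRACTION_OR_DECIMAL = re.compile(r"(?:[1-9]\d*-)?[1-9]\d*/[1-9]\d*|\d+\.\d+")

def find_matches(pattern, text):
    # Return the full text of every non-overlapping match.
    return [m.group(0) for m in pattern.finditer(text)]

for line in ["Hi 1-1/2 there", "Lorem oweh 3/4", "about 2.5 cm"]:
    print(repr(line))
    print("  original:", find_matches(ORIGINAL, line))
    print("  tighter: ", find_matches(FRACTION_OR_DECIMAL, line))
```

If the tighter variant suits your data, it could be passed to `Regex(...)` in place of the original pattern. Because it no longer consumes the surrounding spaces, `Split` alone would then leave whitespace attached to neighbouring pieces, so it would likely need to be combined with a whitespace-only splitter such as `WhitespaceSplit`.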

Step 4: Retrain the BERT tokenizer so it learns fractions as part of your vocab!

``````
from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(
