BERT WordPiece Tokenizer: some matras missing after tokenization for Hindi Language #572

akshat311 · December 23, 2020, 12:39pm

We trained our own BERT WordPiece tokenizer on the following dataset: https://drive.google.com/file/d/12MbWKERa7QPfI9F-xnMgIgAhjbN8EaQK/view?usp=sharing

Words ending with certain matras (e.g. ए - े) lose those matras in the tokens.
For example, for the sentence "अपने पोस्ट ऑफिस में 420 पदों पर भर्ती"
the tokens were: [‘अपन’, ‘पोस्ट’, ‘ऑफिस’, ‘म’, ‘420’, ‘पदो’, ‘पर’, ‘भर्ती’]
The first word (अपने) and the fourth word (में) are missing the ए matra (े) after tokenization. The anusvara (ं) in में and पदों is gone as well.

The pretrained Hugging Face tokenizers behave the same way: all the uncased tokenizers have exactly this issue, and words ending with the ए matra lose it after tokenization. The cased pretrained tokenizers work fine. (“bert-base-multilingual-cased” works correctly, while “bert-base-multilingual-uncased” has the issue described above.)
Tokenization result for “bert-base-multilingual-cased”: [‘अपने’, ‘प’, ‘##ो’, ‘##स्ट’, ‘ऑफ’, ‘##िस’, ‘में’, ‘420’, ‘##0’, ‘पद’, ‘##ों’, ‘पर’, ‘भर’, ‘##्ती’]
Tokenization result for “bert-base-multilingual-uncased”: [‘अपन’, ‘प’, ‘##ो’, ‘##सट’, ‘ऑफ’, ‘##िस’, ‘म’, ‘420’, ‘##0’, ‘पद’, ‘##ो’, ‘पर’, ‘भर’, ‘##ती’]

Why are these matras dropped by our own tokenizer and by the uncased BERT tokenizers?
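
One lead worth checking (an assumption, not something confirmed in this thread): BERT-style uncased preprocessing usually strips accents along with lowercasing, which means NFD-normalizing the text and dropping every character in the Unicode category Mn (nonspacing mark). The sketch below uses only Python's standard `unicodedata` module to mimic that step on the sentence above. The helper name `strip_nonspacing_marks` is hypothetical and is not part of any tokenizer library.

```python
import unicodedata


def strip_nonspacing_marks(text: str) -> str:
    """Approximate BERT-style accent stripping (assumed behaviour):
    NFD-normalize, then drop characters whose Unicode category is Mn."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


sentence = "अपने पोस्ट ऑफिस में 420 पदों पर भर्ती"

# Compare each word before and after the simulated accent stripping.
for word in sentence.split():
    print(word, "->", strip_nonspacing_marks(word))

# Inspect the Unicode category of the dependent signs used in the sentence.
for ch in ["\u0947", "\u0902", "\u094d", "\u094b", "\u093f", "\u0940"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

In this check, े (U+0947), ं (U+0902) and the virama ् (U+094D) are category Mn and get removed, while ो (U+094B), ि (U+093F) and ी (U+0940) are category Mc and survive. That matches the “bert-base-multilingual-uncased” output above. If this is indeed the cause, disabling accent stripping when training or loading the tokenizer (for example a `strip_accents=False` style option, where the library in use exposes one) would be the thing to try.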
