How to handle parentheses, quotation marks, \n, etc. when creating a tokenizer from scratch

jonathanalis · June 26, 2022, 6:15am

Hello,
I have a large corpus from a specific domain.
We decided it was better to create a tokenizer from scratch, because too many important tokens were missing from the general-language tokenizer's vocabulary.
We use SentencePieceBPETokenizer.

However, in my first tests, odd things happened.
Let's consider the generic token “token”. It and its capitalized variations appear in the vocabulary.
However, “(token”, ““token”, “\ntoken”, “\n\ntoken”, “\n\n\ntoken”, “\n(token”, and “.\nToken” also appear in the vocabulary: that is, the token preceded by a parenthesis, a quotation mark, a sequence of one or more newlines (\n), or a combination of these, and even a period followed by a newline and the token.

Therefore, the tokenizer is not separating \n, quotation marks, parentheses, etc. from the tokens. Instead, it adds extra vocabulary entries for each variation of a token with these characters attached, so words that have the same meaning are treated as different entries in the vocabulary.
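
To make the effect concrete, here is a minimal standard-library sketch (not the tokenizers API) of the segmentation I would expect to happen before any merges are learned: newlines, quotation marks, and parentheses are isolated, so every variant reduces to the same word piece. The regex and the helper name `isolate_symbols` are made up for illustration.

```python
import re

# Hypothetical pattern: a newline, a single punctuation character,
# or a run of word characters. Plain spaces are skipped.
_PIECES = re.compile(r"\n|[^\w\s]|\w+")

def isolate_symbols(text):
    """Return words, punctuation marks, and newlines as separate pieces."""
    return _PIECES.findall(text)

variants = ["token", "(token", '"token', "\ntoken", "\n\ntoken", "\n(token", ".\nToken"]
for v in variants:
    print(repr(v), "->", isolate_symbols(v))

# After isolation, every variant contributes the same (lowercased) word piece.
words = {p.lower() for v in variants for p in isolate_symbols(v) if p.isalnum()}
```

If the tokenizer saw pieces like these during training, only “token” and “Token” would need vocabulary entries, and the symbols would each get their own.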

How should I handle this problem? And how can I do it using the Tokenizer class?

For now, my code looks like this:

from tokenizers import SentencePieceBPETokenizer, normalizers

tokenizer = SentencePieceBPETokenizer()
# Apply Unicode NFD normalization before tokenization.
tokenizer.normalizer = normalizers.Sequence([normalizers.NFD()])
tokenizer.train_from_iterator(
    dataset["text"],
    vocab_size=60_000,
    min_frequency=5,
    show_progress=True,
    limit_alphabet=500,
)

I am aware that I could pass special tokens through the special_tokens parameter of the tokenizer's train_from_iterator function, for instance:

special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

But how can I include more special tokens and assign specific characters (such as newline, quotation mark, parenthesis) to them?
Also, if I do so, are these tokens treated as separate from other tokens when the tokenizer is built? Will that solve my problem?
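
One direction that might address this without special tokens is a custom pre-tokenizer on the generic Tokenizer class. The sketch below is an assumption about a possible setup, not a confirmed fix: it chains `pre_tokenizers.Split` on "\n", `pre_tokenizers.Punctuation`, and `pre_tokenizers.Metaspace`, so that BPE merges cannot cross those boundaries. The tiny corpus, the small `vocab_size`, and the `glued` check are placeholders for illustration.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Tiny stand-in corpus; in practice this would be dataset["text"].
corpus = [
    'The token is here.\nToken again (token) and "token".',
    "Another line\n\n(token) follows. Token, token!",
] * 10

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFD()
# Isolate newlines and punctuation first, then apply the
# SentencePiece-style Metaspace step to what remains.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split("\n", behavior="isolated"),
    pre_tokenizers.Punctuation(behavior="isolated"),
    pre_tokenizers.Metaspace(),
])

trainer = trainers.BpeTrainer(
    vocab_size=200,  # placeholder; the real run would use 60_000
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

vocab = tokenizer.get_vocab()
# Vocabulary entries that glue a parenthesis, quote, or newline onto letters.
glued = [t for t in vocab
         if any(c in t for c in '()"\n') and any(c.isalpha() for c in t)]
print(glued)
```

With this setup the symbols should end up as their own vocabulary entries rather than being merged into words. A decoder and post-processor would still need to be configured separately, and whether Metaspace prepends its marker to every isolated piece may depend on the library version.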

Thank you
