Avoid creating certain tokens when training a tokenizer

Hello.
Is there a way to train a tokenizer from scratch but force the algorithm (let's say WordPiece) not to include certain tokens in the vocabulary, based on a given rule?

For instance, let's say we have the words:

['rondônia', 'Rondônia', 'RONDÔNIA']

Training a lowercase WordPiece tokenizer on my dataset tokenizes all three of them as ['rondôn', '##ia'].

However, I tried something different. I preprocess them (using a custom pre-tokenizer, a simplified sketch of which is below) into, respectively:
['rondônia', 'rondônia_FU', 'rondônia_U']
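
The custom pre-tokenizer is roughly this sketch (simplified and illustrative, not my exact code; the class name and the marker handling are just for the example):

```python
import re
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


class CaseMarkingPreTokenizer:
    """Split on whitespace, lowercase each word, and append a case marker."""

    def split_and_mark(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        pieces = []
        text = str(normalized)
        for match in re.finditer(r"\S+", text):
            piece = normalized[match.start():match.end()]
            word = str(piece)
            piece.lowercase()
            if word.isupper():
                piece.append("_U")    # RONDÔNIA -> rondônia_U
            elif word[:1].isupper():
                piece.append("_FU")   # Rondônia -> rondônia_FU
            pieces.append(piece)
        return pieces

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.split_and_mark)


# attached like this:
# tok_wpc.pre_tokenizer = PreTokenizer.custom(CaseMarkingPreTokenizer())
```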

I tried tokenizing, but I got:
['rondôn', '##ia'], ['rondôn', '##ia_FU'], ['rondôn', '##ia_U']

This is not what I expected. I expected:
['rondôn', '##ia'], ['rondôn', '##ia', '##_FU'], ['rondôn', '##ia', '##_U']
(or alternatively ['rondônia'], ['rondônia', '##_FU'], ['rondônia', '##_U'])

I want to train the tokenizer in such a way that '_FU' and '_U' never end up inside larger tokens: they should be an exception, so that when the tokenizer finds these markers in a word it splits them off separately, much like punctuation…

Is this possible? Is there such an option when training a (WordPiece) tokenizer?
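
The closest thing I could think of is the built-in Split pre-tokenizer in 'isolated' mode, plus registering the markers as special tokens so they stay atomic, but as far as I can tell this would give bare '_FU' / '_U' tokens rather than '##_FU' / '##_U'. A rough sketch of that idea (the vocab size and special tokens are just placeholders):

```python
from tokenizers import Regex, Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Sequence, Split, Whitespace
from tokenizers.trainers import WordPieceTrainer

tok = Tokenizer(WordPiece(unk_token="[UNK]"))

# Split the case markers off as isolated pre-tokens so they can never be
# merged into a larger token.
tok.pre_tokenizer = Sequence([
    Whitespace(),
    Split(Regex(r"_FU|_U"), behavior="isolated"),
])

trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[PAD]", "_FU", "_U"],  # keep the markers atomic
)
tok.train_from_iterator(["rondônia rondônia_FU rondônia_U"], trainer=trainer)
```

Is that the right direction, or is there a cleaner way?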


Plan B

If not, is it possible to remove tokens from the vocabulary?
I mean, I could remove every token that ends in _FU (except '##_FU' itself). That would force the tokenizer to tokenize these words the way I want, because when tokenizing words ending in _FU it wouldn't find any token longer than '##_FU' in the vocabulary anymore.
Does that make sense? Is removing tokens from the vocabulary possible?
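
What I have in mind for Plan B is roughly this (a sketch; I don't know if reassigning the model like this is the intended way, and the '[UNK]' token and the re-indexing of ids are assumptions):

```python
from tokenizers.models import WordPiece

vocab = tok_wpc.get_vocab()  # {token: id}, currently only in memory

# Drop every token ending in a marker, except the bare continuation pieces.
kept = [t for t in sorted(vocab, key=vocab.get)
        if not (t.endswith("_FU") and t != "##_FU")
        and not (t.endswith("_U") and t != "##_U")]

# Rebuild the WordPiece model with the filtered vocabulary (ids get re-indexed).
tok_wpc.model = WordPiece({t: i for i, t in enumerate(kept)}, unk_token="[UNK]")
```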


Side note: I haven't saved the vocabulary to disk; it seems to exist only in memory while the code is running.
Since it is a customized tokenizer, it won't let me save it:
tok_wpc.save("tokenizer.json")
Returns the error:

Exception: Custom PreTokenizer cannot be serialized

So I can't save it externally, nor see the generated vocabulary…
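
Would something like this be an acceptable workaround? (Not tested end-to-end; the 'my_wordpiece' prefix and the Whitespace stand-in are just placeholders.)

```python
from tokenizers.pre_tokenizers import Whitespace

# Inspect the vocabulary that currently lives only in memory:
print(len(tok_wpc.get_vocab()))

# Dump just the model's vocab.txt, which doesn't involve the pre-tokenizer:
tok_wpc.model.save(".", "my_wordpiece")

# Or temporarily swap in a serializable pre-tokenizer just to save tokenizer.json:
custom_pretok = tok_wpc.pre_tokenizer
tok_wpc.pre_tokenizer = Whitespace()
tok_wpc.save("tokenizer.json")
tok_wpc.pre_tokenizer = custom_pretok  # restore for actual use
```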

Thanks in advance.