Avoid creating certain tokens when training a tokenizer

Hello.
Is there a way to train a tokenizer from scratch but force the algorithm (let's say WordPiece) not to include certain tokens in the vocabulary, based on a given rule?

For instance, let's say we have the words:

['rondônia', 'Rondônia', 'RONDÔNIA']

Training a lowercase WordPiece tokenizer on my dataset tokenizes all three of them as ['rondôn', '##ia'].

However, I tried something different. I preprocess them (using a custom pre-tokenizer, a simplified sketch of which is below) into, respectively:
['rondônia', 'rondônia_FU', 'rondônia_U']
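
The custom pre-tokenizer is roughly this sketch (simplified and illustrative, not my exact code; the class name and the marker handling are just for the example):

```python
import re
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


class CaseMarkingPreTokenizer:
    """Split on whitespace, lowercase each word, and append a case marker."""

    def split_and_mark(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        pieces = []
        text = str(normalized)
        for match in re.finditer(r"\S+", text):
            piece = normalized[match.start():match.end()]
            word = str(piece)
            piece.lowercase()
            if word.isupper():
                piece.append("_U")    # RONDÔNIA -> rondônia_U
            elif word[:1].isupper():
                piece.append("_FU")   # Rondônia -> rondônia_FU
            pieces.append(piece)
        return pieces

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.split_and_mark)


# attached like this:
# tok_wpc.pre_tokenizer = PreTokenizer.custom(CaseMarkingPreTokenizer())
```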

I tried tokenizing, but I got:
['rondôn', '##ia'], ['rondôn', '##ia_FU'], ['rondôn', '##ia_U']

This is not what I expected. I expected:
['rondôn', '##ia'], ['rondôn', '##ia', '##_FU'], ['rondôn', '##ia', '##_U']
(or alternatively ['rondônia'], ['rondônia', '##_FU'], ['rondônia', '##_U'])

I want to train the tokenizer in such a way that '_FU' and '_U' never end up inside larger tokens: they should be an exception, so that when the tokenizer finds these markers in a word it splits them off separately, much like punctuation…

Is this possible? Is there such an option when training a (WordPiece) tokenizer?
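
The closest thing I could think of is the built-in Split pre-tokenizer in 'isolated' mode, plus registering the markers as special tokens so they stay atomic, but as far as I can tell this would give bare '_FU' / '_U' tokens rather than '##_FU' / '##_U'. A rough sketch of that idea (the vocab size and special tokens are just placeholders):

```python
from tokenizers import Regex, Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Sequence, Split, Whitespace
from tokenizers.trainers import WordPieceTrainer

tok = Tokenizer(WordPiece(unk_token="[UNK]"))

# Split the case markers off as isolated pre-tokens so they can never be
# merged into a larger token.
tok.pre_tokenizer = Sequence([
    Whitespace(),
    Split(Regex(r"_FU|_U"), behavior="isolated"),
])

trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[PAD]", "_FU", "_U"],  # keep the markers atomic
)
tok.train_from_iterator(["rondônia rondônia_FU rondônia_U"], trainer=trainer)
```

Is that the right direction, or is there a cleaner way?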


Plan B

If not, is it possible to remove tokens from the vocabulary?
I mean, I could remove every token that ends in _FU (except '##_FU' itself). That would force the tokenizer to tokenize these words the way I want, because when tokenizing words ending in _FU it wouldn't find any token longer than '##_FU' in the vocabulary anymore.
Does that make sense? Is removing tokens from the vocabulary possible?
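
What I have in mind for Plan B is roughly this (a sketch; I don't know if reassigning the model like this is the intended way, and the '[UNK]' token and the re-indexing of ids are assumptions):

```python
from tokenizers.models import WordPiece

vocab = tok_wpc.get_vocab()  # {token: id}, currently only in memory

# Drop every token ending in a marker, except the bare continuation pieces.
kept = [t for t in sorted(vocab, key=vocab.get)
        if not (t.endswith("_FU") and t != "##_FU")
        and not (t.endswith("_U") and t != "##_U")]

# Rebuild the WordPiece model with the filtered vocabulary (ids get re-indexed).
tok_wpc.model = WordPiece({t: i for i, t in enumerate(kept)}, unk_token="[UNK]")
```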


Side note: I haven't saved the vocabulary to disk; it seems to exist only in memory while the code is running.
Since it is a customized tokenizer, it won't let me save it:
tok_wpc.save("tokenizer.json")
Returns the error:

Exception: Custom PreTokenizer cannot be serialized

So I can't save it externally, nor see the generated vocabulary…
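
Would something like this be an acceptable workaround? (Not tested end-to-end; the 'my_wordpiece' prefix and the Whitespace stand-in are just placeholders.)

```python
from tokenizers.pre_tokenizers import Whitespace

# Inspect the vocabulary that currently lives only in memory:
print(len(tok_wpc.get_vocab()))

# Dump just the model's vocab.txt, which doesn't involve the pre-tokenizer:
tok_wpc.model.save(".", "my_wordpiece")

# Or temporarily swap in a serializable pre-tokenizer just to save tokenizer.json:
custom_pretok = tok_wpc.pre_tokenizer
tok_wpc.pre_tokenizer = Whitespace()
tok_wpc.save("tokenizer.json")
tok_wpc.pre_tokenizer = custom_pretok  # restore for actual use
```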

Thanks in advance.