Avoid creating certain tokens when training a tokenizer

Hello.
Is there a way to train a tokenizer from scratch but force the algorithm (let's say, WordPiece) not to include certain tokens in the vocabulary, based on a given rule?

For instance, let's say we have the words:

['rondônia', 'Rondônia', 'RONDÔNIA']

Training a lowercase WordPiece tokenizer on my dataset tokenizes them as ['rondôn', '##ia']

However, I tried something different: I preprocess them (using a custom pre-tokenizer) into, respectively:
['rondônia', 'rondônia_FU', 'rondônia_U']

I tried tokenizing, but I got:
['rondôn', '##ia'], ['rondôn', '##ia_FU'], ['rondôn', '##ia_U']

This is not what I expected. I expected:
['rondôn', '##ia'], ['rondôn', '##ia', '##_FU'], ['rondôn', '##ia', '##_U']
(or alternatively ['rondônia'], ['rondônia', '##_FU'], ['rondônia', '##_U'])

I want to train the tokenizer in such a way that '_FU' and '_U' never become part of larger tokens. They should be an exception: whenever the tokenizer finds these markers inside a word, it should break them off separately, as if they were punctuation…

Is it possible? Is there such an option when training a (WordPiece) tokenizer?
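One direction I am considering (I'm not sure it is the intended option) is to replace my fully custom pre-tokenizer with the library's built-in `Split` pre-tokenizer, so the markers are isolated as their own pre-tokens before WordPiece ever sees them and the trainer cannot merge them into larger tokens. A minimal sketch, assuming the markers always appear literally as `_FU`/`_U`:

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Sequence, Split, Whitespace

# Split off the markers as standalone pre-tokens. The pattern "_FU|_U"
# is my assumption about how the markers appear in the preprocessed text.
pre = Sequence([
    Whitespace(),
    Split(Regex("_FU|_U"), behavior="isolated"),
])

# Inspect what the trainer would see for each word.
pieces = [piece for piece, _ in
          pre.pre_tokenize_str("rondônia rondônia_FU rondônia_U")]
print(pieces)  # → ['rondônia', 'rondônia', '_FU', 'rondônia', '_U']
```

One caveat: with this approach the markers would end up as standalone tokens like '_FU' rather than continuation pieces like '##_FU', so it is closer to my second expected variant.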


Plan B

If not, is it possible to remove tokens from the vocabulary?
I mean, I could remove every token that ends in _FU (except '##_FU'). Doing this would force the tokenizer to tokenize the way I want, because it wouldn't find any token longer than '##_FU' when tokenizing words that end in _FU; those tokens wouldn't be in the vocabulary anymore.
Does that make sense?
Is it possible to remove tokens from the vocabulary?
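To illustrate what I mean, here is a toy sketch of the filtering I have in mind. The vocab dict is a made-up stand-in for a trained WordPiece vocabulary, and `filter_vocab` is my own hypothetical helper, not a `tokenizers` API:

```python
# Toy stand-in for a trained WordPiece vocab (token -> id).
vocab = {"rondôn": 0, "##ia": 1, "##ia_FU": 2, "##_FU": 3, "rondônia_FU": 4}

def filter_vocab(vocab, marker="_FU", keep=("##_FU",)):
    """Drop every token ending in `marker` except those in `keep`,
    then re-assign contiguous ids in the original id order."""
    kept = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])
            if not tok.endswith(marker) or tok in keep]
    return {tok: i for i, tok in enumerate(kept)}

print(filter_vocab(vocab))  # → {'rondôn': 0, '##ia': 1, '##_FU': 2}
```

The open question is then how to load such a filtered vocab back into a WordPiece model.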


Side note: I didn't save the vocabulary to disk; it only exists in memory while the code is running.
Since it is a customized tokenizer, it won't let me save it:
tok_wpc.save("tokenizer.json")
Returns the error:

Exception: Custom PreTokenizer cannot be serialized

So I can't save it externally, nor even inspect the generated vocabulary…
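As a workaround I am considering dumping the vocabulary myself with `get_vocab()`, which seems to work regardless of whether the pre-tokenizer is serializable. A self-contained sketch, with the built-in `Whitespace` pre-tokenizer standing in for my custom one:

```python
import json
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Train a tiny WordPiece in memory; Whitespace stands in for the custom
# pre-tokenizer. The point is only that get_vocab() lets me persist the
# learned vocabulary by hand even when save() refuses.
tok = Tokenizer(WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=100, special_tokens=["[UNK]"])
tok.train_from_iterator(["rondônia rondônia_FU rondônia_U"], trainer)

# Persist the token -> id mapping as plain JSON.
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(tok.get_vocab(), f, ensure_ascii=False, sort_keys=True)
```

Another idea that might work: temporarily assign a built-in pre-tokenizer to `tok_wpc.pre_tokenizer` just before calling `save()`, then restore the custom one in memory afterwards.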

Thanks in advance.