Hi, I've noticed that many attributes of the tokenizer are very expensive to compute, for example:
```python
def __len__(self):
    """
    Size of the full vocabulary with the added tokens. Counts the `keys` and not the `values` because otherwise if
    there is a hole in the vocab, we will add tokenizers at a wrong index.
    """
    return len(set(self.get_vocab().keys()))
```
Therefore, I want to cache the results of some of these tokenizer attributes. I do see some functions that can change a tokenizer, but as far as I can tell they are only called during initialization, and they are marked as private by an `_` prefix, e.g. `_update_trie`/`_add_tokens`.
Is it safe to assume that a tokenizer does not change after initialization? If not, what are some typical use cases for changing a tokenizer after initialization?
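For context, the kind of caching I have in mind is a thin wrapper with explicit invalidation, so the cache can be reset if the tokenizer does change. This is just a sketch of my intent, not transformers API; the wrapper class and its `invalidate` method are hypothetical names, and it only assumes the wrapped object exposes `get_vocab()` like a Hugging Face tokenizer:

```python
class CachedLenTokenizer:
    """Hypothetical wrapper that caches a tokenizer's expensive __len__.

    Assumes the wrapped object exposes get_vocab() returning a dict,
    like a Hugging Face tokenizer. The cached value goes stale if the
    underlying vocab changes, so invalidate() must be called after any
    mutation (e.g. add_tokens / add_special_tokens).
    """

    def __init__(self, tokenizer):
        self._tokenizer = tokenizer
        self._cached_len = None  # None means "not computed yet"

    def __len__(self):
        if self._cached_len is None:
            # Same computation as the original __len__, done once.
            self._cached_len = len(set(self._tokenizer.get_vocab().keys()))
        return self._cached_len

    def invalidate(self):
        # Reset the cache after any operation that changes the vocab.
        self._cached_len = None
```

Whether this is safe hinges exactly on my question above: if nothing mutates the tokenizer after initialization, the cache never needs invalidating; if mutation is a supported use case, every mutating call site would need to call `invalidate()`.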