Hi, I've noticed that many attributes of the tokenizer are very expensive to compute, for example:
```python
def __len__(self):
    """
    Size of the full vocabulary with the added tokens. Counts the `keys` and not the `values` because otherwise if
    there is a hole in the vocab, we will add tokenizers at a wrong index.
    """
    return len(set(self.get_vocab().keys()))
```
Therefore, I want to cache the results of some of these tokenizer attributes. I do see some functions that can change a tokenizer, but as far as I can tell they are only called during initialization, and they are marked as private by an `_` prefix, e.g. `_update_trie`/`_add_tokens`.
Is it safe to assume that a tokenizer does not change after initialization? If not, what are some typical use cases for changing a tokenizer after initialization?
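For context, the kind of caching I have in mind is a thin wrapper with explicit invalidation, so the cache can be reset if the tokenizer does change. This is just a sketch of my intent, not transformers API; the wrapper class and its `invalidate` method are hypothetical names, and it only assumes the wrapped object exposes `get_vocab()` like a Hugging Face tokenizer:

```python
class CachedLenTokenizer:
    """Hypothetical wrapper that caches a tokenizer's expensive __len__.

    Assumes the wrapped object exposes get_vocab() returning a dict,
    like a Hugging Face tokenizer. The cached value goes stale if the
    underlying vocab changes, so invalidate() must be called after any
    mutation (e.g. add_tokens / add_special_tokens).
    """

    def __init__(self, tokenizer):
        self._tokenizer = tokenizer
        self._cached_len = None  # None means "not computed yet"

    def __len__(self):
        if self._cached_len is None:
            # Same computation as the original __len__, done once.
            self._cached_len = len(set(self._tokenizer.get_vocab().keys()))
        return self._cached_len

    def invalidate(self):
        # Reset the cache after any operation that changes the vocab.
        self._cached_len = None
```

Whether this is safe hinges exactly on my question above: if nothing mutates the tokenizer after initialization, the cache never needs invalidating; if mutation is a supported use case, every mutating call site would need to call `invalidate()`.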