After training a tokenizer, several files are generated:
- merges.txt
- special_tokens_map.json
- tokenizer.json
- tokenizer_config.json
- vocab.json
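
For reference, this is roughly how the tokenizer was trained and saved. This is a minimal sketch assuming the Hugging Face `tokenizers` library, with `corpus.txt` and `output_dir` as placeholder paths:

```python
from tokenizers import SentencePieceBPETokenizer

# Train on a plain-text corpus ("corpus.txt" is a placeholder path).
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<unk>"],
)

# save_model() writes vocab.json and merges.txt;
# save() writes the single-file tokenizer.json.
tokenizer.save_model("output_dir")
tokenizer.save("output_dir/tokenizer.json")
```

The other two files (`special_tokens_map.json` and `tokenizer_config.json`) are typically written when the tokenizer is wrapped in a `transformers` `PreTrainedTokenizerFast` and saved with `save_pretrained`.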
However, none of these files stores the frequencies of the tokens found in the training dataset. Does this mean the training process does not retain such data?
I ask because I noticed that some words that appear frequently in the dataset are not included in the token list, while some words with fewer occurrences are.
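
To check whether a particular word made it into the vocabulary, the saved tokenizer can be loaded and inspected. A minimal sketch, with `tokenizer.json` and the example words as placeholders:

```python
from tokenizers import Tokenizer

# Load the saved tokenizer and get its vocabulary (token -> id).
# Note: the vocabulary maps tokens to ids only; no counts are stored.
tok = Tokenizer.from_file("tokenizer.json")
vocab = tok.get_vocab()

for word in ["frequent_word", "rare_word"]:  # placeholder words
    # SentencePiece-style tokens mark word beginnings with "▁" (U+2581),
    # so a whole word appears in the vocab as "▁" + word.
    in_vocab = ("▁" + word) in vocab
    pieces = tok.encode(word).tokens  # how the word is split if absent
    print(word, in_vocab, pieces)
```

A word that is absent from the vocabulary is not lost; it is simply encoded as a sequence of smaller subword pieces.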
How does a SentencePieceBPETokenizer choose tokens from a dataset?