After training a tokenizer, several files are generated:
- merges.txt
- special_tokens_map.json
- tokenizer.json
- tokenizer_config.json
- vocab.json
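
For reference, this is roughly how the tokenizer was trained and saved. This is a minimal sketch assuming the Hugging Face `tokenizers` library, with `corpus.txt` and `output_dir` as placeholder paths:

```python
from tokenizers import SentencePieceBPETokenizer

# Train on a plain-text corpus ("corpus.txt" is a placeholder path).
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<unk>"],
)

# save_model() writes vocab.json and merges.txt;
# save() writes the single-file tokenizer.json.
tokenizer.save_model("output_dir")
tokenizer.save("output_dir/tokenizer.json")
```

The other two files (`special_tokens_map.json` and `tokenizer_config.json`) are typically written when the tokenizer is wrapped in a `transformers` `PreTrainedTokenizerFast` and saved with `save_pretrained`.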
However, none of these files stores the frequencies of the tokens found in the training dataset. Does this mean the training process does not retain such data?
I ask because I noticed that some words that appear frequently in the dataset are not included in the token list, while some words with fewer occurrences are.
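
To check whether a particular word made it into the vocabulary, the saved tokenizer can be loaded and inspected. A minimal sketch, with `tokenizer.json` and the example words as placeholders:

```python
from tokenizers import Tokenizer

# Load the saved tokenizer and get its vocabulary (token -> id).
# Note: the vocabulary maps tokens to ids only; no counts are stored.
tok = Tokenizer.from_file("tokenizer.json")
vocab = tok.get_vocab()

for word in ["frequent_word", "rare_word"]:  # placeholder words
    # SentencePiece-style tokens mark word beginnings with "▁" (U+2581),
    # so a whole word appears in the vocab as "▁" + word.
    in_vocab = ("▁" + word) in vocab
    pieces = tok.encode(word).tokens  # how the word is split if absent
    print(word, in_vocab, pieces)
```

A word that is absent from the vocabulary is not lost; it is simply encoded as a sequence of smaller subword pieces.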
How does a SentencePieceBPETokenizer choose tokens from a dataset?