I cannot find `SentencePieceBPETokenizer` in the official documentation, so I am asking here. I referred to the source code of `SentencePieceBPETokenizer` to write my code for training a custom tokenizer based on it.
Question 1: does the following code make sense?
```python
from tokenizers import SentencePieceBPETokenizer

special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=400000,
    min_frequency=10,
    show_progress=True,
    special_tokens=special_tokens,
)
```
where `get_training_corpus()` is a generator function that yields batches of sentences from a list of strings.
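For reference, a minimal sketch of what `get_training_corpus()` looks like (the corpus contents and batch size here are placeholders):

```python
def get_training_corpus(batch_size=1000):
    # corpus stands in for my real list of strings
    corpus = ["this is a sentence", "this is another sentence"] * 5000
    # yield the corpus in fixed-size batches
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]
```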
I notice that there are `Trainer` classes, but none of them is SentencePiece-specific. Also, the `Tokenizer` documentation specifies that the parameters of `train_from_iterator()` are an `iterator` and a `trainer` object, but my call passes keyword arguments that follow the `Trainer` class instead.
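For comparison, the signature described in the `Tokenizer` documentation would be used like this (a sketch with `BpeTrainer`, since there is no SentencePiece-specific trainer; the values mirror my call above):

```python
from tokenizers import Tokenizer, models, trainers

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(
    vocab_size=400000,
    min_frequency=10,
    special_tokens=["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"],
)
# the base class takes the hyperparameters via a Trainer object,
# not as direct keyword arguments
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
```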
Question 2: No error occurred; can I assume the keyword arguments are correct?
Question 3: What is the use of `initial_alphabet` and `limit_alphabet`? The documentation does not explain them clearly. Any example?
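For concreteness, based on the `train_from_iterator()` signature of `SentencePieceBPETokenizer`, I would pass them like this (my guesses about their meaning are in the comments):

```python
tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=400000,
    min_frequency=10,
    special_tokens=special_tokens,
    initial_alphabet=["0", "1", "2"],  # characters to seed the alphabet with, even if unseen?
    limit_alphabet=1000,  # cap on the number of distinct characters kept?
)
```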
p.s. I know `show_progress=True` does not work in Jupyter notebooks.
p.s. I am using tokenizers 0.13.3.