I cannot find `SentencePieceBPETokenizer` in the official documentation, so I am asking here. I referred to the source code of `SentencePieceBPETokenizer` to write my code for training a custom tokenizer based on it.
Question 1: does the following code make sense?
```python
from tokenizers import SentencePieceBPETokenizer

special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=400000,
    min_frequency=10,
    show_progress=True,
    special_tokens=special_tokens,
)
```
where `get_training_corpus()` is a generator function that yields batches of sentences from a list of strings.
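For reference, a minimal sketch of what `get_training_corpus()` looks like (the corpus contents and batch size here are placeholders):

```python
def get_training_corpus(batch_size=1000):
    # corpus stands in for my real list of strings
    corpus = ["this is a sentence", "this is another sentence"] * 5000
    # yield the corpus in fixed-size batches
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]
```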
I notice that there are `Trainer` classes, but none of them is SentencePiece-specific. Also, the `Tokenizer` documentation specifies that the parameters of `train_from_iterator()` are an `iterator` and a `trainer` object, but my call passes keyword arguments that follow the `Trainer` class instead.
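For comparison, the signature described in the `Tokenizer` documentation would be used like this (a sketch with `BpeTrainer`, since there is no SentencePiece-specific trainer; the values mirror my call above):

```python
from tokenizers import Tokenizer, models, trainers

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(
    vocab_size=400000,
    min_frequency=10,
    special_tokens=["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"],
)
# the base class takes the hyperparameters via a Trainer object,
# not as direct keyword arguments
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
```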
Question 2: No error occurred; can I assume the keyword arguments are correct?
Question 3: What is the use of `initial_alphabet` and `limit_alphabet`? The documentation does not explain them clearly. Any example?
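For concreteness, based on the `train_from_iterator()` signature of `SentencePieceBPETokenizer`, I would pass them like this (my guesses about their meaning are in the comments):

```python
tokenizer.train_from_iterator(
    get_training_corpus(),
    vocab_size=400000,
    min_frequency=10,
    special_tokens=special_tokens,
    initial_alphabet=["0", "1", "2"],  # characters to seed the alphabet with, even if unseen?
    limit_alphabet=1000,  # cap on the number of distinct characters kept?
)
```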
p.s. I know `show_progress=True` does not work in Jupyter notebooks.
p.s. I am using tokenizers 0.13.3.