Hi. I have trained a BPE Tokenizer successfully using:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token=unk_token))
trainer = BpeTrainer(special_tokens=special_tokens)
tokenizer.train(text_files, trainer)
But I lack things like tokenizer.pad_token to access the padding token, or len(tokenizer) to get the vocabulary size. Is this the best way to train a BPE tokenizer? How can I get the same API as PreTrainedTokenizer? Thanks in advance for any help you can provide.
I'm not quite sure which problems you ran into, so I'll just explain how to handle padding and how to get the vocab size here:
- Padding: the reason we add [PAD] to the special tokens in the training phase is to reserve an entry for it in the vocabulary; at that point we have not yet told the tokenizer what each special token is for. After training, once you have a trained tokenizer instance, you can call tokenizer.enable_padding(pad_id=tokenizer.token_to_id('[PAD]'), pad_token='[PAD]') to tell the tokenizer that [PAD], which is already in the vocab, should play the role of the padding token. Once .enable_padding() is set, batch encodings will be padded to a common length; see the sketch after this list.
- Vocab size: use tokenizer.get_vocab_size() (the len(tokenizer) shortcut you mention belongs to the transformers tokenizer API, not to a plain tokenizers.Tokenizer).
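For concreteness, here is a minimal end-to-end sketch of both points, assuming the same tokenizers library as in your snippet; the corpus path and the special-token list are made-up placeholders:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Placeholder training setup: "corpus.txt" and the token list are illustrative only.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(["corpus.txt"], trainer)

# Point the padding machinery at the [PAD] entry that training put in the vocab.
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
)

# encode_batch pads every sequence to the longest one in the batch.
encodings = tokenizer.encode_batch(["a short sentence", "a noticeably longer example sentence"])
for enc in encodings:
    print(enc.tokens)          # the shorter entry ends with [PAD] tokens
    print(enc.attention_mask)  # 0 marks the padded positions

# Vocabulary size of the trained tokenizer.
print(tokenizer.get_vocab_size())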