Train wordpiece from scratch

I am pre-training a BERT model from scratch. To do that, I first need to train a WordPiece tokenizer, and I am using `BertWordPieceTokenizer` for this.

My question: should I train the tokenizer on the whole corpus, which is huge, or is training it on a sample enough?

Is there a way to tell the tokenizer to train only on a sample?
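For what it's worth, here is a minimal sketch of the second option (file names `big_corpus.txt` and `sample.txt` are hypothetical placeholders): since the trainer just takes a list of file paths, "training on a sample" amounts to writing a random subset of lines to a new file and passing that file instead of the full corpus.

```python
import random

# Stand-in for your real corpus (one sentence per line); replace with your data.
with open("big_corpus.txt", "w") as f:
    f.writelines(f"sentence number {i}\n" for i in range(10_000))

# Keep roughly 1 line in 10 as the training sample.
random.seed(0)
kept = 0
with open("big_corpus.txt") as src, open("sample.txt", "w") as dst:
    for line in src:
        if random.random() < 0.10:
            dst.write(line)
            kept += 1

# The tokenizer would then be trained on the sample only, e.g.:
# tokenizer.train(files=["sample.txt"], ...)
```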



You can simply train it on the whole corpus. With HuggingFace Tokenizers, that takes seconds. From the README: "Takes less than 20 seconds to tokenize a GB of text on a server's CPU".


Thanks again, nielsr!