Train WordPiece from scratch

Hi,
I am pre-training a BERT model from scratch. For that, I first need to train a WordPiece tokenizer; I am using BertWordPieceTokenizer for this.

My question:
Should I train the tokenizer on the whole corpus, which is huge, or is training it on a sample enough?

Is there a way to tell the tokenizer to train on only a sample?

Thanks.


Yes, you can train it on the whole corpus. With HuggingFace Tokenizers it only takes seconds; from the README: “Takes less than 20 seconds to tokenize a GB of text on a server’s CPU”.
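
For reference, a minimal sketch of training BertWordPieceTokenizer on one or more plain-text files; the file name and vocab size below are placeholders, not anything specific to your setup:

```python
from tokenizers import BertWordPieceTokenizer

# Initialize an uncased BERT-style WordPiece tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=True,
    lowercase=True,
)

# "corpus.txt" is a placeholder; pass a list of your own corpus files
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30522,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt to the current directory
tokenizer.save_model(".")
```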


Thanks again, Nielsr.