What model checkpoint do I use if I trained a WordPiece tokenizer?

Hi,

I have just trained my own tokenizer from scratch, a WordPiece model like BERT's, and I have saved it.

From there, I now want to train my own language model from scratch using the tokenizer I trained beforehand.

However, referring to the code below, what should I change model_checkpoint to?

model_checkpoint = "gpt2"  # checkpoint whose architecture/config the model is built from
tokenizer_checkpoint = "drive/wordpiece-like-tokenizer"  # my saved tokenizer

I trained a WordPiece model like BERT's, so should gpt2 be changed to something else?

Thanks.

You should change it to "bert-base-cased", for instance. Your tokenizer follows BERT's WordPiece scheme, so a BERT checkpoint's configuration is the right starting point, whereas gpt2 uses byte-level BPE.
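A minimal sketch of how that fits together, assuming you saved the tokenizer with save_pretrained and want a BERT-style masked-language-modeling objective (from_config gives randomly initialized weights, which is what you want when training from scratch rather than fine-tuning):

from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM

model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "drive/wordpiece-like-tokenizer"

# Load your trained tokenizer from its saved directory
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)

# Reuse the BERT architecture, but match the vocab size to your tokenizer
config = AutoConfig.from_pretrained(model_checkpoint, vocab_size=len(tokenizer))

# Randomly initialized model (training from scratch), not pretrained weights
model = AutoModelForMaskedLM.from_config(config)

If you only saved a raw tokenizer.json from the tokenizers library, wrap it with PreTrainedTokenizerFast(tokenizer_file=...) instead of AutoTokenizer.from_pretrained.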


Thanks. Also, how much training and test data do you recommend for training a language model from scratch?