What model checkpoint do I use if I trained a WordPiece tokenizer?

Hi,

I have just trained my own tokenizer from scratch, a WordPiece model like BERT's, and I have saved it.

From there, I now want to train my own language model from scratch using the tokenizer I trained beforehand.

However, referring to the code below, what should I change model_checkpoint to?

model_checkpoint = "gpt2"  # checkpoint whose architecture/config the model is built from
tokenizer_checkpoint = "drive/wordpiece-like-tokenizer"  # my saved tokenizer

I trained a WordPiece model like BERT's, so should gpt2 be changed to something else?

Thanks.

You should change it to "bert-base-cased", for instance. Your tokenizer follows BERT's WordPiece scheme, so a BERT checkpoint's configuration is the right starting point, whereas gpt2 uses byte-level BPE.
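A minimal sketch of how that fits together, assuming you saved the tokenizer with save_pretrained and want a BERT-style masked-language-modeling objective (from_config gives randomly initialized weights, which is what you want when training from scratch rather than fine-tuning):

from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM

model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "drive/wordpiece-like-tokenizer"

# Load your trained tokenizer from its saved directory
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)

# Reuse the BERT architecture, but match the vocab size to your tokenizer
config = AutoConfig.from_pretrained(model_checkpoint, vocab_size=len(tokenizer))

# Randomly initialized model (training from scratch), not pretrained weights
model = AutoModelForMaskedLM.from_config(config)

If you only saved a raw tokenizer.json from the tokenizers library, wrap it with PreTrainedTokenizerFast(tokenizer_file=...) instead of AutoTokenizer.from_pretrained.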


Thanks. Also, how much training and test data do you recommend for training a language model from scratch?