Train GPT2 on wikitext from scratch

Hello everyone,

I would like to train GPT-2 on wikitext from scratch (not fine-tune a pre-trained model). I launched the following script from the language-modeling examples folder.

python run_clm.py \
    --model_type gpt2 \
    --tokenizer_name gpt2 \
    --block_size 256 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --overwrite_output_dir \
    --num_train_epochs 1 \
    --output_dir /tmp/test-clm

Now I have two questions:
1- Is what I did indeed a correct approach to training GPT-2 from scratch?
2- What hyperparameters should I use for this task? (As far as I can tell, the hyperparameters suggested in the existing examples in the Hugging Face repo are for fine-tuning a pre-trained model.)

I can confirm the command is correct if you want to train from scratch. As for hyperparameters, you will need to tune them a bit, but the defaults should not be too bad.
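In case it helps, the training hyperparameters are exposed as regular Trainer command-line flags in run_clm.py, so you can override the defaults without touching the script. A rough sketch of what that looks like; the values below are only illustrative starting points, not tuned for wikitext:

python run_clm.py \
    --model_type gpt2 \
    --tokenizer_name gpt2 \
    --block_size 256 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --num_train_epochs 10 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-4 \
    --lr_scheduler_type cosine \
    --warmup_steps 500 \
    --weight_decay 0.01 \
    --overwrite_output_dir \
    --output_dir /tmp/test-clm

You would still want to monitor the eval loss and adjust the learning rate, batch size, and number of epochs from there.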


Thanks for your time and reply. I was also wondering how many epochs you would suggest for training from scratch.

Is there any update in this thread? I would like to know how many epochs are suggested for training GPT-2 on wikitext from scratch.

I have a custom vocab and for this reason want to use the BertWordPieceTokenizer. How would I do that with the new run_clm.py script?
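One approach that should work with run_clm.py's --tokenizer_name argument is to train the WordPiece tokenizer with the tokenizers library, save it in the transformers format, and point the script at that directory. A minimal sketch, assuming your raw text lives in train.txt and using an illustrative my_tokenizer output directory:

import os
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizerFast

os.makedirs("my_tokenizer", exist_ok=True)

# Train a WordPiece vocab on the raw text file(s).
wp_tokenizer = BertWordPieceTokenizer(lowercase=True)
wp_tokenizer.train(files=["train.txt"], vocab_size=30_000, min_frequency=2)
wp_tokenizer.save_model("my_tokenizer")  # writes my_tokenizer/vocab.txt

# Wrap the vocab in a transformers fast tokenizer and save the full
# tokenizer config so the directory can be loaded by run_clm.py.
hf_tokenizer = BertTokenizerFast(vocab_file="my_tokenizer/vocab.txt", do_lower_case=True)
hf_tokenizer.save_pretrained("my_tokenizer")

Then pass --tokenizer_name my_tokenizer to run_clm.py. You may also want to check that the model config's vocab_size matches the tokenizer's vocabulary, since the default GPT-2 config assumes the 50k BPE vocab.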

@Hamidreza did you find out how long GPT-2 takes to train from scratch?