How to pretrain a randomized language model on a custom dataset

I have randomized the weights of a causal language model (gpt2, to be specific). I want to pretrain the model on my own data for next-token prediction. What is the best way to do this? Are there any notebooks online that would be useful?
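
For context, here is a rough sketch of what I am attempting, assuming the Hugging Face `transformers` and `datasets` libraries; `train.txt`, the hyperparameters, and the output directory are just placeholders, not my actual setup:

```python
from transformers import (
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Build GPT-2 from a config only, so the weights are randomly initialized
# instead of loading the pretrained checkpoint
config = GPT2Config()
model = GPT2LMHeadModel(config)

# Reuse the pretrained tokenizer (only the model weights are random)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# "train.txt" is a placeholder for my own data, one document per line
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives the causal (next-token prediction) objective
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-from-scratch",   # placeholder output path
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-4,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Is this a reasonable approach, or is there a better-established recipe (e.g., an official notebook) for training from scratch like this?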