Hi @iamnotapenguin, I would start by adapting the following causal language modelling script to your dataset: transformers/run_clm.py at master · huggingface/transformers · GitHub
This script lets you specify both the tokenizer and the model architecture, and it supports multi-GPU training, which is advisable if you’re training from scratch.
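In case it's useful, here is a rough sketch of how the launch command might be put together. It's only a small helper that assembles the arguments; the file path, tokenizer name, output directory and GPU count are placeholders I made up, and the helper `build_clm_command` is hypothetical, not part of the library. As far as I know, leaving out `--model_name_or_path` and passing `--model_type` plus `--tokenizer_name` is what makes the script initialize a fresh model, but please check the flags against the version of `run_clm.py` you have checked out.

```python
import shlex


def build_clm_command(train_file, tokenizer_name, model_type="gpt2",
                      output_dir="clm-from-scratch", num_gpus=1):
    """Assemble a launch command for run_clm.py (hypothetical helper).

    All argument values are placeholders; adjust them to your setup.
    """
    script_args = [
        "run_clm.py",
        # No --model_name_or_path here: the intent is to train from scratch
        # using only an architecture type and a separately chosen tokenizer.
        "--model_type", model_type,
        "--tokenizer_name", tokenizer_name,
        "--train_file", train_file,
        "--do_train",
        "--output_dir", output_dir,
    ]
    if num_gpus > 1:
        # Assumes PyTorch's torchrun launcher is available for multi-GPU runs.
        launcher = ["torchrun", f"--nproc_per_node={num_gpus}"]
    else:
        launcher = ["python"]
    return launcher + script_args


if __name__ == "__main__":
    # Placeholder values, purely for illustration.
    cmd = build_clm_command("data/train.txt", "my-tokenizer", num_gpus=4)
    print(shlex.join(cmd))
```

You can paste the printed command into a shell, or hand the list to `subprocess.run` if you'd rather launch it from Python.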
Hope that helps!