I’ve been using the transformers library for the last several months in creative generative-text projects. I’m just a hobbyist – I mostly understand how everything works abstractly, but I definitely don’t have a firm grasp on the underlying math. I started out by fine-tuning GPT-2, but lately I’ve been playing around with training models from scratch (e.g. on arpabet, where “phonetic English” is represented as “fA0nE1tI0k I1GglI0S”), and I’m looking for resources/tips/pointers on how the various parameters will affect the models and their output.
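For context, this is roughly how I’ve been building a tokenizer for the arpabet text before training from scratch – the file name and vocab size here are just placeholders for my actual setup:

```python
from tokenizers import ByteLevelBPETokenizer
from transformers import GPT2TokenizerFast

# Train a byte-level BPE tokenizer on the phonetic corpus
# ("arpabet.txt" and vocab_size are placeholders for my real data/settings).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["arpabet.txt"],
    vocab_size=8000,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("arpabet-tokenizer")  # writes vocab.json + merges.txt

# Wrap it so it plugs into the usual transformers training code.
hf_tokenizer = GPT2TokenizerFast(
    vocab_file="arpabet-tokenizer/vocab.json",
    merges_file="arpabet-tokenizer/merges.txt",
)
hf_tokenizer.pad_token = hf_tokenizer.eos_token
```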
I started my from-scratch model training by loading a GPT2LMHeadModel with the config from the pre-trained gpt2 model, and that worked great. But I figured I could get better results if I went bigger. I tried training again using the gpt2-xl config, but the memory requirements were too high. So now I’m dialing in the parameters based on gpt2-large.
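In case it helps clarify what I’m doing, this is roughly how I instantiate the from-scratch model off the gpt2-large config (freshly initialized weights, with the vocab size swapped for my own tokenizer – the "arpabet-tokenizer" directory is from the snippet above):

```python
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# Load my from-scratch tokenizer (saved by the previous snippet).
tokenizer = GPT2TokenizerFast(
    vocab_file="arpabet-tokenizer/vocab.json",
    merges_file="arpabet-tokenizer/merges.txt",
)

# Take the gpt2-large architecture, but don't load its pre-trained weights.
config = GPT2Config.from_pretrained("gpt2-large")
config.vocab_size = tokenizer.vocab_size
config.bos_token_id = tokenizer.bos_token_id
config.eos_token_id = tokenizer.eos_token_id

model = GPT2LMHeadModel(config)  # randomly initialized, trained from scratch
print(f"{model.num_parameters():,} parameters")
```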
I’ve noticed that, with all other config values held the same, I can fit the model into memory with the default 36 layers and 20 heads, but also with something like 18 layers and 64 heads. I’d experiment with these (and other) values myself, but since training for 3 epochs over my dataset takes weeks at a time, I was wondering if anyone could point me in the right direction as to what the tradeoffs are between the number of layers, hidden size, attention heads, etc.
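This is the kind of comparison I’ve been doing by hand before committing to a weeks-long run – instantiating configs with different n_layer / n_head combinations and checking the parameter counts (the two combos below are just the examples I mentioned; n_embd has to stay divisible by n_head):

```python
from transformers import GPT2Config, GPT2LMHeadModel

base = GPT2Config.from_pretrained("gpt2-large")  # n_embd=1280 by default

# (n_layer, n_head) combos I'm weighing against each other.
for n_layer, n_head in [(36, 20), (18, 64)]:
    cfg = GPT2Config(
        vocab_size=base.vocab_size,
        n_positions=base.n_positions,
        n_embd=base.n_embd,
        n_layer=n_layer,
        n_head=n_head,
    )
    model = GPT2LMHeadModel(cfg)
    print(f"layers={n_layer:>2} heads={n_head:>2} -> {model.num_parameters():,} params")
    del model  # free memory before building the next one
```

At least that lets me compare sizes without kicking off a full training run, but I still don’t have a good intuition for what these knobs do to the quality of the output, which is what I’m really asking about.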
Thanks!