Resources for model design (number of layers, attention heads, etc.)

I’ve been using the transformers library for the last several months in creative generative text projects. I’m just a hobbyist – I mostly understand how everything works abstractly, but definitely don’t have a firm grasp on the underlying math. I started out by fine-tuning GPT-2, but lately I’ve been playing around with training models from scratch (e.g. on ARPABET, where “phonetic English” is represented as “fA0nE1tI0k I1GglI0S”), and I’m looking for resources/tips/pointers on how the various parameters will affect the models and their output.

I started my from-scratch model training by loading a GPT2LMHeadModel with the pre-trained gpt2 config, and that worked great. But I figured I could get better results by going bigger. I tried training again using the gpt2-xl config, but the memory requirements were too high. So now I’m dialing in the parameters based on gpt2-large.
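For anyone following along, a minimal sketch of that from-scratch workflow (assuming transformers and PyTorch are installed): instantiating GPT2LMHeadModel directly from a GPT2Config gives you randomly initialized weights with the gpt2 architecture, rather than the pre-trained checkpoint you’d get from `from_pretrained`.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# GPT2Config() defaults to the stock gpt2 geometry:
# 12 layers, 12 heads, 768 hidden size, 50257-token vocab.
config = GPT2Config()

# Passing a config to the constructor (instead of calling
# from_pretrained) gives random weights, ready for from-scratch training.
model = GPT2LMHeadModel(config)

print(model.config.n_layer, model.config.n_head, model.config.n_embd)
```

To scale up, you’d override fields like `n_layer`, `n_head`, and `n_embd` in the config before constructing the model.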

I’ve noticed that with all other values the same, I can fit the model into memory with the default 36 layers and 20 heads, but also with something like 18 layers and 64 heads. I’d experiment with these (and other) values, but since training for 3 epochs through my dataset takes weeks at a time, I was wondering if anyone could point me in the right direction as to what the tradeoffs are between number of layers, hidden size, attention heads, etc.
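One concrete thing worth knowing here: in the GPT-2 implementation, the head count doesn’t change the weight count at all, because the per-head dimension is just n_embd // n_head. Only the layer count and hidden size do. Here’s a back-of-the-envelope parameter counter for a GPT-2-style decoder (a sketch under the standard GPT-2 geometry: tied lm_head, 4x MLP expansion, learned position embeddings):

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, n_positions=1024):
    """Approximate parameter count for a GPT-2-style decoder.

    Note that n_head is absent: head_dim = n_embd // n_head, so 20 heads
    or 64 heads with the same n_embd cost the same number of weights.
    """
    embeddings = (vocab_size + n_positions) * n_embd  # wte + wpe (lm_head is tied to wte)
    per_layer = 12 * n_embd**2 + 13 * n_embd          # attn (4n^2+4n) + MLP (8n^2+5n) + 2 layer norms (4n)
    final_ln = 2 * n_embd
    return embeddings + n_layer * per_layer + final_ln

# gpt2-large geometry: 36 layers, n_embd=1280
print(gpt2_param_count(36, 1280))  # → 774030080 (~774M, matching gpt2-large)
print(gpt2_param_count(18, 1280))  # → 419836160 (~420M with half the stack)
```

So the weight memory of the 18-layer/64-head variant is roughly half that of the 36-layer/20-head one; the head count mostly shows up in activation memory (the attention-score tensors scale with n_head x seq_len^2 per layer) rather than in the stored weights.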


This may not be exactly what you’re looking for, but this paper explores the performance impact of different decoder configurations: attention heads, hidden layers, and intermediate layers.


Thanks for the link!

From what I can understand, this paper outlines their optimization methods and the performance of the resulting optimized model across a battery of standard metrics. That’s super interesting and useful, but what I’m after is a little more basic. And it may not even exist at all.

I was hoping for some general rules of thumb for language model design, e.g. “more attention heads make the model better at picking up on grammatical patterns,” or “having more hidden layers slows down overfitting during training,” or even “there are diminishing returns beyond 12 attention heads.”

It’s probably wishful thinking on my part that language model design is so simple.