Resources for model design (number of layers, attention heads, etc.)

I’ve been using the transformers library for the last several months in creative generative text projects. I’m just a hobbyist – I mostly understand how everything works abstractly, but definitely don’t have a firm grasp on the underlying math. I started out by fine-tuning GPT-2, but lately I’ve been playing around with training models from scratch (e.g. on ARPABET, where “phonetic English” is represented as “fA0nE1tI0k I1GglI0S”), and I’m looking for resources/tips/pointers on how the various parameters will affect the models and their output.

I started my from-scratch model training by loading a GPT2LMHeadModel with the pre-trained gpt2 config, and that worked great. But I figured I could get better results by going bigger. I tried training again using the gpt2-xl config, but the memory requirements were too high. So now I’m dialing in the parameters based on gpt2-large.
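For anyone following along, a minimal sketch of that from-scratch workflow (assuming transformers and PyTorch are installed): instantiating GPT2LMHeadModel directly from a GPT2Config gives you randomly initialized weights with the gpt2 architecture, rather than the pre-trained checkpoint you’d get from `from_pretrained`.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# GPT2Config() defaults to the stock gpt2 geometry:
# 12 layers, 12 heads, 768 hidden size, 50257-token vocab.
config = GPT2Config()

# Passing a config to the constructor (instead of calling
# from_pretrained) gives random weights, ready for from-scratch training.
model = GPT2LMHeadModel(config)

print(model.config.n_layer, model.config.n_head, model.config.n_embd)
```

To scale up, you’d override fields like `n_layer`, `n_head`, and `n_embd` in the config before constructing the model.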

I’ve noticed that with all other values the same, I can fit the model into memory with the default 36 layers and 20 heads, but also with something like 18 layers and 64 heads. I’d experiment with these (and other) values, but since training for 3 epochs through my dataset takes weeks at a time, I was wondering if anyone could point me in the right direction as to what the tradeoffs are between number of layers, hidden size, attention heads, etc.
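One concrete thing worth knowing here: in the GPT-2 implementation, the head count doesn’t change the weight count at all, because the per-head dimension is just n_embd // n_head. Only the layer count and hidden size do. Here’s a back-of-the-envelope parameter counter for a GPT-2-style decoder (a sketch under the standard GPT-2 geometry: tied lm_head, 4x MLP expansion, learned position embeddings):

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, n_positions=1024):
    """Approximate parameter count for a GPT-2-style decoder.

    Note that n_head is absent: head_dim = n_embd // n_head, so 20 heads
    or 64 heads with the same n_embd cost the same number of weights.
    """
    embeddings = (vocab_size + n_positions) * n_embd  # wte + wpe (lm_head is tied to wte)
    per_layer = 12 * n_embd**2 + 13 * n_embd          # attn (4n^2+4n) + MLP (8n^2+5n) + 2 layer norms (4n)
    final_ln = 2 * n_embd
    return embeddings + n_layer * per_layer + final_ln

# gpt2-large geometry: 36 layers, n_embd=1280
print(gpt2_param_count(36, 1280))  # → 774030080 (~774M, matching gpt2-large)
print(gpt2_param_count(18, 1280))  # → 419836160 (~420M with half the stack)
```

So the weight memory of the 18-layer/64-head variant is roughly half that of the 36-layer/20-head one; the head count mostly shows up in activation memory (the attention-score tensors scale with n_head x seq_len^2 per layer) rather than in the stored weights.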


This may not be exactly what you’re looking for, but this paper explores the performance impact of different decoder configurations: attention heads, hidden layers, and intermediate layers.


Thanks for the link!

From what I can understand, this paper outlines their optimization methods and the performance of the resulting optimized model across a battery of standard metrics. That’s super interesting and useful, but what I’m after is a little more basic. And it may not even exist at all.

I was hoping for some general rules of thumb for language model design, e.g. “more attention heads make the model better at picking up on grammatical patterns,” or “having more hidden layers slows down overfitting during training,” or even “there are diminishing returns beyond 12 attention heads.”

It’s probably wishful thinking on my part that language model design is so simple.