Thanks for the link!
From what I can understand, this paper outlines their optimization methods and the performance of the resulting optimized model across a battery of standard metrics. That's super interesting and useful, but what I'm after is a little more basic. And it may not even exist at all.
I was hoping for some general rules of thumb for language model design, e.g. "more attention heads make the model better at picking up on grammatical patterns" or "having more hidden layers slows down overfitting during training" or even "there are diminishing returns with more than 12 attention heads".
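Just to be concrete about which knobs I mean, here's a toy config sketch (the names are made up for illustration, not from any particular library):

```python
from dataclasses import dataclass

# Hypothetical config object, just to name the hyperparameters I'm asking about.
@dataclass
class ToyTransformerConfig:
    n_layers: int = 12   # number of hidden (transformer) layers
    n_heads: int = 12    # attention heads per layer
    d_model: int = 768   # hidden size; usually a multiple of n_heads
    d_ff: int = 3072     # feed-forward width, often 4 * d_model

# e.g. what happens (qualitatively) if I scale these up?
config = ToyTransformerConfig(n_layers=24, n_heads=16, d_model=1024, d_ff=4096)
```

Basically: are there known rules for how changing each of those affects what the model is good at?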
It's probably wishful thinking on my part to hope that language model design is that simple.