Thanks for the link!
From what I can understand, this paper outlines their optimization methods and the performance of the resulting optimized model across a battery of standard metrics. That's super interesting and useful, but what I'm after is a little more basic. And it may not even exist at all.
I was hoping for some general rules of thumb for language model design, e.g. "more attention heads make the model better at picking up on grammatical patterns" or "having more hidden layers slows down overfitting during training" or even "there are diminishing returns with more than 12 attention heads".
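Just to be concrete about which knobs I mean, here's a toy config sketch (the names are made up for illustration, not from any particular library):

```python
from dataclasses import dataclass

# Hypothetical config object, just to name the hyperparameters I'm asking about.
@dataclass
class ToyTransformerConfig:
    n_layers: int = 12   # number of hidden (transformer) layers
    n_heads: int = 12    # attention heads per layer
    d_model: int = 768   # hidden size; usually a multiple of n_heads
    d_ff: int = 3072     # feed-forward width, often 4 * d_model

# e.g. what happens (qualitatively) if I scale these up?
config = ToyTransformerConfig(n_layers=24, n_heads=16, d_model=1024, d_ff=4096)
```

Basically: are there known rules for how changing each of those affects what the model is good at?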
It's probably wishful thinking on my part to hope that language model design is that simple.