Best LLM architecture to pretrain?

Hi, I am working on a project where I will pre-train an LLM on a constrained, non-language domain that has plenty of data available (hence the need to pre-train from scratch), and then fine-tune it with DPO on preference pairs constructed from a supervised task. My question: is there a particular LLM architecture that would be the best choice for this? I plan for the model to have somewhere around 20 million parameters, if that makes a difference.

GPT-2 seems to be the default architecture to choose, but given its age I wonder whether there are better options: more efficient to train, more parameter-efficient, "smarter," and so on. At a minimum I was thinking of picking an architecture with rotary positional embeddings (RoPE).
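For what it's worth, here is a back-of-envelope parameter count for one possible Llama-style configuration (RoPE, SwiGLU MLP, tied embeddings) showing that ~20M parameters is easy to hit at this scale. All of the sizes below are hypothetical placeholders I picked for illustration, not recommendations:

```python
# Rough parameter count for a ~20M Llama-style decoder.
# RoPE adds no learned parameters, so there is no positional-embedding term.
# Every size here is a made-up example value.

vocab_size = 4096      # small domain-specific vocabulary (assumption)
d_model    = 384
n_layers   = 10
d_ffn      = 1024      # SwiGLU MLP uses three weight matrices per layer

embedding = vocab_size * d_model            # tied input/output embeddings
per_layer = (
    4 * d_model * d_model                   # Q, K, V, O attention projections
    + 3 * d_model * d_ffn                   # gate, up, and down MLP projections
    + 2 * d_model                           # two RMSNorm weight vectors
)
total = embedding + n_layers * per_layer + d_model  # plus the final norm

print(f"{total:,} parameters")
```

With these particular numbers the total comes out just over 19M, and you can trade depth against width (or vocabulary size against everything else) while staying in the same budget.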