Or rather, I suspect it might be an issue with the original Gemma 2 weights themselves, or a bug in the Hugging Face implementation. For example, there's SystemGemma, a model trained on top of Gemma 2, which suggests the base model's system prompt handling left something to be desired.
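As a quick sanity check on the Hugging Face side, you can see how the tokenizer's bundled chat template treats a `system` role. This is a minimal sketch; the model ID and the exact template behavior are my assumptions, so treat the output as something to verify rather than a known result:

```python
# Minimal sketch: check whether Gemma 2's Hugging Face chat template
# accepts a "system" role at all. Model ID is assumed; requires `transformers`.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Hello!"},
]

try:
    # If the bundled Jinja template rejects the system role,
    # this raises instead of returning a formatted prompt.
    print(tok.apply_chat_template(messages, tokenize=False))
except Exception as e:
    print(f"System role rejected by the template: {e}")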
Additionally, in my experience, the Gemma 2 2B and 9B models follow the system prompt very faithfully when run with llama.cpp, which makes me think something is off elsewhere. I haven't tried the 27B model yet.
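For reference, here is a minimal sketch of how I pass a system prompt through llama.cpp's OpenAI-compatible server; the model file name, port, and prompts are placeholders, not a recommendation:

```python
# Minimal sketch: send a system prompt to a Gemma 2 GGUF through
# llama.cpp's OpenAI-compatible server. Start the server first, e.g.:
#   llama-server -m gemma-2-9b-it-Q4_K_M.gguf --port 8080
# (model file name and port are placeholders)
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "Always answer in rhyming couplets."},
            {"role": "user", "content": "What is llama.cpp?"},
        ],
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```

If the reply actually honors the system instruction, that points to the template or implementation on the other stack rather than the weights.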
As for models of a similar size with fewer quirks, there's Qwen2.5-Coder 32B. QwQ is also excellent, but it's a reasoning model, so it comes with quirks of its own. If you want more options, you can filter by size on the leaderboard.