I’m working on a code-output model and doing something similar(ish): I strip out the whole response unless the model’s output begins with standard code keywords like `import`, `def`, ` ``` `, etc. You could try something similar. Maybe use a system prompt so the model always begins with a marker like "Answer:" that you can then splice out of the generation.
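A minimal sketch of that kind of filter — the keyword list and function name here are my own assumptions, not from any particular library:

```python
# Hypothetical filter: keep only the part of a generation that starts
# with a code-like keyword, and drop any chatty preamble before it.
CODE_PREFIXES = ("import ", "from ", "def ", "class ", "```")

def strip_to_code(generation: str) -> str:
    """Drop everything before the first line that looks like code."""
    lines = generation.splitlines()
    for i, line in enumerate(lines):
        if line.lstrip().startswith(CODE_PREFIXES):
            return "\n".join(lines[i:])
    return ""  # no code-like line found: discard the whole response

result = strip_to_code("Sure, here's the code:\nimport os\nprint(os.name)")
print(result)  # prints "import os\nprint(os.name)"
```

For the "Answer:" variant you'd just swap the prefix tuple for that marker and slice it off.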
You could also generate multiple outputs per question and only take the most common answer (self-consistency / majority voting).
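The voting step is just a few lines with the stdlib — a sketch, assuming you've already parsed one final answer string out of each sampled generation:

```python
from collections import Counter

def majority_answer(answers):
    """Return the most common final answer among sampled generations."""
    counts = Counter(answers)
    answer, _count = counts.most_common(1)[0]
    return answer

# e.g. three samples for one question, two of which agree
print(majority_answer(["42", "42", "41"]))  # prints "42"
```

Ties go to whichever answer was seen first, which is usually fine; you could also require a minimum vote count and abstain otherwise.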
Other things to try: use larger effective batch sizes via gradient accumulation, which gives you more stable gradient descent over long training runs.
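The bookkeeping for gradient accumulation is simple — here's a pure-Python toy on 1-D least squares just to show the pattern (the data, learning rate, and accumulation window are arbitrary choices for illustration; in a real setup you'd do the same with your framework's optimizer: scale each micro-batch loss by 1/accum_steps, backprop every micro-batch, step once per window):

```python
# Toy gradient accumulation: fit y = w*x on data where the true w is 2,
# accumulating gradients over 2 micro-batches per optimizer update.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0
lr = 0.05
accum_steps = 2
grad_acc = 0.0

for step, (x, y) in enumerate(data):
    # gradient of (w*x - y)^2 w.r.t. w, scaled so the window averages
    grad_acc += 2 * (w * x - y) * x / accum_steps
    if (step + 1) % accum_steps == 0:
        w -= lr * grad_acc   # one update per accumulation window
        grad_acc = 0.0       # reset for the next window

print(round(w, 3))  # prints 2.375 (moving toward the true w = 2)
```

The payoff is a larger effective batch without the memory cost of actually materializing it.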
What would be cool is training the model to rederive the reasoning trace given the final answer and the question. That might be worth exploring if you’ve got time.