91.7% on MMLU with Llama 3.1 405B AWQ 4-bit

We achieved 91.7% accuracy on the MMLU benchmark (100k questions) using a simple two-stage zero-shot prompting strategy we call TTR (Think Then Respond).

The implementation is straightforward: just two prompts, with the final prompt including the thoughts generated by the first:

thoughtPrompt = "How should you best think about this? Explain your thought process step by step." 

outputPrompt = "Output only a single digit representing your choice (with no additional commentary)"
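For reference, here is a minimal sketch of the two-stage flow against an OpenAI-compatible endpoint (such as one served by vLLM). The endpoint URL, model id, and exact message framing are illustrative assumptions; see the repo for the actual implementation.

```python
# Minimal sketch of TTR (Think Then Respond) against an OpenAI-compatible endpoint.
# base_url, model id, and message framing are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4"  # assumed model id

THOUGHT_PROMPT = "How should you best think about this? Explain your thought process step by step."
OUTPUT_PROMPT = "Output only a single digit representing your choice (with no additional commentary)"

def ttr_answer(question: str) -> str:
    # Stage 1: ask the model to think through the question.
    thoughts = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{question}\n\n{THOUGHT_PROMPT}"}],
    ).choices[0].message.content

    # Stage 2: feed the generated thoughts back and request only the final digit.
    answer = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user", "content": f"{question}\n\n{THOUGHT_PROMPT}"},
            {"role": "assistant", "content": thoughts},
            {"role": "user", "content": OUTPUT_PROMPT},
        ],
    ).choices[0].message.content

    return answer.strip()
```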

This exceeds more complex approaches such as DeepSeek R1’s 90.8%, which requires 64 sampling attempts per question for its pass@1 score.

We used the hugging-quants AWQ 4-bit quantized version of Meta Llama 3.1 405B, served with vLLM for inference.
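On the serving side, here is a sketch of loading the AWQ checkpoint with vLLM's offline Python API. The model id and tensor-parallel setting are assumptions for illustration, not necessarily the exact configuration used for the run above.

```python
# Sketch: loading a 4-bit AWQ checkpoint with vLLM's offline API.
# Model id and tensor_parallel_size are assumptions, not the exact benchmark setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
    quantization="awq",        # 4-bit AWQ weights
    tensor_parallel_size=8,    # shard the 405B model across 8 GPUs
)

params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(
    ["What is 2 + 2? Explain your thought process step by step."],
    params,
)
print(outputs[0].outputs[0].text)
```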

Open source: https://github.com/the-othernet/ttr-prompting
