Best Local LLM for Real-Time Q&A on German/English Transcript?

Hi everyone,

I’m looking for model recommendations for a local Python application.

My project: A desktop app that live-transcribes my PC audio (mostly German). I want to use a local LLM to ask questions about this transcript in real-time.

My key requirements are:

  • Integration: Must work directly with the standard transformers pipeline() on a consumer AMD Ryzen 7 9800X3D CPU. I cannot use a separate server like vLLM/TGI (see the sketch after this list).

  • Performance: I’m looking for models in the ~5B to 13B range that are fast enough for interactive chat.

  • Languages: The model must be strong in both German and English.

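For context, here is a minimal sketch of the setup I have in mind, assuming Llama 3.1 8B Instruct (the model is gated, so the license must be accepted on the Hub first; transcript and the prompts are placeholders):

```python
from transformers import pipeline

transcript = "..."  # placeholder: the live transcript accumulated so far

# Chat pipeline, CPU-only; bfloat16 roughly halves memory vs. fp32
# (~16 GB of RAM for an 8B model instead of ~32 GB).
chat = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cpu",
    torch_dtype="bfloat16",
)

messages = [
    {"role": "system", "content": "Answer questions about the transcript. Reply in the language of the question."},
    # "Worum ging es in der Besprechung?" = "What was the meeting about?"
    {"role": "user", "content": f"Transcript:\n{transcript}\n\nFrage: Worum ging es in der Besprechung?"},
]
out = chat(messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])  # the new assistant reply
```

I'm aware that CPU generation at 8B may only manage a few tokens per second even in bfloat16, which is part of why I'm asking about the ~5B to 13B range.
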
My research suggests that meta-llama/Meta-Llama-3.1-8B-Instruct is currently the best choice for this.

Is this my best option, or are there other more recent high-performing bilingual models (especially finetunes) that fit these constraints?

Thanks for any suggestions!


There doesn’t seem to be much recent leaderboard data for German LLMs, but within the available data, Llama 3.1 Instruct appears to be quite good; it’s a generally well-designed model. At 12B, Mistral Nemo might be a good option. Qwen 2’s German scores weren’t bad either, so Qwen 2.5, which improved significantly over Qwen 2, and its successor Qwen 3 may also be promising.
For multilingual models, Gemma 2 and Gemma 3 are generally excellent.

Whichever model you choose, it would be even better to find a version fine-tuned for German; the Hub now has a feature to search for models by size, which makes narrowing down candidates easier.
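If you want to script that search, here is a rough sketch using huggingface_hub. As far as I know the parameter-count filter is a website feature, so this filters by language tag and sorts by downloads instead:

```python
from huggingface_hub import list_models

# Ten most-downloaded text-generation models tagged for German
for m in list_models(language="de", task="text-generation", sort="downloads", limit=10):
    print(m.id)
```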