Hi everyone,
I’m looking for model recommendations for a local Python application.
My project: A desktop app that live-transcribes my PC audio (mostly German). I want to use a local LLM to ask questions about the transcript in real time.
My key requirements are:
- Integration: Must work directly with the standard transformers pipeline() on a consumer CPU (AMD Ryzen 7 9800X3D). I cannot use a separate server like vLLM/TGI.
- Performance: I'm looking for models in the ~5B to 13B range that are fast enough for interactive chat.
- Languages: The model must be strong in both German and English.
My research suggests that meta-llama/Meta-Llama-3.1-8B-Instruct is currently the best choice for this.
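For reference, this is roughly how I plan to wire it up, an untested sketch based on the chat-format input that pipeline() accepts; the system prompt, the German question, and the generation settings are just placeholders for my use case:

```python
import torch
from transformers import pipeline, TextStreamer

# Untested sketch: load the instruct model for CPU-only chat via pipeline().
pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device="cpu",
    torch_dtype=torch.bfloat16,  # assuming the CPU handles bf16 well; float32 otherwise
)

# Chat-format input; the live transcript would be injected into the system prompt.
messages = [
    {"role": "system", "content": "Answer questions about this transcript: ..."},
    {"role": "user", "content": "Worum ging es im letzten Abschnitt?"},  # German: "What was the last section about?"
]

# Stream tokens to stdout as they are generated so answers feel interactive.
streamer = TextStreamer(pipe.tokenizer, skip_prompt=True)
result = pipe(messages, max_new_tokens=256, streamer=streamer)

# The assistant's reply is the last message in the returned conversation
# (redundant with the streamed output; shown here to illustrate the structure).
print(result[0]["generated_text"][-1]["content"])
```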
Is this my best option, or are there other more recent high-performing bilingual models (especially finetunes) that fit these constraints?
Thanks for any suggestions!