RAG Embeddings: German language

Hi,

I am new to the RAG approach and have looked at many tutorials. For the embedding step I saw that many people use “sentence-transformers/all-MiniLM-L6-v2” from Hugging Face, which works fine for English texts. Now I want to use RAG on German texts, so can I still use the Sentence Transformers embedding, or do I have to use a German one?

If so, what are good open-source embeddings for the German language?

Thanks in advance!

Hi @mox
I just saw your post and was wondering if you had come across something specific. Currently I am experimenting with the LeoLM/leo-mistral-hessianai-7b-chat model on Hugging Face and its applications for QA retrieval using LlamaIndex. I am also looking for the same answer, as I think the accuracy depends heavily on the embedding model. Let me know your thoughts.


Hi Tim,

At the moment I am using intfloat/multilingual-e5-large, which works okay. I also want to check out the Jina AI embeddings, which look promising (“Jina Ich bin ein Berliner Embeddings”).

Hi @mox – a late answer.

I tested several different “smaller” embedding models before possibly moving to Mistral embeddings, and I stumbled upon “danielheinz/e5-base-sts-en-de”.

My use cases are 100% administrative documents, so there is a huge context (e.g. all German laws) that has to be embedded. I tested the embeddings very simply: 6 short searches using synonyms against ~13,000 different lines of text, to find services of the administration.

Only with “e5-base-sts-en-de” did I get 100%. The following embedding models failed the RAG retrieval searches:

  • multilingual-e5-base
  • paraphrase-multilingual-MiniLM-L12-v2
  • paraphrase-multilingual-mpnet-base-v2
  • gte-large
  • gbert-base

But this is only a momentary snapshot and is not representative.
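For anyone who wants to run this kind of synonym-search spot check themselves, here is a small sketch of the evaluation loop. The `embed` function below is only a placeholder (a character-trigram bag) standing in for a real model's `encode()` call, and the German corpus lines and queries are invented examples, not the data from my test:

```python
# Sketch of a tiny top-1 retrieval spot check: for each synonym query,
# does the embedding rank the expected corpus line first?
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Placeholder embedding: lowercase character-trigram counts.
    # Replace with a real model's encode() when comparing embeddings.
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented administrative-service lines and synonym queries
# (query text, index of the expected hit).
corpus = [
    "Personalausweis beantragen",
    "Kfz-Zulassung durchführen",
    "Wohnsitz anmelden",
]
queries = [("Ausweis verlängern", 0), ("Auto zulassen", 1)]

corpus_emb = [embed(line) for line in corpus]
hits = 0
for query, expected in queries:
    q = embed(query)
    ranked = max(range(len(corpus)), key=lambda i: cosine(q, corpus_emb[i]))
    hits += (ranked == expected)

accuracy = hits / len(queries)
print(f"top-1 accuracy: {accuracy:.0%}")  # prints "top-1 accuracy: 100%"
```

With only a handful of queries the result is of course just a snapshot, as noted above; scaling the query set up is what would make the comparison between models meaningful.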