RAG Embeddings: German language

Hi,

I am new to the RAG approach and have looked at many tutorials. For the embedding step, I saw that many people use “sentence-transformers/all-MiniLM-L6-v2” from Hugging Face, which works fine for English texts. Now I want to use RAG on German texts. Can I still use this Sentence Transformers embedding model, or do I have to use a German one?

If so, what are good open-source embedding models for the German language?

Thanks in advance!

Hi @mox
I just saw your post and was wondering if you had come across something specific. I am currently experimenting with the LeoLM/leo-mistral-hessianai-7b-chat model and its applications for QA retrieval using LlamaIndex. I am looking for the same answer, since I think the accuracy depends heavily on the embedding model. Let me know your thoughts.

Hi Tim,

At the moment I am using intfloat/multilingual-e5-large, which works okay. I also want to check the Jina AI embeddings, which look promising (“Jina Ich bin ein Berliner Embeddings”).
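
For reference, the model card for intfloat/multilingual-e5-large says that inputs should start with "query: " or "passage: ". Below is a minimal sketch of that convention for retrieval. The helper names (`with_prefix`, `rank_passages`, `load_e5_encoder`) are made up for illustration and are not part of any library. `load_e5_encoder` assumes the sentence-transformers package is installed and downloads the model on first use.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Encoder = Callable[[List[str]], List[Vector]]


def with_prefix(texts: Sequence[str], prefix: str) -> List[str]:
    """Prepend an E5-style role prefix ("query" or "passage") to each text."""
    return [f"{prefix}: {text}" for text in texts]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_passages(
    query: str, passages: Sequence[str], encode: Encoder
) -> List[Tuple[str, float]]:
    """Return (passage, score) pairs sorted from most to least similar."""
    query_vec = encode(with_prefix([query], "query"))[0]
    passage_vecs = encode(with_prefix(passages, "passage"))
    scored = [(p, cosine(query_vec, v)) for p, v in zip(passages, passage_vecs)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def load_e5_encoder(model_name: str = "intfloat/multilingual-e5-large") -> Encoder:
    """Hypothetical helper: wrap a SentenceTransformer model as an Encoder."""
    # Imported lazily so the helpers above work without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts, normalize_embeddings=True).tolist()
```

Usage would look like `rank_passages("Wie beantrage ich einen Personalausweis?", passages, load_e5_encoder())`, where `passages` is your own list of German text chunks.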

Hi @mox – a late answer.

I tested several “smaller” embedding models before moving on to possible Mistral embeddings, and I stumbled upon “danielheinz/e5-base-sts-en-de”.

My use cases are 100% administrative documents, so there is a “huge” context (e.g. all German laws) that has to be embedded. I ran a very, very simple test of the embeddings: 6 short searches using synonyms over ~13,000 different lines of text, to find services of the administration.

Only “e5-base-sts-en-de” got 100%. The following embedding models failed at least one of the searches:

  • multilingual-e5-base
  • paraphrase-multilingual-MiniLM-L12-v2
  • paraphrase-multilingual-mpnet-base-v2
  • gte-large
  • gbert-base

But this is only a momentary snapshot and is not representative.
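
A small test like the one described above (a handful of synonym queries against a fixed set of lines) can be sketched roughly as follows. This is not the script that was actually used; `hit_rate`, `top_k_indices` and the `expected` mapping are hypothetical names, and `encode` stands for whatever embedding function you plug in (for E5-style models it may need to add the "query: " / "passage: " prefixes itself).

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Encoder = Callable[[List[str]], List[Vector]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_indices(query_vec: Vector, doc_vecs: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k document vectors most similar to the query vector."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def hit_rate(
    encode: Encoder, lines: List[str], expected: Dict[str, int], k: int = 1
) -> float:
    """Fraction of queries whose expected line index appears in the top k.

    `expected` maps each test query to the index (in `lines`) of the line
    that should be retrieved for it.
    """
    if not expected:
        raise ValueError("expected must contain at least one query")
    doc_vecs = encode(lines)  # embed the corpus once
    hits = 0
    for query, target in expected.items():
        query_vec = encode([query])[0]
        if target in top_k_indices(query_vec, doc_vecs, k):
            hits += 1
    return hits / len(expected)
```

Running `hit_rate` once per candidate model on the same lines and queries gives a comparable score per model. With only a few queries the numbers are noisy, so they are a rough indication rather than a benchmark.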

Hi Marc,
we tried to use your recommended embedding model, but we ran into an index-out-of-range error:
IndexError: index out of range in self

According to this link, the problem seems to be fixable by adjusting the vocab size:

Did you run into the same error, and if so, how did you fix it?
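
As general background: in PyTorch this message usually comes from an embedding lookup that receives an index outside the embedding table. In transformer models that typically means either a token id at or above the model's vocabulary size, or an input longer than the position-embedding table. The sketch below checks both conditions. `find_index_problems` and `check_model_input` are hypothetical helper names; the second one assumes the transformers package is installed and that the model's config exposes `vocab_size` and `max_position_embeddings`.

```python
from typing import List, Sequence


def find_index_problems(
    input_ids: Sequence[int], vocab_size: int, max_positions: int
) -> List[str]:
    """Return human-readable reasons why an embedding lookup might fail."""
    problems: List[str] = []
    if input_ids and max(input_ids) >= vocab_size:
        problems.append(
            f"token id {max(input_ids)} is outside the vocabulary (size {vocab_size})"
        )
    # Some architectures (e.g. RoBERTa-style) reserve a few positions, so the
    # usable length can be slightly smaller than max_positions.
    if len(input_ids) > max_positions:
        problems.append(
            f"sequence length {len(input_ids)} exceeds position limit {max_positions}"
        )
    return problems


def check_model_input(model_name: str, text: str) -> List[str]:
    """Tokenize `text` without truncation and compare it to the model config."""
    # Imported lazily so find_index_problems works without the package installed.
    from transformers import AutoConfig, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)
    input_ids = tokenizer(text, truncation=False)["input_ids"]
    return find_index_problems(
        input_ids, config.vocab_size, config.max_position_embeddings
    )
```

If the sequence length turns out to be the cause, truncating or chunking the input (for example by lowering `max_seq_length` on a SentenceTransformer model) is a common remedy. If a token id is out of range, the tokenizer and the model weights probably do not belong together.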

Thanks a lot!

Thanks a lot! This also helps me in our use case (regulations, energy market).

Uh, no, sorry. Luckily, I have not stumbled upon this problem … yet.

Is it by any chance the “BImSchG” (“Bundes-Immissionsschutzgesetz”)? ;) (because I wanted to test that as a PoC)

Thanks a lot for your reply!

Hi Marc, all relevant laws, regulations and other documents in the field of green fuels will be covered. This also includes the BImSchG :). If you are interested in talking, we can exchange views.

Hi, I am also trying to implement a RAG application for German data, in the energy / construction sector.
Do I also have to add the prefixes “query:” and “passage:” with the fine-tuned model “e5-base-sts-en-de”?
Also, which LLM do you use for German, and how did you specify your prompts?
I just adjusted the standard prompts in LlamaIndex with “Only answer in german”; with llama3 as the LLM it seems to mostly work.
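
For illustration, a German-only QA prompt could look roughly like the sketch below. It is plain Python string formatting, not LlamaIndex code. The German wording and the `build_prompt` helper are assumptions, and the `{context_str}` / `{query_str}` placeholder names are chosen to match the ones LlamaIndex uses in its default QA prompt, so the template text could be adapted there.

```python
from typing import Sequence

# Assumption: the whole prompt is written in German instead of appending an
# English "Only answer in German" instruction to an English prompt.
GERMAN_QA_TEMPLATE = (
    "Beantworte die Frage ausschließlich auf Deutsch und nur anhand des "
    "folgenden Kontexts. Wenn der Kontext die Antwort nicht enthält, sage das.\n"
    "\n"
    "Kontext:\n"
    "{context_str}\n"
    "\n"
    "Frage: {query_str}\n"
    "Antwort:"
)


def build_prompt(context_chunks: Sequence[str], question: str) -> str:
    """Join the retrieved chunks and fill them into the German QA template."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return GERMAN_QA_TEMPLATE.format(context_str=context, query_str=question)
```

Whether a fully German prompt works better than an English prompt with a language instruction depends on the LLM, so it is worth trying both on a few of your own questions.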