RAG Embeddings: German language

Hi,

I am new to the RAG approach and have looked at many tutorials. For the embedding step I saw that many people use “sentence-transformers/all-MiniLM-L6-v2” from Hugging Face, which works fine for English texts. Now I want to use RAG on German texts, so can I still use the Sentence Transformers embedding, or do I have to use a German one?

If so, what are good open-source embeddings for the German language?

Thanks in advance!

Hi @mox
I just saw your post and was wondering if you had come across something specific. Currently I am experimenting with the LeoLM/leo-mistral-hessianai-7b-chat model on Hugging Face and its applications for QA retrieval using LlamaIndex. I am also looking for the same answer, as I think the accuracy depends heavily on the embedding model. Let me know your thoughts.


Hi Tim,

At the moment I am using intfloat/multilingual-e5-large, which works okay. I also want to check out the Jina AI embeddings, which look promising (“Jina Ich bin ein Berliner Embeddings”).

Hi @mox – a late answer.

I tested several different “smaller” embedding models before possibly moving to Mistral embeddings, and I stumbled upon “danielheinz/e5-base-sts-en-de”.

My use cases are 100% administrative documents, so there is a huge context (e.g. all German laws) that has to be embedded. I tested the embeddings very simply: 6 short searches using synonyms against ~13,000 different lines of text, to find services of the administration.

Only with “e5-base-sts-en-de” did I get 100%. The following embedding models failed the RAG retrieval searches:

  • multilingual-e5-base
  • paraphrase-multilingual-MiniLM-L12-v2
  • paraphrase-multilingual-mpnet-base-v2
  • gte-large
  • gbert-base

But this is only a momentary snapshot and is not representative.
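For anyone who wants to run this kind of synonym-search spot check themselves, here is a small sketch of the evaluation loop. The `embed` function below is only a placeholder (a character-trigram bag) standing in for a real model's `encode()` call, and the German corpus lines and queries are invented examples, not the data from my test:

```python
# Sketch of a tiny top-1 retrieval spot check: for each synonym query,
# does the embedding rank the expected corpus line first?
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Placeholder embedding: lowercase character-trigram counts.
    # Replace with a real model's encode() when comparing embeddings.
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented administrative-service lines and synonym queries
# (query text, index of the expected hit).
corpus = [
    "Personalausweis beantragen",
    "Kfz-Zulassung durchführen",
    "Wohnsitz anmelden",
]
queries = [("Ausweis verlängern", 0), ("Auto zulassen", 1)]

corpus_emb = [embed(line) for line in corpus]
hits = 0
for query, expected in queries:
    q = embed(query)
    ranked = max(range(len(corpus)), key=lambda i: cosine(q, corpus_emb[i]))
    hits += (ranked == expected)

accuracy = hits / len(queries)
print(f"top-1 accuracy: {accuracy:.0%}")  # prints "top-1 accuracy: 100%"
```

With only a handful of queries the result is of course just a snapshot, as noted above; scaling the query set up is what would make the comparison between models meaningful.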