Use RAGAS with a Hugging Face LLM

Hello everybody,

I want to use the RAGAS library to evaluate my RAG pipeline. The evaluation model should be a Hugging Face model such as Llama-2, Mistral, or Gemma. How can I implement this with RAGAS, or is there another solution?

The examples by the RAGAS team aren’t helpful for me because they don’t show how to use a specific Hugging Face model. A specific LangChain pipeline doesn’t work either, because it isn’t the required class.

Best regards
Christian

Hi,
take a look at this post. I think you will find it helpful :slight_smile:

If I read this correctly, some metrics also need an embedding model. The easiest way would be to put them in langchain wrappers as follows:

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_community.llms import HuggingFacePipeline
from langchain_community.embeddings import HuggingFaceEmbeddings
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas import evaluate

# embedding model
embedding_model = HuggingFaceEmbeddings(model_name="my-model-id")

# evaluator
model_id = "my-evaluator-id"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    temperature=0.1, 
    repetition_penalty=1.1  # without this output begins repeating
)

evaluator = HuggingFacePipeline(pipeline=pipe)

# ragas
result = evaluate(
    dataset=dataset,
    llm=evaluator,
    embeddings=embedding_model,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)
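
The snippet above passes a dataset that is never defined. As a minimal sketch, assuming the column layout used in a later post of this thread (question, answer, contexts, ground_truth), the helper below checks a plain dict before it is converted into a dataset. The name validate_samples is hypothetical and not part of RAGAS, and the sample row is abbreviated from that later post.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def validate_samples(samples: dict) -> int:
    """Check a column-oriented dict and return the number of rows.

    Hypothetical helper, not part of RAGAS: it only verifies the layout
    assumed here (four equally long columns, contexts as lists of strings).
    """
    missing = [col for col in REQUIRED_COLUMNS if col not in samples]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    lengths = {len(samples[col]) for col in REQUIRED_COLUMNS}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of rows")

    for row in samples["contexts"]:
        if not isinstance(row, list) or not all(isinstance(c, str) for c in row):
            raise ValueError("each entry of 'contexts' must be a list of strings")

    return lengths.pop()


samples = {
    "question": ["When was the first super bowl?"],
    "answer": ["The first superbowl was held on Jan 15, 1967"],
    "contexts": [["The First AFL-NFL World Championship Game was played on January 15, 1967."]],
    "ground_truth": ["The first superbowl was held on January 15, 1967"],
}
n_rows = validate_samples(samples)
```

Once the check passes, Dataset.from_dict(samples) from the datasets package yields the object that is passed as dataset.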

I recently implemented something similar. You need sentence-transformers and the latest version of transformers installed; otherwise I ran into problems with sentence-transformers.

With the latest version of transformers, however, I had problems importing pipeline (some TensorFlow error). In that case you can also write the pipeline class yourself.

class CustomPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        return preprocess_kwargs, {}, {}

    def preprocess(self, text):
        return self.tokenizer(text,  return_tensors="pt").to(self.device)

    def _forward(self, inputs):
        outputs = self.model.generate(**inputs)
        return outputs

    def postprocess(self, outputs):
        outputs = self.tokenizer.decode(outputs[0])
        return outputs

# init pipe
pipe = CustomPipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    temperature=0.1, 
    repetition_penalty=1.1
)

# langchain wrapper
evaluator = HuggingFacePipeline(pipeline=pipe)

Hi @CKeibel,

I want to evaluate my RAG pipeline using an open-source LLM instead of GPT-4. I have attached the code for the RAGAS evaluation below. Am I passing the LLM to the evaluate function correctly? I get the warning “WARNING:ragas.llms.output_parser:Failed to parse output. Returning None.” I have checked several logs, and I think the problem lies in how the LLM is passed to evaluate.


from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.chat_models.huggingface import ChatHuggingFace
from langchain_community.llms import HuggingFaceHub

embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
hugging_llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)

llm = LangchainLLMWrapper(hugging_llm)
embeddings = LangchainEmbeddingsWrapper(embedding_model)

from ragas import evaluate
from ragas.metrics import context_precision
result = evaluate(
    dataset=eval_dataset,
    llm=llm,
    embeddings=embeddings,
    metrics=[
        context_precision,
    ],
)


Thanks!

Hi Keibel!
Thank you very much for your detailed response. Even so, I still get some errors.

When you define the following class:

class CustomPipeline(Pipeline):

Is the Pipeline you use imported via from transformers.pipelines.base import Pipeline?

Following your second code option gives me the following error:

  File "/usr/local/lib/python3.10/dist-packages/ragas/llms/base.py", line 147, in generate_text
    result = self.langchain_llm.generate_prompt(
  File "/usr/local/lib/python3.10/dist-packages/langchain_core/language_models/llms.py", line 633, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/langchain_core/language_models/llms.py", line 803, in generate
    output = self._generate_helper(
  File "/usr/local/lib/python3.10/dist-packages/langchain_core/language_models/llms.py", line 670, in _generate_helper
    raise e
  File "/usr/local/lib/python3.10/dist-packages/langchain_core/language_models/llms.py", line 657, in _generate_helper
    self._generate(
  File "/usr/local/lib/python3.10/dist-packages/langchain_community/llms/huggingface_pipeline.py", line 279, in _generate
    text = response["generated_text"]
TypeError: string indices must be integers

Do you know what it could be? Maybe Pipeline isn’t being imported correctly.

Thank you very much in advance! :slight_smile:
Saioa
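
One plausible reading of the traceback above: the failing line is text = response["generated_text"], so the LangChain wrapper expects each pipeline result to be a dict (or a list of dicts) with a generated_text key, while CustomPipeline.postprocess returns a plain string. The sketch below reproduces that mismatch in plain Python. Both function names are hypothetical: extract_text only imitates the line from the traceback and is not the library's actual code, and format_output shows the shape postprocess would need to return under this assumption.

```python
def extract_text(response):
    """Hypothetical stand-in for the failing line in the traceback."""
    if isinstance(response, list):  # assumption: a list holds one dict per generation
        response = response[0]
    return response["generated_text"]


def format_output(decoded_text: str) -> list:
    """Wrap a decoded string the way the built-in text-generation pipeline does."""
    return [{"generated_text": decoded_text}]


# A plain string, as returned by CustomPipeline.postprocess, fails:
try:
    extract_text("some decoded text")
    failed = False
except TypeError:
    failed = True  # indexing a str with a str key raises TypeError

# A list of dicts with a "generated_text" key works:
text = extract_text(format_output("some decoded text"))
```

In the custom class this would mean ending postprocess with return [{"generated_text": outputs}]. The import path is unlikely to be the cause, since from transformers import Pipeline and from transformers.pipelines.base import Pipeline refer to the same class. Note also that _sanitize_parameters in the class above returns empty dicts, so temperature and repetition_penalty passed to the constructor are not forwarded to generate.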

Hello @kp1264!

I’m looking for different ways to use Ragas with Llama-3 as the evaluator LLM, and I get this same warning that you mentioned:

“WARNING:ragas.llms.output_parser:Failed to parse output. Returning None.”

As a result, the metrics are NaN, of course:

{'faithfulness': nan}

Did you manage to solve it?

Thank you very much!

I am getting the same error:

“WARNING:ragas.llms.output_parser:Failed to parse output. Returning None.”

I added tracing with LangSmith to inspect the requests and responses. For the given input prompt, I believe the issue is the context length, because I get a blank response. I tried different LLMs, but the error remains the same.

Can anyone suggest open-source models that work as the evaluator?
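
Since the warning comes from ragas.llms.output_parser, a first debugging step is to look at the raw text the evaluator model returns. The sketch below assumes, without confirmation from this thread, that the metric prompts expect a JSON object in the reply. The helper find_json is hypothetical and uses only the standard library; it reports whether a reply contains parseable JSON. A blank reply, like the one seen in the trace described above, fails this check immediately.

```python
import json
from typing import Any, Optional


def find_json(reply: str) -> Optional[Any]:
    """Return the first JSON object or array found in reply, else None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(reply):
        if char not in "{[":
            continue
        try:
            value, _ = decoder.raw_decode(reply, start)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this position, keep scanning
    return None


blank = find_json("")
chatty = find_json('Sure! Here is the verdict: {"verdict": 1, "reason": "supported"}')
broken = find_json('{"verdict": 1, "reason": ')
```

Here blank and broken are None, while chatty holds the parsed dict. If replies come back empty or cut off, shortening the prompt (fewer or shorter contexts) or raising the generation limit may be worth trying before changing the RAGAS setup.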

What eval datasets can I use for this? I’m getting errors with the SQuAD dataset.

@scarte @sheetalkamthe55
Were you able to resolve the error? I’m facing a similar one. I’m using Mistral 7B with the maximum context length of 8k, but I’m still stuck at this point.

Hi @himjoshi !

In the end I solved it by loading the model with Ollama, as explained in this tutorial:

You load the model you want with ChatOllama, which in your case will be mistral:

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="mistral")

The embeddings can be loaded the same way with OllamaEmbeddings, if you want them.

Sorry if it’s not very helpful, but it’s how I was able to move forward.
If you find another solution, please share it here! Thank you!

Saioa :slight_smile:
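
The Ollama route described above can be sketched as follows. It assumes a local Ollama server with the model already pulled, and that OllamaEmbeddings is imported from langchain_community.embeddings. The helper names build_ollama_evaluator and run_evaluation are hypothetical. The imports sit inside the functions so the file can be loaded even when the optional packages are not installed.

```python
def build_ollama_evaluator(model_name: str = "mistral"):
    """Return (llm, embeddings) backed by a local Ollama server.

    Assumption: Ollama is running locally and model_name has been pulled.
    """
    from langchain_community.chat_models import ChatOllama
    from langchain_community.embeddings import OllamaEmbeddings

    llm = ChatOllama(model=model_name)
    embeddings = OllamaEmbeddings(model=model_name)
    return llm, embeddings


def run_evaluation(dataset, model_name: str = "mistral"):
    """Score dataset with faithfulness, using Ollama models as the evaluator."""
    from ragas import evaluate
    from ragas.metrics import faithfulness

    llm, embeddings = build_ollama_evaluator(model_name)
    return evaluate(
        dataset=dataset,
        llm=llm,
        embeddings=embeddings,
        metrics=[faithfulness],
    )
```

Calling run_evaluation(dataset) then mirrors the evaluate calls shown earlier in the thread, with the Hugging Face wrappers swapped for the Ollama ones.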

Hi @codegood !

You can use the “explodinggradients/amnesty_qa” dataset, loading it as:

from datasets import load_dataset

data = load_dataset("explodinggradients/amnesty_qa", "english_v2")

Hey, I am facing the same error. Did you manage to solve it? Please let me know.
I am using the following code:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.chat_models.huggingface import ChatHuggingFace
from langchain.llms import HuggingFaceHub

embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
hugging_llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)

import nest_asyncio
nest_asyncio.apply()
from datasets import Dataset
import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [
        ['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
        ['The Green Bay Packers…Green Bay, Wisconsin.', 'The Packers compete…Football Conference'],
    ],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times'],
}

dataset = Dataset.from_dict(data_samples)

score = evaluate(dataset, metrics=[faithfulness, answer_correctness], llm=hugging_llm, embeddings=embedding_model)
score.to_pandas()

Output:
WARNING:ragas.llms.output_parser:Failed to parse output. Returning None.
WARNING:ragas.llms.output_parser:Failed to parse output. Returning None.
WARNING:ragas.llms.output_parser:Failed to parse output. Returning None.

Hey @CKeibel, when I import RAGAS in my Jupyter notebook, it throws an error about the OpenAI (OAI) keys by default. Did you come across this error while implementing it? If possible, could you tell me which package versions you are using?

Thanks,
Mano