Getting an additional response from my RAG using HuggingFaceEndpoint inference

Hi folks

I am running remote inference using HuggingFaceEndpoint:

from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="HuggingFaceH4/zephyr-7b-alpha",
    task="text-generation",
    temperature=0.5,
    max_new_tokens=1024
)

I have used the langchain-ai/retrieval-qa-chat prompt and a vectorstore retriever, and created the RAG chain using the approach below:

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)
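
For context, retrieval_qa_chat_prompt is pulled from the LangChain hub, and the chain is invoked roughly like this (a sketch, assuming the retriever is already built from my vectorstore):

from langchain import hub

retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")  # pulled before building the chain above

result = rag_chain.invoke({"input": "Which runtime does Transformers.js uses"})  # question goes under the "input" key
print(result["answer"])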

Input: Which runtime does Transformers.js uses
Sample answer I am getting:
'answer': ' to run models in the browser?\nAssistant: Transformers.js uses ONNX Runtime to run models in the browser.'

Any idea why I am getting the extra text before "Assistant: Transformers.js uses ONNX Runtime to run models in the browser."?


I’ve never used LangChain, so I’m not sure, but isn’t that just the raw output of the LLM?
I think there are ways to specify a prompt template so the model outputs the answer as-is as much as possible, or to parse the output with an OutputParser, etc.
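
For example, something along these lines might work for the parsing idea (a rough, untested sketch; strip_leading_turns is a made-up helper, not something LangChain provides):

from langchain_core.runnables import RunnableLambda

def strip_leading_turns(text: str) -> str:
    # Keep only what follows the last "Assistant:" marker, if the model echoed extra chat turns
    return text.rsplit("Assistant:", 1)[-1].strip() if "Assistant:" in text else text.strip()

# Post-process the answer after invoking the chain...
result = rag_chain.invoke({"input": "Which runtime does Transformers.js uses"})
print(strip_leading_turns(result["answer"]))

# ...or compose the cleanup into the chain itself
cleaned_chain = rag_chain | RunnableLambda(
    lambda out: {**out, "answer": strip_leading_turns(out["answer"])}
)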


Thanks.

The GFG link helped.
I needed to create the prompt in the Zephyr format since I am using a Zephyr model.

This is the prompt that produced output without the additional response at the start:

from langchain_core.prompts import ChatPromptTemplate

chat_prompt_2 = ChatPromptTemplate.from_template("""
<|system|>
You are an AI Assistant that follows instructions extremely well.
Please be truthful and give direct answers. Please say 'I don't know' if the user query is not in the context.
</s>
<|user|>
Context: {context}

Question: {input}
</s>
<|assistant|>
""")
