Getting an additional response from my RAG using HuggingFaceEndpoint inference

Hi folks

I am running remote inference using HuggingFaceEndpoint:

from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="HuggingFaceH4/zephyr-7b-alpha",
    task="text-generation",
    temperature=0.5,
    max_new_tokens=1024
)

I have used the langchain-ai/retrieval-qa-chat prompt and a vectorstore retriever, and created the RAG chain using the approach below:

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)
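
For context, retrieval_qa_chat_prompt is pulled from the LangChain hub, and the chain is invoked roughly like this (a sketch, assuming the retriever is already built from my vectorstore):

from langchain import hub

retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")  # pulled before building the chain above

result = rag_chain.invoke({"input": "Which runtime does Transformers.js uses"})  # question goes under the "input" key
print(result["answer"])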

Input: Which runtime does Transformers.js uses
Sample answer I am getting:
'answer': ' to run models in the browser?\nAssistant: Transformers.js uses ONNX Runtime to run models in the browser.'

Any idea why I am getting the extra text before "Assistant: Transformers.js uses ONNX Runtime to run models in the browser."?


I’ve never used LangChain, so I’m not sure, but isn’t that just the raw output of the LLM?
I think there are ways to specify a prompt template so the model outputs the answer as-is as much as possible, or to parse the output with an OutputParser, etc.
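
For example, something along these lines might work for the parsing idea (a rough, untested sketch; strip_leading_turns is a made-up helper, not something LangChain provides):

from langchain_core.runnables import RunnableLambda

def strip_leading_turns(text: str) -> str:
    # Keep only what follows the last "Assistant:" marker, if the model echoed extra chat turns
    return text.rsplit("Assistant:", 1)[-1].strip() if "Assistant:" in text else text.strip()

# Post-process the answer after invoking the chain...
result = rag_chain.invoke({"input": "Which runtime does Transformers.js uses"})
print(strip_leading_turns(result["answer"]))

# ...or compose the cleanup into the chain itself
cleaned_chain = rag_chain | RunnableLambda(
    lambda out: {**out, "answer": strip_leading_turns(out["answer"])}
)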


Thanks.

The GFG link helped.
I needed to create the prompt in the Zephyr format since I am using a Zephyr model.

This is the prompt that produced output without the additional response at the start:

from langchain_core.prompts import ChatPromptTemplate

chat_prompt_2 = ChatPromptTemplate.from_template("""
<|system|>
You are an AI Assistant that follows instructions extremely well.
Please be truthful and give direct answers. Please say 'I don't know' if the user query is not in the context.
</s>
<|user|>
Context: {context}

Question: {input}
</s>
<|assistant|>
""")
