RAG LLM returns the prompt along with the response

I am trying to build a RAG pipeline using open-source models, but when generating the response the LLM includes the entire prompt and the retrieved documents in the output. Can anyone please tell me how to remove the prompt and the Question section and get only the Answer in the response?

Code:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("EM_Theory.pdf")
pages = loader.load_and_split()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
text_chunks = text_splitter.split_documents(pages)

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')
vector_store = FAISS.from_documents(text_chunks, embedding=embeddings)

from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
prompt_template = PromptTemplate(input_variables=['chat_history', 'question'],
    template='''Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question,
in its original language. Only generate the answer of the asked question.
Don't generate the contexts and questions in output
\n\nChat History:\n{chat_history}\nFollow Up Input: {question}''')
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# llm is the text-generation model wrapper; its definition is not shown in this post
chain = ConversationalRetrievalChain.from_llm(llm=llm, chain_type='stuff', condense_question_prompt=prompt_template,
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}),
    memory=memory)
query = 'what is the Maxwell’s equation?'
history = []
result = chain({"question": query, "chat_history": history})
history.append((query, result["answer"]))

print(result)

Output:
{'question': 'what is the Maxwell’s equation?',
'chat_history': [HumanMessage(content='what is the Maxwell’s equation?'),
AIMessage(content="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nLet’s play physics 9681634157 \n10 \n \n \n \n \nWAVE EQUATION IN FREE SPACE \n Write down Maxwell’s equation in free space. Obtain the wave equation for electric \nfield intensity from them. CU 1010, 09 ,06, 01 \n OR \nShow that Maxwell’s equations suggest propagation of electromagnetic wave in a linear \nhomogeneous dielectric medium having no free charge. CU 2014 \n OR \n \nDerive the expression of speed of light from Maxwell’s equations. CU 2015 \n \n Show that for a plane em wave in free space, the unit vector in the direction of \npropagation the electric and magnetic fields are mutually perpendi cular. 4 \n OR CU2011 ,06 , 05\n\nLet’s play physics 9681634157 \n16 \n 𝛻⃗ ×𝐻⃗⃗ =𝐽 +𝜕𝐷⃗⃗ \n𝜕𝑡 \n ∴ 𝛻⃗ ∙(𝐸⃗ ×𝐻⃗⃗ )=𝐻⃗⃗ ∙(𝛻⃗ ×𝐸⃗ )−𝐸⃗ ∙(𝛻⃗ ×𝐻⃗⃗ ) \n \n=−𝐻⃗⃗ ∙𝜕𝐵⃗ \n𝜕𝑡−𝐸⃗ ∙(𝐽 +𝜕𝐷⃗⃗ \n𝜕𝑡) \n=−𝐻⃗⃗ ∙𝜕𝐵⃗ \n𝜕𝑡−𝐸⃗ ∙𝐽 −𝐸⃗ ∙𝜕𝐷⃗⃗ \n𝜕𝑡 \n For a linear medium 𝐷⃗⃗ =∈𝐸⃗ & 𝐵⃗ =𝜇𝐻⃗⃗ \n∴ 𝛻⃗ ∙(𝐸⃗ ×𝐻⃗⃗ )=−1\n2𝜕\n𝜕𝑡(𝐻⃗⃗ ∙𝐵⃗ )−1\n2𝜕\n𝜕𝑡(𝐸⃗ ∙𝐷⃗⃗ )−𝐸⃗ ∙𝐽 \n=−𝜕\n𝜕𝑡(1\n2𝐻⃗⃗ ∙𝐵⃗ +1\n2𝐸⃗ ∙𝐷⃗⃗ )−𝐸⃗ ∙𝐽 \nIntegrating above equations over a volume 𝑉 bounded by closed surface 𝑆 and \napplying divergence theorem, \n∮(𝐸⃗ ×𝐻⃗⃗ ) \n𝑆∙𝑑𝑆 =−𝑑\n𝑑𝑡∫1\n2(𝐸⃗ ∙𝐷⃗⃗ +𝐵⃗ ∙𝐻⃗⃗ )𝑑𝑉−∫𝐸⃗ ∙𝐽 \n𝑣 \n𝑣 𝑑𝑉 \n \n𝑂𝑟,∮(𝐸⃗ ×𝐻⃗⃗ ) \n𝑆∙𝑑𝑆 +∫𝐸⃗ ∙𝐽 \n𝑣 𝑑𝑉=−𝑑\n𝑑𝑡∫1\n2 \n𝑣(𝐸⃗ ∙𝐷⃗⃗ 
+𝐵⃗ ∙𝐻⃗⃗ )𝑑𝑉 \n It is the mathematical form of Poynting’s theorem. \nLet us now find a physical meaning of this equation. \na. The rate of work done by E.M. force on an element charge 𝑑𝑞 (=𝜌 𝑑𝑉) is given \nby,\n\nQuestion: what is the Maxwell’s equation?\nHelpful Answer: I do not have enough information about Maxell’s equation therefore I cannot provide an answer.")],
'answer': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nLet’s play physics 9681634157 \n10 \n \n \n \n \nWAVE EQUATION IN FREE SPACE \n Write down Maxwell’s equation in free space. Obtain the wave equation for electric \nfield intensity from them. CU 1010, 09 ,06, 01 \n OR \nShow that Maxwell’s equations suggest propagation of electromagnetic wave in a linear \nhomogeneous dielectric medium having no free charge. CU 2014 \n OR \n \nDerive the expression of speed of light from Maxwell’s equations. CU 2015 \n \n Show that for a plane em wave in free space, the unit vector in the direction of \npropagation the electric and magnetic fields are mutually perpendi cular. 4 \n OR CU2011 ,06 , 05\n\nLet’s play physics 9681634157 \n16 \n 𝛻⃗ ×𝐻⃗⃗ =𝐽 +𝜕𝐷⃗⃗ \n𝜕𝑡 \n ∴ 𝛻⃗ ∙(𝐸⃗ ×𝐻⃗⃗ )=𝐻⃗⃗ ∙(𝛻⃗ ×𝐸⃗ )−𝐸⃗ ∙(𝛻⃗ ×𝐻⃗⃗ ) \n \n=−𝐻⃗⃗ ∙𝜕𝐵⃗ \n𝜕𝑡−𝐸⃗ ∙(𝐽 +𝜕𝐷⃗⃗ \n𝜕𝑡) \n=−𝐻⃗⃗ ∙𝜕𝐵⃗ \n𝜕𝑡−𝐸⃗ ∙𝐽 −𝐸⃗ ∙𝜕𝐷⃗⃗ \n𝜕𝑡 \n For a linear medium 𝐷⃗⃗ =∈𝐸⃗ & 𝐵⃗ =𝜇𝐻⃗⃗ \n∴ 𝛻⃗ ∙(𝐸⃗ ×𝐻⃗⃗ )=−1\n2𝜕\n𝜕𝑡(𝐻⃗⃗ ∙𝐵⃗ )−1\n2𝜕\n𝜕𝑡(𝐸⃗ ∙𝐷⃗⃗ )−𝐸⃗ ∙𝐽 \n=−𝜕\n𝜕𝑡(1\n2𝐻⃗⃗ ∙𝐵⃗ +1\n2𝐸⃗ ∙𝐷⃗⃗ )−𝐸⃗ ∙𝐽 \nIntegrating above equations over a volume 𝑉 bounded by closed surface 𝑆 and \napplying divergence theorem, \n∮(𝐸⃗ ×𝐻⃗⃗ ) \n𝑆∙𝑑𝑆 =−𝑑\n𝑑𝑡∫1\n2(𝐸⃗ ∙𝐷⃗⃗ +𝐵⃗ ∙𝐻⃗⃗ )𝑑𝑉−∫𝐸⃗ ∙𝐽 \n𝑣 \n𝑣 𝑑𝑉 \n \n𝑂𝑟,∮(𝐸⃗ ×𝐻⃗⃗ ) \n𝑆∙𝑑𝑆 +∫𝐸⃗ ∙𝐽 \n𝑣 𝑑𝑉=−𝑑\n𝑑𝑡∫1\n2 \n𝑣(𝐸⃗ ∙𝐷⃗⃗ 
+𝐵⃗ ∙𝐻⃗⃗ )𝑑𝑉 \n It is the mathematical form of Poynting’s theorem. \nLet us now find a physical meaning of this equation. \na. The rate of work done by E.M. force on an element charge 𝑑𝑞 (=𝜌 𝑑𝑉) is given \nby,\n\nQuestion: what is the Maxwell’s equation?\nHelpful Answer: I do not have enough information about Maxell’s equation therefore I cannot provide an answer."}

I have also tried mistralai/Mistral-7B-Instruct-v0.2, NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO and mistralai/Mixtral-8x7B-Instruct-v0.1, but got the same kind of result.
LangChain version: 0.1.9

Can anyone help me solve this issue?


I have the same problem. Have you found a solution?
When I use a text2text-generation model instead, the prompt doesn't appear in the response.

No, I have not found a solution yet, so I am using a regex to work around the issue:

import re

# keep only the text after the last "Answer:" marker
result['answer'] = re.split('Answer:', result['answer'])[-1]
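
A slightly more defensive version of the same workaround could look like the sketch below. It assumes the prompt ends with the "Helpful Answer:" marker seen in the output above; `extract_answer` and `ANSWER_MARKER` are hypothetical names chosen for illustration, not part of LangChain.

```python
import re

# Assumption: the QA prompt ends with "Helpful Answer:", as in the output above.
ANSWER_MARKER = re.compile(r"Helpful Answer:\s*")


def extract_answer(full_text: str) -> str:
    """Return the text after the last answer marker.

    If the marker is missing, the whole (stripped) input is returned,
    so a model that already returns only the answer is left alone.
    """
    parts = ANSWER_MARKER.split(full_text)
    return parts[-1].strip()


raw = "Use the following pieces of context...\n\nQuestion: q?\nHelpful Answer: I do not know."
print(extract_answer(raw))
```

This is still post-processing of the string, so it breaks if the model itself writes "Helpful Answer:" inside its reply.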

I'm facing the same issue with Llama 2.

I found the return_only_outputs parameter in the LangChain documentation for ConversationalRetrievalChain; maybe that will help. However, it is marked as deprecated.

Usually this problem is decoder related. During generation, each new token is appended to the input sequence and fed back into the model to produce the next token, until the EOS token is generated, so the returned sequence contains the prompt followed by the completion. In Hugging Face transformers, something like this can be done to decode only the new tokens:

tokenized_prompt = tokenizer(prompt, return_tensors="pt").to("cuda")
# generate new tokens
outputs = model.generate(**tokenized_prompt)
# decode only new tokens to string
tokenizer.decode(outputs[0][len(tokenized_prompt.input_ids[0]):])

Since we know the tokenized input length of the prompt (len(tokenized_prompt.input_ids[0])), we can give the sequence to the decoder and only decode from the end of the input sequence.
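
The slicing idea can be tried without loading a model. The sketch below uses plain lists of made-up token IDs in place of real tensors; `new_token_ids` is a hypothetical helper name, and the ID values are arbitrary.

```python
from typing import List


def new_token_ids(output_ids: List[int], prompt_len: int) -> List[int]:
    """Drop the first prompt_len IDs, keeping only the generated tokens."""
    if prompt_len > len(output_ids):
        raise ValueError("prompt is longer than the generated sequence")
    return output_ids[prompt_len:]


# Toy data: a decoder-only model returns prompt IDs followed by new IDs.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]

print(new_token_ids(output_ids, len(prompt_ids)))  # [42, 43, 2]
```

Slicing at the token level avoids relying on the decoded prompt text matching the original string exactly.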

Perhaps something similar to the Hugging Face transformers approach would work here instead of split, slicing the prompt off by its string length:

answer = result['answer'][len(query):]

However, this only works if query holds the full prompt string that was actually sent to the model (template plus retrieved context), not just the user's question. I think LangChain always wraps everything in its own classes, which is why I don't like working with it: you give up some control.
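
A minimal sketch of that string-level slicing, with a check that the output really starts with the prompt before anything is cut off. `strip_prompt` is a hypothetical helper, and it assumes you can get hold of the exact prompt string sent to the model.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Remove the prompt from the start of the generated text, if present."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    # The output does not begin with the prompt: leave it unchanged
    # instead of cutting off part of the real answer.
    return generated


full_prompt = "Question: what is 2 + 2?\nHelpful Answer:"
model_output = full_prompt + " 4"
print(strip_prompt(model_output, full_prompt))  # 4
```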


I ran into this problem in the last few days.
The only solution I found was to downgrade LangChain to version 0.1.6. Then it works fine again.

I also got the same problem with Llama 3.

I would highly recommend following this link to understand the prompt format for Llama 2.

If you're using a transformers pipeline, set return_full_text=False.

Example:

import transformers

# model and tokenizer are assumed to be loaded already
pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    max_new_tokens=512,
    do_sample=True,
    return_full_text=False,  # return only the newly generated text, not the prompt
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature': 0.1})