How do I get just the answer from Llama-2 instead of it repeating the whole prompt?

Here is my script:

from transformers import AutoTokenizer, AutoModelForCausalLM   

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

prompt = """
CONTEXT: Harvard University is a private Ivy League research university in Cambridge, Massachusetts. 
Founded in 1636 as Harvard College and named for its first benefactor, the Puritan clergyman John Harvard, it is the oldest institution of higher learning in the United States. Its influence, wealth, and rankings have made it one of the most prestigious universities in the world.

QUESTION: Which year was Harvard University found?
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

Here is the output:

<s> 
CONTEXT: Harvard University is a private Ivy League research university in Cambridge, Massachusetts. 
Founded in 1636 as Harvard College and named for its first benefactor, the Puritan clergyman John Harvard, 
it is the oldest institution of higher learning in the United States. Its influence, wealth, 
and rankings have made it one of the most prestigious universities in the world.

QUESTION: Which year was Harvard University found?
ANSWER: Harvard University was founded in 1636.</s>

How should I prompt in order to get just the answer instead of repeating the input prompt?

There's no prompt that will do this. The model is stateless, so each time you call it, it needs to see the full prompt for context about what you want it to do, and by default generate() returns the prompt tokens together with the newly generated ones.

To retrieve just the response, slice the generated sequence at the length of the input, like so using input_token_len:

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_token_len = input_ids.shape[-1]  # number of prompt tokens
outputs = model.generate(input_ids, max_new_tokens=200)
# Drop the prompt tokens and decode only the newly generated part
print(tokenizer.decode(outputs[0][input_token_len:], skip_special_tokens=True))
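Alternatively, the transformers text-generation pipeline can do the slicing for you: passing return_full_text=False makes it return only the newly generated text. A minimal sketch, reusing the model and tokenizer loaded above:

from transformers import pipeline

# return_full_text=False strips the prompt from the returned text
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe(prompt, max_new_tokens=200, return_full_text=False)
print(result[0]["generated_text"])

Either way the model still processes the full prompt internally; you are only changing what gets returned to you.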