How do I get just the answer from Llama-2 instead of it repeating the whole prompt?

Here is my script:

from transformers import AutoTokenizer, AutoModelForCausalLM   

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

prompt = """
CONTEXT: Harvard University is a private Ivy League research university in Cambridge, Massachusetts. 
Founded in 1636 as Harvard College and named for its first benefactor, the Puritan clergyman John Harvard, it is the oldest institution of higher learning in the United States. Its influence, wealth, and rankings have made it one of the most prestigious universities in the world.

QUESTION: Which year was Harvard University found?
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

Here is the output:

<s> 
CONTEXT: Harvard University is a private Ivy League research university in Cambridge, Massachusetts. 
Founded in 1636 as Harvard College and named for its first benefactor, the Puritan clergyman John Harvard, 
it is the oldest institution of higher learning in the United States. Its influence, wealth, 
and rankings have made it one of the most prestigious universities in the world.

QUESTION: Which year was Harvard University found?
ANSWER: Harvard University was founded in 1636.</s>

How should I prompt in order to get just the answer instead of repeating the input prompt?

There's no prompt that will do this. The model is stateless, so each time you call it, it needs to see the full prompt for context about what you want it to do, and by default generate() returns the prompt tokens together with the newly generated ones.

To retrieve just the response, slice the generated sequence at the length of the input, like so using input_token_len:

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_token_len = input_ids.shape[-1]  # number of prompt tokens
outputs = model.generate(input_ids, max_new_tokens=200)
# Drop the prompt tokens and decode only the newly generated part
print(tokenizer.decode(outputs[0][input_token_len:], skip_special_tokens=True))
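Alternatively, the transformers text-generation pipeline can do the slicing for you: passing return_full_text=False makes it return only the newly generated text. A minimal sketch, reusing the model and tokenizer loaded above:

from transformers import pipeline

# return_full_text=False strips the prompt from the returned text
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe(prompt, max_new_tokens=200, return_full_text=False)
print(result[0]["generated_text"])

Either way the model still processes the full prompt internally; you are only changing what gets returned to you.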