Llama-2 7B-hf repeats context of question directly from input prompt, cuts off with newlines

Context: I am trying to query Llama-2 7B, taken from HuggingFace (meta-llama/Llama-2-7b-hf). I give it a question and context (I would guess anywhere from 200-1000 tokens), and ask it to answer the question based on the context (context is retrieved from a vectorstore using similarity search). Here are my two problems:

  1. The answer ends, and the rest of the tokens until it reaches max_new_tokens are all newlines. Or it doesn’t generate any text at all and the entire response is newlines. Adding a repetition_penalty of 1.1 or greater has solved the infinite newline generation, but does not get me full answers.
  2. The answers that do generate are copied word for word from the given context. This remains the same with repetition_penalty=1.1, and making the repetition penalty too high turns the answer into nonsense.

I have only tried temperature=0.4 and temperature=0.8, but so far tuning temperature and repetition_penalty has resulted in either the context being copied or a nonsensical answer.

Note about the “context”: I am using a document stored in a Chroma vector store, and similarity search retrieves the relevant information before I pass it to Llama.
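
For reference, the retrieval step looks roughly like this (a minimal sketch assuming LangChain’s Chroma wrapper; the embedding model and persist directory are illustrative placeholders):

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Illustrative setup: an already-persisted Chroma collection and a default embedding model.
embeddings = HuggingFaceEmbeddings()
vectorstore = Chroma(persist_directory="db", embedding_function=embeddings)

query = "Summarize Topic X"

# Retrieve the 3 most similar chunks for the query.
docs = vectorstore.similarity_search(query, k=3)

# Join the retrieved chunks with newlines before inserting them into the prompt.
context = "\n".join(doc.page_content for doc in docs)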

Example Problem:
My query is to summarize a certain Topic X.

query = "Summarize Topic X"

The retrieved context from the vectorstore has 3 sources that look something like this (in my query to the LLM I format the sources separated by newlines):

context = """When talking about Topic X, Scenario Y is always referred to. This is due to the relation of
Topic X is a broad topic which covers many aspects of life.
No one knows when Topic X became a thing, its origin is unknown even to this day."""

Then the response from Llama-2 directly mirrors one piece of context, and includes no information from the others. Furthermore, it produces many newlines after the answer. If the answer is 100 tokens, and max_new_tokens is 150, I have 50 newlines.

response = "When talking about Topic X, Scenario Y is always referred to. This is due to the relation of \n\n\n\n"

One of my biggest issues is that, in addition to copying only one piece of the context, if that piece ends mid-sentence, the LLM response cuts off mid-sentence too.


Is anyone else experiencing anything like this (newline issue or copying part of your input prompt)? Has anyone found a solution?


I’m having kind of the same issue here. Sometimes I expect a long answer, so I set max_new_tokens to a high number. But if I do that and the answer turns out to be short, the model responds and then appends part of my input prompt until it reaches the max_new_tokens value. I have seen examples with Llama-1 where the model could give both short and long answers without padding the output with nonsense words to reach max_new_tokens. Did I do something wrong during fine-tuning?

I have some questions first: How are you using the model (with generate, the pipeline, etc.)? Could you share the final formatted prompt that you pass to the model as input?
Apart from that: you are giving Llama an instruction, and for that purpose I think meta-llama/Llama-2-7b-chat-hf would be the better choice.

I’m using the model with model.generate(), not the pipeline.
The prompt looked something like:

"""Use the given question to guide your summary about the context. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Question: Summarize Topic X
Context: 
When talking about Topic X, Scenario Y is always referred to. This is due to the relation of

Topic X is a broad topic which covers many aspects of life.

No one knows when Topic X became a thing, its origin is unknown even to this day.

Summary:"""

Could you please elaborate further on why you think the chat model would be better for this purpose?

I was able to reproduce the behavior you described. Afterwards I tried it with the chat model, and it was hardly better. Then I tried to reproduce the example Hugging Face gave here: Llama 2 is here - get it on Hugging Face (in the Inference section). The model in that example was asked:

I liked “Breaking Bad” and “Band of Brothers”. Do you have any recommendations of other shows I might like?

It turns out the result they showed was cherry-picked; sometimes the model just continues with

I am a fan of crime dramas, but I also enjoy historical dramas and comedies. I am open to watching something new and different, but I want it to be good quality and engaging.
Please let me know if you have any suggestions.
Thanks!

So I checked the original implementation, and you need to put an [INST] token at the beginning and a [/INST] token at the end. When I give the prompt

[INST]I liked “Breaking Bad” and “Band of Brothers”. Do you have any recommendations of other shows I might like?[/INST]

it always gives a reasonable answer.
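
In code, the wrapping is plain string formatting around the user message before tokenizing. A minimal sketch (the generation settings here are illustrative, not the ones from my test):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Wrap the user message in [INST] ... [/INST] before tokenizing.
prompt = (
    "[INST]I liked \"Breaking Bad\" and \"Band of Brothers\". "
    "Do you have any recommendations of other shows I might like?[/INST]"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))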

For your use case my prompt was

[INST]Use the given question to guide your summary about the context. If you don’t know the answer, just say that you don’t know, don’t try to make up an answer.
Question: Summarize Topic X
Context:
When talking about Topic X, Scenario Y is always referred to. This is due to the relation of
Topic X is a broad topic which covers many aspects of life.
No one knows when Topic X became a thing, its origin is unknown even to this day.[/INST]

and it worked. I’m not judging the quality of the summary here, but the model understood what it should do.

Concerning the difference between the foundation model and the chat model: the chat model knows when it should respond to something. Your prompting approach was basically correct for a foundation model, since your last word was “Summary:”. However, the foundation model found ways to work around the task of giving a summary, while the chat model is trained to follow the instructions strictly.

Edit: later in the blog post referenced above, the correct prompting scheme is given, along with this example:

<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

There’s a llama in my garden 😱 What should I do? [/INST]
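
In code, that template can be assembled like this (a sketch; the system text is shortened here and the variable names are illustrative):

# Build the full chat prompt; the <s> (BOS) token is added by the tokenizer,
# so it is left out of the string itself.
system_prompt = (
    "You are a helpful, respectful and honest assistant. "
    "If you don't know the answer to a question, please don't share false information."
)
user_message = "There's a llama in my garden 😱 What should I do?"

prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"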

Thanks for the advice! It was very helpful, and I appreciate you looking into the original implementation. However, I may be doing something wrong if the [INST] ... [/INST] prompt worked for you. I tried it, and it no longer produces the newlines, which would be great; the issue is that it now repeats the exact query I pass into model.generate, and my output looks like this:

[INST]Use the given question to guide your summary about the context. If you don’t know the answer, just say that you don’t know, don’t try to make up an answer.
Question: Summarize the given text
Context: The text is about the history of Topic X.
Topic X is a broad topic which covers many aspects of life. No one knows when Topic X became a thing, its origin is unknown even to this day.
Topic X is a broad topic which covers many aspects of life. No one knows when Topic X became a thing, its origin is unknown even to this day. [/INST]
[INST]Use the given question to guide your summary about the context. If you don’t know the answer, just say that you don’t know, don’t try to make up an answer. Question: Summarize the given text Context: The text is about the
[INST]Use the

It repeats the query until it cuts off. It is interesting how it slightly varies it, like adding a space to turn "this day.[/INST]" into "this day. [/INST]". Was there anything else that you did while testing?

@Khagerman I don’t think this is an issue; it’s just that the outputs probably include the input.

When you decode the outputs, you need to strip out the input tokens first so that the decoded string does not repeat the inputs.

That’s my guess because it regularly happens to me when using generate (i.e. not using the pipeline).


@RonanMcGovern I thought something similar, but when I changed my model to TheBloke/Nous-Hermes-Llama2-GPTQ and changed nothing else about my generation, it worked fine.

Even though my problem was fixed, I feel I should keep the question open, since the problem itself hasn’t been solved for the model this question is about.

@Khagerman can you paste your inference script here, including the encoding and decoding steps? Cheers

Here’s the script. Just one note: I’m using a wrapper class here, which is why you see self.model/self.tokenizer, and gen_config comes from GenerationConfig.from_model_config(self.model.generation_config) so I can change max_new_tokens on the fly.

# Tokenize the prompt and move the tensors to the model's device.
inputs = self.tokenizer(query, return_tensors='pt').to(self.model.device)
outputs = self.model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    generation_config=gen_config
)
# Decode the full returned sequence.
output_str = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

OK, have you tried printing output_str to the screen and checking whether it contains the inputs?

Here is the type of code snippet I use to grab only the new tokens:

new_tokens = generation_output[0][input_ids.shape[-1]:]

in your case this would be outputs[0][inputs['input_ids'].shape[-1]:] - this basically grabs everything after the input tokens.
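
With the variable names from your script, the full decode step would look something like this (only the newly generated tokens are decoded):

# Slice off the prompt tokens, then decode only what the model generated.
input_length = inputs['input_ids'].shape[-1]
new_tokens = outputs[0][input_length:]
answer = self.tokenizer.decode(new_tokens, skip_special_tokens=True)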


I didn’t realize you could do that to remove the input!
I did something similar; I now realize I left a key part of my generation process out of the script above:

# Removes the original query from the generated response.
return output_str.replace(query, "")

However, I think this shows that what is generated afterwards is the LLM copying a portion of the input context, but not all of it, which is why the replace doesn’t remove it.


I am not sure whether I understand your question correctly. If you want the answer from Llama 2 to not include the prompt you provide, you can use return_full_text=False:

sequences = pipeline(
    myPrompt,
    do_sample=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=4096, # max length of output, default=4096
    return_full_text=False, # to not repeat the question, set to False
    top_k=10, # default=10
    # top_p=0.5, # default=0.9
    temperature=0.6, # default=0.
)
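
For completeness, the pipeline itself was created roughly like this (a sketch following the Hugging Face example; the model id and dtype are assumptions on my side):

import torch
import transformers
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Standard text-generation pipeline; return_full_text is passed per call in the snippet above.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)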

Hi @kanasva, that’s a good idea. Can you share where you found the documentation for the arguments the pipeline can take?