Llama-2 output from forward function is nonsense, `.generate()` is okay

Tai-Mai · May 27, 2024, 5:36pm

I’m getting complete nonsense when I use Llama-2’s forward function, see below. The output I get with the .generate() function is a lot better.

I need the forward function because I have to train my model in a custom PyTorch training loop and, as far as I understand, .generate() can’t be used for training.

Here is a minimal “working” example:

import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM, LlamaTokenizer

# model_id = "meta-llama/Llama-2-7b-chat-hf"
model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({"pad_token": "<pad>"})
model = LlamaForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", cache_dir="./cache")
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

model_input = tokenizer(
    # "Hello, how are you? ###Assistant:", 
    "Hello, how are you?", 
    return_tensors="pt",
    max_length=20,
    truncation=True
    # padding="max_length",
)
model_input["input_ids"] = model_input["input_ids"].to("cuda")
model_input["attention_mask"] = model_input["attention_mask"].to("cuda")

model_output = model.generate(model_input['input_ids'], max_new_tokens=50)
# print(model_output)
output_string = tokenizer.batch_decode(model_output)[0]
print("Output with `.generate()`:\n" + output_string)
print("\n")

model_output = model(**model_input)
# print(model_output.logits.shape)
output_string = tokenizer.decode(torch.argmax(model_output.logits.squeeze(), -1))
print("Output with `.forward()`:\n" + output_string)

Output:

Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.87s/it]
Output with `.generate()`:
<s> Hello, how are you? I’m doing well, thanks for asking. everybody is in good health, so I am happy. I hope you are well too.
I’m very glad that you have visited my website. I’m sure you are looking for a


Output with `.forward()`:
nobody, I are you? I

It might be worth noting that the output from the .generate() function changes every time I rerun the script, but the forward function always gives me the same gibberish. Sometimes, .generate() also gives me gibberish that is at least somewhat grammatical, or it slips into (grammatical) German text unprompted.

I’m using 4-bit quantization, could that have anything to do with it?

I’ve been trying to troubleshoot this for two weeks and I’m getting really desperate. Any help would be so much appreciated.

RaushanTurganbay · May 27, 2024, 6:58pm

Hey!

Generation in auto-regressive models like Llama cannot be done in one forward pass. You have to generate tokens one by one, each time taking the logits at the last position, which are the prediction for the next token, and appending the chosen token to the input. So we can do it as below. Note that if you have a batched example, you also have to pass attention_mask into the model’s forward and update it at every step of the loop, the same way we update generated_text: append a 1 for each sequence, because we definitely should attend to the newly added token.

Also note that during generation we use padding_side="left", but during training we have to use padding_side="right". You can set it via tokenizer.padding_side.

generated_text = model_input["input_ids"]
with torch.no_grad():  # inference only, no gradients needed
    for i in range(20):  # generate at most 20 new tokens, as an example
        model_output = model(generated_text)
        # The logits at the last position are the prediction for the next token.
        next_tokens = model_output.logits[:, -1:, :].argmax(dim=-1)
        generated_text = torch.cat([generated_text, next_tokens], dim=-1)

output_string = tokenizer.decode(generated_text[0])
print("Output with `.forward()`:\n" + output_string)
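For the batched case mentioned above, a minimal sketch of the same loop with the attention mask extended by one column per step might look like this. The helper name `greedy_decode` is an assumption for illustration and is not part of transformers. It assumes left-padded inputs, so that the last position of every row is a real token, and it skips the KV cache and the position ids that `.generate()` handles internally, so results for padded rows may not match `.generate()` exactly.

```python
import torch


@torch.no_grad()
def greedy_decode(model, input_ids, attention_mask, max_new_tokens=20):
    """Greedy decoding via repeated forward passes (hypothetical helper).

    Assumes `model(input_ids=..., attention_mask=...)` returns an object with
    a `.logits` tensor of shape (batch, seq_len, vocab), as Hugging Face
    causal LMs do, and that the batch is left-padded.
    """
    for _ in range(max_new_tokens):
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        # Only the last position predicts a token we have not seen yet.
        next_tokens = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_tokens], dim=-1)
        # The new token is real, so it must be attended to: append a 1 per row.
        new_column = attention_mask.new_ones((attention_mask.shape[0], 1))
        attention_mask = torch.cat([attention_mask, new_column], dim=-1)
    return input_ids, attention_mask
```

With the objects from the first post, it would be called as `greedy_decode(model, model_input["input_ids"], model_input["attention_mask"])`.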

Regarding the different generations every time you call Llama: that’s because Llama has do_sample=True in its generation config, so every time it randomly samples a token from the distribution given by the logits. To get deterministic text, set do_sample=False in generate. See here for more arguments that you can tweak in generation.
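To illustrate what do_sample switches between, here is a small sketch on a made-up logits vector (the values are assumptions for illustration, not real Llama output). Greedy decoding always takes the argmax, whereas sampling draws from the softmax distribution and so can differ between runs unless the seed is fixed.

```python
import torch

# Made-up next-token logits over a 5-token vocabulary (illustration only).
logits = torch.tensor([2.0, 1.5, 0.3, -1.0, 0.0])

# do_sample=False: greedy decoding, always the highest-scoring token.
greedy_token = logits.argmax(dim=-1).item()

# do_sample=True: draw a token from the softmax distribution instead.
probs = torch.softmax(logits, dim=-1)
sampled_token = torch.multinomial(probs, num_samples=1).item()

print("greedy:", greedy_token)    # the same on every run
print("sampled:", sampled_token)  # may change from run to run

# With the model from the first post, the deterministic call would be:
# model.generate(**model_input, max_new_tokens=50, do_sample=False)
```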

For more on training, you can have a look at:

Gemma training blog post, since it’s almost the same as Llama
An example from trl for supervised fine-tuning

Tai-Mai · May 27, 2024, 9:13pm

Thank you so much for the quick and helpful reply!

Now everything makes sense. So from my understanding, the output that I got from the forward function, `nobody, I are you? I`, is made up of two parts. The first part is the past predictions, where the model tried to predict my prompt `Hello, how are you?` (each position predicting the token after it) and instead got `nobody, I are you?`. The second part is its prediction for the next token, which was `I`.
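A small sketch of that alignment, using made-up token ids and random logits rather than real Llama output (all values here are assumptions for illustration): the argmax at position i is the model's guess for token i + 1, so all but the last entry are re-predictions of the prompt and only the last entry is a new token.

```python
import torch

# Hypothetical prompt token ids, starting with a BOS token (id 1).
input_ids = torch.tensor([1, 15, 27, 8, 42])

# Stand-in for `model(**model_input).logits[0]`: shape (seq_len, vocab_size).
torch.manual_seed(0)
logits = torch.randn(input_ids.shape[0], 50)

predictions = logits.argmax(dim=-1)

# Position i predicts token i + 1, so compare against the prompt shifted by one.
prompt_guesses = predictions[:-1]  # guesses for input_ids[1:]
prompt_targets = input_ids[1:]
next_token = predictions[-1]       # the only genuinely new token

matches = (prompt_guesses == prompt_targets).sum().item()
print(f"{matches} of {len(prompt_targets)} prompt tokens were predicted correctly")
print("next token id:", next_token.item())
```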

Thank you for the demonstration on how to actually implement the auto-regressive generation as well. From what I understand now, though, that will not be necessary if I want to just purely train the model, is that correct? Just one forward pass with the whole text should be enough, right?
The reason I originally wanted to decode the output was just to double-check if my model was working correctly and I thought it didn’t when it didn’t produce what I was expecting. But it’s very good to know in case I do want to inspect the output during training in the future.
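For training, a single forward pass over the whole sequence does give a loss for every position at once. Below is a minimal sketch of the shifted cross-entropy, assuming logits of shape (batch, seq_len, vocab) and using the input ids as labels. `causal_lm_loss` is a hypothetical helper name, and the random tensors stand in for real model output. Hugging Face causal LMs perform this shift internally when `labels` is passed to the forward call, so `model(**model_input, labels=model_input["input_ids"]).loss` should be the simpler route in practice.

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits, labels, ignore_index=-100):
    """Next-token cross-entropy for a causal LM (hypothetical helper)."""
    # Logits at position i are scored against the label at position i + 1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=ignore_index,
    )


# Random stand-ins for a batch of 2 sequences, 6 tokens, vocabulary of 10.
torch.manual_seed(0)
logits = torch.randn(2, 6, 10, requires_grad=True)
labels = torch.randint(0, 10, (2, 6))

loss = causal_lm_loss(logits, labels)
loss.backward()  # in a real loop, optimizer.step() and optimizer.zero_grad() follow
print(loss.item())
```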

For posterity, I was also told that the way I added the padding token is actually not ideal because it adds an extra row to the embedding matrix that will be untrained/randomly initialized. This can cause problems with the argmax further down the line.
Instead, I can just do the following, although I’m not sure yet whether the second line is necessary:

tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
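One consequence of reusing EOS as the pad token is that padded positions can no longer be told apart from real EOS tokens by id alone. Here is a sketch of masking padding out of the training labels via the attention mask instead, assuming the usual convention that the label -100 is ignored by the loss. The token ids below are made up for illustration.

```python
import torch

eos_id = 2  # Llama-2's EOS id; also used as the pad id in this setup

# Two right-padded sequences; the first ends in a real EOS followed by padding.
input_ids = torch.tensor([[1, 15, 27, eos_id, eos_id, eos_id],
                          [1, 8, 42, 9, 31, eos_id]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1, 1]])

# Mask by attention_mask, not by token id, so real EOS tokens stay in the loss.
labels = input_ids.clone()
labels[attention_mask == 0] = -100

print(labels)
```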

