Llama-2 output from forward function is nonsense, `.generate()` is okay

I'm getting complete nonsense when I use Llama-2's forward function; see below. The output I get with the .generate() function is a lot better.

The reason I need the forward function is that I have to train my model in a custom PyTorch training loop and, as far as I understand, .generate() can't be used for training.

Here is a minimal "working" example:

import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM, LlamaForSequenceClassification, LlamaTokenizer

# model_id = "meta-llama/Llama-2-7b-chat-hf"
model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({"pad_token": "<pad>"})
model = LlamaForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", cache_dir="./cache")
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

model_input = tokenizer(
    # "Hello, how are you? ###Assistant:", 
    "Hello, how are you?", 
    return_tensors="pt",
    max_length=20,
    truncation=True
    # padding="max_length",
)
model_input["input_ids"] = model_input["input_ids"].to("cuda")
model_input["attention_mask"] = model_input["attention_mask"].to("cuda")

model_output = model.generate(model_input['input_ids'], max_new_tokens=50)
# print(model_output)
output_string = tokenizer.batch_decode(model_output)[0]
print("Output with `.generate()`:\n" + output_string)
print("\n")

model_output = model(**model_input)
# print(model_output.logits.shape)
output_string = tokenizer.decode(torch.argmax(model_output.logits.squeeze(), -1))
print("Output with `.forward()`:\n" + output_string)

Output:

Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.87s/it]
Output with `.generate()`:
<s> Hello, how are you? I'm doing well, thanks for asking. everybody is in good health, so I am happy. I hope you are well too.
I'm very glad that you have visited my website. I'm sure you are looking for a


Output with `.forward()`:
nobody, I are you? I

It might be worth noting that the output from the .generate() function changes every time I rerun the script, but the forward function always gives me the same gibberish. Sometimes .generate() also gives me gibberish that is at least somewhat grammatical, or it slips into (grammatical) German text unprompted.

I'm using 4-bit quantization; could that have anything to do with it?

I've been trying to troubleshoot this for two weeks and I'm getting really desperate. Any help would be so much appreciated.


Hey!

Generation in auto-regressive models like Llama cannot be done in one forward pass. You have to generate tokens one by one, each time taking the logits at the last position, which are the prediction for the next token. So we can do it as shown below. Note that if you have a batched example, you also have to pass attention_mask into the model's forward, and at every step of the loop update it the same way we update "generated_text". In other words, append a "1" to the attention mask, because we definitely should attend to the newly added token.

Also note that during generation we use padding_side="left", but during training we have to use padding_side="right". You can set it via tokenizer.padding_side.

generated_text = model_input.input_ids
for i in range(20):  # generate 20 new tokens, as an example
    with torch.no_grad():  # inference only, no gradients needed
        model_output = model(generated_text)
    # the logits at the last position are the prediction for the next token
    next_tokens = model_output.logits[:, -1:, :].argmax(dim=-1)
    generated_text = torch.cat([generated_text, next_tokens], dim=-1)

output_string = tokenizer.decode(generated_text[0])
print("Output with `.forward()`:\n" + output_string)
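
For the batched case mentioned above, here is a minimal sketch of the same loop that also appends a column of ones to the attention mask at every step. `ToyLM` and `greedy_decode` are hypothetical names used only for illustration. The toy model is a randomly initialized stand-in that ignores the mask, so the snippet runs without downloading Llama; a real model would use the mask, and a real loop would also stop at the EOS token.

```python
import types
import torch

class ToyLM(torch.nn.Module):
    """Hypothetical stand-in for a causal LM: returns an object with `.logits`."""
    def __init__(self, vocab_size=16, hidden=8):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, attention_mask=None):
        # The toy model ignores attention_mask; a real model uses it to skip padding.
        return types.SimpleNamespace(logits=self.head(self.embed(input_ids)))

def greedy_decode(model, input_ids, attention_mask, max_new_tokens=20):
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        # Logits at the last position predict the next token of each sequence.
        next_tokens = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_tokens], dim=-1)
        # Attend to the token that was just appended.
        attention_mask = torch.cat([attention_mask, torch.ones_like(next_tokens)], dim=-1)
    return input_ids, attention_mask

torch.manual_seed(0)
model = ToyLM().eval()
# Two left-padded prompts (0 is used as the pad id in this toy example).
input_ids = torch.tensor([[0, 5, 7], [3, 4, 9]])
attention_mask = torch.tensor([[0, 1, 1], [1, 1, 1]])
out_ids, out_mask = greedy_decode(model, input_ids, attention_mask, max_new_tokens=4)
print(out_ids.shape, out_mask[0].tolist())
```

The prompt columns of the mask stay as they were, and only the newly generated positions get a 1.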

As for getting a different generation every time you call Llama: its generation config sets do_sample=True, so at every step it randomly samples a token from the distribution given by the logits. To get deterministic text, set do_sample=False in generate. See here for more arguments that you can tweak in generation.
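
To illustrate the difference between the two decoding modes, here is a small standard-library sketch of picking one token from a vector of logits, once greedily and once by sampling. The helper names (`softmax`, `pick_greedy`, `pick_sampled`) and the logit values are made up for illustration; this is not the code `generate` runs internally, only the idea behind do_sample.

```python
import math
import random

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(logits):
    """Greedy choice: always the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def pick_sampled(logits, rng):
    """Sampling: draw an index at random according to the softmax probabilities."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

logits = [2.0, 1.5, 0.1]  # made-up scores for a 3-token vocabulary
rng = random.Random(0)

greedy_picks = {pick_greedy(logits) for _ in range(100)}
sampled_picks = {pick_sampled(logits, rng) for _ in range(100)}
print("greedy:", greedy_picks, "sampled:", sampled_picks)
```

The greedy choice is the same on every call, while sampling returns different tokens across calls, which is why repeated runs of a sampled generation diverge.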

For more on training, you can have a look at:

  1. The Gemma training blog post, since it's almost the same as for Llama
  2. An example from trl for supervised fine-tuning

Thank you so much for the quick and helpful reply!

Now everything makes sense. So from my understanding, the output that I got from the forward function, "nobody, I are you? I", is made up of two parts. The first part is the past predictions, where the model tried to predict my prompt "Hello, how are you?" and instead got "nobody, I are you?". The second part is its prediction for the next token, which was "I".

Thank you for the demonstration of how to actually implement the auto-regressive generation as well. From what I understand now, though, that will not be necessary if I just want to train the model, is that correct? Just one forward pass with the whole text should be enough, right?
The reason I originally wanted to decode the output was just to double-check that my model was working correctly, and I thought it wasn't when it didn't produce what I was expecting. But it's very good to know in case I do want to inspect the output during training in the future.
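
To make that alignment concrete, here is a minimal sketch of the next-token loss computed from a single forward pass: the logits at position t are scored against the token at position t+1. `next_token_loss` is a hypothetical helper, and the logits are hand-built rather than produced by a model, so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids):
    """Cross-entropy where the logits at position t are scored against token t+1."""
    shift_logits = logits[:, :-1, :]  # the last position has no target in the input
    shift_labels = input_ids[:, 1:]   # the first token is never predicted
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

vocab_size = 10
input_ids = torch.tensor([[1, 4, 2, 7, 3]])

# Fake logits from a "perfect" model: position t puts all its mass on token t+1.
perfect = torch.full((1, 5, vocab_size), -50.0)
for t in range(4):
    perfect[0, t, input_ids[0, t + 1]] = 50.0

loss_perfect = next_token_loss(perfect, input_ids)
loss_uniform = next_token_loss(torch.zeros(1, 5, vocab_size), input_ids)
print(loss_perfect.item(), loss_uniform.item())
```

As far as I know, the Hugging Face causal LM classes apply this shift internally when labels=input_ids is passed to the forward call, so in a training loop you normally would not shift by hand.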

For posterity, I was also told that the way I added the padding token is not ideal, because it adds an extra vocabulary entry (an extra dimension in the logits) that is untrained/randomly initialized. This can cause problems with the argmax further down the line.
Instead, I can just do the following, although I'm not sure yet whether the second line is necessary:

tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
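
One thing to watch for when the pad token reuses EOS: if training labels are masked by comparing ids against the pad id, the real EOS at the end of each sequence is masked too. A minimal sketch of masking by attention mask instead is below. `build_labels` is a hypothetical helper, and the token ids are assumed values for illustration.

```python
import torch

EOS_ID = 2           # assumed id, for illustration only
PAD_ID = EOS_ID      # the pad token reuses EOS, as in the snippet above
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(input_ids, attention_mask):
    """Copy input_ids and ignore only the positions the attention mask marks as padding."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = IGNORE_INDEX
    return labels

# Two right-padded sequences, each ending with a real EOS.
input_ids = torch.tensor([[1, 15, 27, EOS_ID, PAD_ID, PAD_ID],
                          [1, 15, 27, 33, 41, EOS_ID]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1, 1]])

labels = build_labels(input_ids, attention_mask)
print(labels.tolist())
```

The real EOS tokens stay in the labels, so the model can still learn where a sequence ends, while the padding positions are excluded from the loss.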
