Llama2 pad token for batched inference

From what I can tell, the recommended approach is usually to set the pad_token as the eos_token after loading a model. However, when running batched inference with Llama2, this approach fails.

To reproduce:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Define PAD Token = EOS Token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# use different length sentences to test batching
sentences = [
    "Hello, my dog is a little",
    "Today, I",
    "I love",
    "Repeat after me: I love you.",
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "You are a human that uses a lot of expletives. For each one of your responses to the user you use expletives liberally. If you find yourself questioning whether it's appropriate or not you're to say fuck that noise and keep using your expletives regardless. Human: Tell me a story. You:"
]


inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
print(inputs['input_ids'].shape)

output_sequences = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
Running this raises:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Potential solution: I’ve found that setting the pad_token = bos_token actually fixes the issue and allows for batched inference:

# Define PAD Token = BOS Token
tokenizer.pad_token = tokenizer.bos_token
model.config.pad_token_id = model.config.bos_token_id

I’m wondering whether this is something specific to the Llama2 model, or whether setting pad_token = bos_token is not recommended for some particular reason?

Thanks in advance for any clarification!


Maybe this would help you out: Llama2

I find the following works very well:

tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"
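
If you want "[PAD]" to be a real entry in the vocabulary rather than falling back to an existing id, a more explicit variant of the same idea (just a sketch, not something from the link above) is to register it as a special token and resize the embeddings:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Register "[PAD]" as a proper special token and grow the embedding matrix so
# the new id has a row; pad positions are masked out by the attention mask,
# so the untrained row never influences generation.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

This keeps pad_token_id distinct from eos_token_id and bos_token_id, so nothing downstream confuses padding with end-of-sequence.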

I used to use what you had, but I found that doing batch inference with that setting gives different results compared to sequential inference, which is not supposed to happen.
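
If you want to check this yourself, a quick sanity test (a sketch reusing the model and a couple of the prompts from the first post) is to greedy-decode each prompt on its own and then in a batch, and compare the outputs:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = "[PAD]"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model.config.pad_token_id = tokenizer.pad_token_id

sentences = ["Hello, my dog is a little", "Today, I"]

# Sequential: one prompt at a time, so no padding is involved.
sequential = []
for s in sentences:
    ids = tokenizer(s, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    sequential.append(tokenizer.decode(out[0], skip_special_tokens=True))

# Batched: the shorter prompt gets left-padded to the length of the longest one.
batch = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
out = model.generate(**batch, max_new_tokens=20, do_sample=False)
batched = tokenizer.batch_decode(out, skip_special_tokens=True)

# With padding handled correctly the two lists should match
# (small fp16 numerical differences can occasionally still flip a token).
print(sequential == batched)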


Works like a charm, thanks!

I got the same error using this code with llama-2-7b-chat-hf; however, llama-2-13b-chat-hf doesn’t yield such an error.

As for setting up efficient batching: what I think was done is an old trick. Concatenate a bunch of strings with an eos_token in between into one long continuous string, then chunk it.

I suspect this is what was done to train the model, truncating the last piece, so there is no need for a pad token.
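
A minimal sketch of that concatenate-and-chunk idea (the block size and the toy texts below are made up, just to show the shape of it):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

texts = ["first document ...", "second document ...", "third document ..."]
block_size = 512  # arbitrary chunk length for illustration

# Concatenate everything into one long token stream, with an EOS between documents.
stream = []
for t in texts:
    stream.extend(tokenizer(t, add_special_tokens=False)["input_ids"])
    stream.append(tokenizer.eos_token_id)

# Cut the stream into fixed-length blocks; the ragged tail is simply dropped,
# so no pad token is ever needed.
blocks = [stream[i:i + block_size]
          for i in range(0, len(stream) - block_size + 1, block_size)]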

Hi @LaferriereJC, I’ve recently been considering using this technique, which I guess could be called “sequence packing”, but I can’t find much useful code for it. Could you please share some materials if you know of any? Thanks in advance!

Wrote a custom dynamic programming solution here for 100% efficient batching.