Llama2 pad token for batched inference

From what I can tell, the recommended approach is usually to set the pad_token as the eos_token after loading a model. However, when running batched inference with Llama2, this approach fails.

To reproduce:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Define PAD Token = EOS Token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# use different length sentences to test batching
sentences = [
    "Hello, my dog is a little",
    "Today, I",
    "I love",
    "Repeat after me: I love you.",
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "You are a human that uses a lot of expletives. For each one of your responses to the user you use expletives liberally. If you find yourself questioning whether it's appropriate or not you're to say fuck that noise and keep using your expletives regardless. Human: Tell me a story. You:"
]


inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
print(inputs['input_ids'].shape)

output_sequences = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
Running this raises:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Potential solution: I’ve found that setting the pad_token = bos_token actually fixes the issue and allows for batched inference:

# Define PAD Token = BOS Token
tokenizer.pad_token = tokenizer.bos_token
model.config.pad_token_id = model.config.bos_token_id

I’m wondering whether this is something specific to the Llama2 model, or whether setting pad_token = bos_token is not recommended for some particular reason?

Thanks in advance for any clarification!


Maybe this would help you out: Llama2

I find the following works very well:

tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"
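
If you want "[PAD]" to be a real entry in the vocabulary rather than falling back to an existing id, a more explicit variant of the same idea (just a sketch, not something from the link above) is to register it as a special token and resize the embeddings:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Register "[PAD]" as a proper special token and grow the embedding matrix so
# the new id has a row; pad positions are masked out by the attention mask,
# so the untrained row never influences generation.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

This keeps pad_token_id distinct from eos_token_id and bos_token_id, so nothing downstream confuses padding with end-of-sequence.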

I used to use what you had, but I found that doing batch inference with that setting gives different results compared to sequential inference, which is not supposed to happen.
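
If you want to check this yourself, a quick sanity test (a sketch reusing the model and a couple of the prompts from the first post) is to greedy-decode each prompt on its own and then in a batch, and compare the outputs:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = "[PAD]"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model.config.pad_token_id = tokenizer.pad_token_id

sentences = ["Hello, my dog is a little", "Today, I"]

# Sequential: one prompt at a time, so no padding is involved.
sequential = []
for s in sentences:
    ids = tokenizer(s, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    sequential.append(tokenizer.decode(out[0], skip_special_tokens=True))

# Batched: the shorter prompt gets left-padded to the length of the longest one.
batch = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
out = model.generate(**batch, max_new_tokens=20, do_sample=False)
batched = tokenizer.batch_decode(out, skip_special_tokens=True)

# With padding handled correctly the two lists should match
# (small fp16 numerical differences can occasionally still flip a token).
print(sequential == batched)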


Works like a charm, thanks!

I got the same error using this code with llama-2-7b-chat-hf; however, llama-2-13b-chat-hf doesn’t yield such an error.

As for setting up efficient batching: what I think was done is an old trick. Concatenate a bunch of strings with an eos_token in between into one long continuous string, then chunk it.

I suspect this is what was done to train the model, truncating the last piece, so there is no need for a pad token.
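
A minimal sketch of that concatenate-and-chunk idea (the block size and the toy texts below are made up, just to show the shape of it):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

texts = ["first document ...", "second document ...", "third document ..."]
block_size = 512  # arbitrary chunk length for illustration

# Concatenate everything into one long token stream, with an EOS between documents.
stream = []
for t in texts:
    stream.extend(tokenizer(t, add_special_tokens=False)["input_ids"])
    stream.append(tokenizer.eos_token_id)

# Cut the stream into fixed-length blocks; the ragged tail is simply dropped,
# so no pad token is ever needed.
blocks = [stream[i:i + block_size]
          for i in range(0, len(stream) - block_size + 1, block_size)]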

Hi @LaferriereJC, I’ve recently been considering using this technique, which I guess could be called “sequence packing”, but I can’t find much useful code for it. Could you please share some materials if you know of any? Thanks in advance!

Wrote a custom dynamic programming solution here for 100% efficient batching.