Llama inference with apply_chat_template

Hello,

I’m trying to follow the docs on chat templates, but I’m running into some errors.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import pandas as pd


MODEL_ID = "llama3.2-3binstruct-cache-localpath"
CSV_FILE = "mycsv.csv"
COLUMN_NAME = "column_name"



print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id  # Llama has no pad token, so reuse eos
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")


data = pd.read_csv(CSV_FILE)
docs = data[COLUMN_NAME].tolist()


chat_template_result = []


doc = docs[0]  # testing with the first document only

chat = [
    {"role": "system", "content": "You are an AI that helps to summarize document."},
    {"role": "user", "content": f"Document: {doc}\nGiven this document, summarize in one sentence concisely."},
]

tokenized_chat = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt")

print(tokenized_chat)
print("---"*20)

print(tokenizer.decode(tokenized_chat[0]))
print("---"*20)

outputs = model.generate(
    input_ids=tokenized_chat.to(model.device),
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0]))

My error messages are below.

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.

  File "/file/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/file/lib/python3.10/site-packages/transformers/generation/utils.py", line 2215, in generate
    result = self._sample(
  File "/file/lib/python3.10/site-packages/transformers/generation/utils.py", line 3249, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions
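
The attention-mask warning made me wonder whether I should have apply_chat_template return a dict, so that the attention mask gets passed to generate() as well. A minimal sketch of what I had in mind (assuming my transformers version supports return_dict=True on apply_chat_template):

inputs = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # assumption: returns input_ids AND attention_mask in my version
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,  # forwards attention_mask along with input_ids
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0]))

I also noticed the assert fires in torch.multinomial inside _sample, i.e. during sampling, so I was going to test greedy decoding (do_sample=False) to check whether the inf/nan probabilities only appear when sampling. I haven’t confirmed either of these yet.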

What seems to be the cause? Thanks.
