Error using datasets with pipeline for text generation

I am running inference with llama-3-8b for text generation. If I pass only one prompt at a time, my code works. However, I have a for loop that iterates over 500 prompts, calling the model for each one, and Hugging Face gave me the following warning:

UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
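
(For context, the sequential version looks roughly like this:)

# One pipeline call per prompt: works, but triggers the warning on GPU
for prompt in prompts:  # ~500 prompt strings
    result = pipe(prompt, max_new_tokens=200)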

I have thus tested out the suggestion in the warning, but I am getting an error and I can’t figure out what the mistake is. Here’s a snippet of my code:

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline
)
import torch
from datasets import Dataset
from transformers.pipelines.pt_utils import KeyDataset


model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    temperature=1
)

message = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
messages = [message, message]
dataset_messages = {"messages": messages}
messages_dataset = Dataset.from_dict(dataset_messages)

sequences = pipe(
    KeyDataset(messages_dataset,"messages"),
    max_new_tokens=200,
    do_sample=True,
    return_full_text=False,
    top_k=1
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

However, I get the following error:

text_generation.py", line 266, in preprocess
prefix + prompt_text,
~^~~~~~~
TypeError: can only concatenate str (not "list") to str

when I call pipe on the datasets object. My datasets version is 2.18.0; I am aware of this issue, but I believe it is no longer caused by that update. Can anyone help me figure out whether I am using datasets correctly?
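
For reference, the failing line reduces to concatenating a string with a list, which is exactly what happens when a chat (a list of role/content dicts) reaches a spot that expects a prompt string:

# Minimal reproduction of the TypeError raised inside preprocess()
prefix = ""
prompt_text = [{"role": "user", "content": "hi"}]  # a chat list, not a string
prefix + prompt_text  # TypeError: can only concatenate str (not "list") to str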

Thank you so much!


Hi there, were you able to solve this in some way? I have the same problem (datasets version 2.19.1).


Perhaps this?

I have running code that is similar to the answer proposed in that Stack Overflow question (i.e. a function that takes one pandas row as input and returns a model response via a Hugging Face pipeline, which I then apply to the whole DataFrame).

However, I would like to test whether directly passing an iterable over the whole dataset (e.g. a KeyDataset object) to the pipeline, which would enable batching, speeds things up. That is where I get the same error as the OP. A rough sketch of both patterns is below.
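
The per-row version that works for me (assuming a pandas DataFrame df with a string column "prompt"; names are illustrative):

def generate(row):
    # One pipeline call per row: slow, but avoids the TypeError
    out = pipe(row["prompt"], max_new_tokens=200, return_full_text=False)
    return out[0]["generated_text"]

df["response"] = df.apply(generate, axis=1)

And the iterable version, which fails with the same TypeError whenever the column holds chat-formatted lists instead of plain strings:

for out in pipe(KeyDataset(Dataset.from_pandas(df), "prompt"), batch_size=8):
    print(out[0]["generated_text"])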

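Try flattening each chat to a single prompt string before building the dataset, e.g. with the tokenizer’s chat template:
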
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline
)
import torch
from datasets import Dataset
from transformers.pipelines.pt_utils import KeyDataset

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Create pipeline for text generation
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    temperature=1
)

# Define the messages
message = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}
]

# Flatten each chat into a single prompt string with the chat template, since
# the text-generation pipeline's preprocess step expects strings, not lists
prompt = tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
dataset_messages = {"messages": [prompt] * 2}  # repeat the same prompt for 2 entries
messages_dataset = Dataset.from_dict(dataset_messages)

# Use KeyDataset to pass to the pipeline
sequences = pipe(
    KeyDataset(messages_dataset, "messages"),
    max_new_tokens=200,
    do_sample=True,
    return_full_text=False,
    top_k=1
)

# Print results
for seq in sequences:
    print(f"Result: {seq['generated_text']}")


The error occurs because the text-generation pipeline’s preprocess step expects each input to be a string, not a list of chat messages. Flatten each chat to a single prompt string (for example with the tokenizer’s chat template, as in the code above) before building the dataset.
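
If throughput is the goal, you can also pass batch_size once the inputs are strings. One caveat (worth verifying for your transformers version): Llama’s tokenizer ships without a pad token, which batched generation needs, so a common workaround is to reuse the EOS token:

# Batched variant (batch size is arbitrary; tune it to your GPU memory)
tokenizer.pad_token_id = tokenizer.eos_token_id  # Llama has no pad token by default

sequences = pipe(
    KeyDataset(messages_dataset, "messages"),
    batch_size=8,
    max_new_tokens=200,
    do_sample=True,
    return_full_text=False,
)

for seq in sequences:
    print(f"Result: {seq[0]['generated_text']}")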
