Pipeline vs model.generate()

I want to know what the difference is between using the pipeline() function to generate a result vs. using the model.generate() function. Which one is faster? Which one is more accurate? Which one more consistently gives good responses? And what is the main difference between them? Sorry if this sounds like a dumb question; I'm just wondering which method I should use to generate ML predictions for summarization, and I want to know the pros/cons of each.

Thanks in advance


Hi,

The pipeline() API is created mostly for people who don’t care too much about the details of the underlying process and just want to use a machine learning model without having to implement details like pre- and postprocessing themselves. It gives you an easy-to-use abstraction over any ML model, which is great for inference. The SummarizationPipeline, for instance, uses generate() behind the scenes.
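
For example, a minimal summarization pipeline might look like this (the checkpoint, input text and generation parameters below are just illustrative):

from transformers import pipeline

# The pipeline handles tokenization, generation and decoding for you
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

article = ("The Transformers library provides thousands of pretrained models to perform "
           "tasks on text such as classification, question answering and summarization.")
print(summarizer(article, max_length=40, min_length=5, do_sample=False)[0]["summary_text"])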

On the other hand, if you do care about the details, then it’s recommended to call generate() directly and implement the pre- and postprocessing yourself.
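
A rough sketch of what that looks like for summarization (same illustrative checkpoint and text as above; the exact pre- and postprocessing is up to you):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "sshleifer/distilbart-cnn-12-6"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

article = ("The Transformers library provides thousands of pretrained models to perform "
           "tasks on text such as classification, question answering and summarization.")

# Preprocessing: tokenize the input text yourself
inputs = tokenizer(article, truncation=True, return_tensors="pt")

# Generation: you control the decoding strategy directly
summary_ids = model.generate(**inputs, num_beams=4, max_length=40, min_length=5)

# Postprocessing: decode the generated token ids back to text
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])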

Also note that any text generation pipeline supports generate_kwargs, which means you can technically forward any of the keyword arguments that generate() supports to the pipeline as well.
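
As a rough sketch (the checkpoint and values are just placeholders), forwarding generation arguments through the pipeline call might look like this:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative checkpoint

# These keyword arguments are forwarded to model.generate() under the hood
print(generator("The quick brown fox", do_sample=True, top_k=50, max_new_tokens=20))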


Thank you for this response, nielsr. This was exactly what I wanted to know.

Hello,

So I tested both recently and found a very peculiar behavior under similar parameter values. This was using Galactica’s 1.3B variant:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
import torch

checkpoint = "facebook/galactica-1.3b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left") 
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.to('cuda')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0)

# With pipeline
set_seed(42)
generator(['Is this', 'What is the matter'], renormalize_logits=True, do_sample=True, use_cache=True, max_new_tokens=10)

# With model.generate()
device=torch.device('cuda',0)
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token = '<pad>'

tokenized_prompts = tokenizer(['Is this', 'What is the matter'], padding=True, return_tensors='pt')
set_seed(42)
model_op = model.generate(input_ids=tokenized_prompts['input_ids'].to(device),
                          attention_mask=tokenized_prompts['attention_mask'].to(device),
                          renormalize_logits=False, do_sample=True,
                          use_cache=True, max_new_tokens=10)
tokenizer.batch_decode(model_op, skip_special_tokens=True)

Here are the results with each. With the pipeline:

[[{'generated_text': 'Is this method for dealing with multiple objects?\n\n\n'}],
 [{'generated_text': 'What is the matter density of a star whose radius is equal to '}]]

And with model.generate():

['Is this method for dealing with multiple objects?\n\n\n',
 'What is the matter of this, I know that it isn’t']

As we can see, the two methods produce different outputs even under the same settings. However, the first generation is the same for both methods, and I tried this with a bunch of other prompts. That being said, if we turn off sampling, i.e.

do_sample=False (greedy decoding)

then we get the same results. So I believe this is related to the sampling method being employed, which is producing different results. Does anyone have any thoughts on this?
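
For reference, a rough sketch of the greedy variant of the two calls, reusing the setup (generator, model, tokenizer, tokenized_prompts, device) from the snippet above:

# Greedy decoding with the pipeline
generator(['Is this', 'What is the matter'], do_sample=False, max_new_tokens=10)

# Greedy decoding with model.generate()
model_op = model.generate(input_ids=tokenized_prompts['input_ids'].to(device),
                          attention_mask=tokenized_prompts['attention_mask'].to(device),
                          do_sample=False, max_new_tokens=10)
tokenizer.batch_decode(model_op, skip_special_tokens=True)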

Hi,

Well, sampling is exactly what causes the randomness :smiley: You can set a seed to get reproducible results even when using sampling:

from transformers import set_seed
set_seed(42)

Refer to the generate blog post for more details.