Pipeline vs model.generate()

I want to know what the difference is between using the pipeline() function to generate a result vs. using the model.generate() function. Which one is faster? Which one is more accurate? Which one gives consistently better responses? And what is the main difference between them? Sorry if this sounds like a dumb question, I am just wondering which method I should use to generate ML predictions for summarization, and I would like to know the pros and cons of each.

Thanks in advance

Hi,

The pipeline() API is created mostly for people who don’t care too much about the details of the underlying process, i.e. people who just want to use a machine learning model without having to implement details like pre- and post-processing themselves. It gives you an easy-to-use abstraction over any ML model, which is great for inference. The SummarizationPipeline, for instance, uses generate() behind the scenes.
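
For illustration, a minimal summarization pipeline call could look something like this (the checkpoint name is just an example):

from transformers import pipeline

# The pipeline handles tokenization, the generate() call and the decoding for you
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # example checkpoint
result = summarizer("Transformers provides thousands of pretrained models to perform tasks on text, "
                    "vision and audio, such as summarization, translation and text generation.",
                    max_length=40, min_length=5)
print(result[0]["summary_text"])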

On the other hand, if you do care about the details, then it’s recommended to call generate() directly and implement the pre- and post-processing yourself.

Also note that any text generation pipeline does provide a generate_kwargs argument, which means that technically you can forward any of the keyword arguments that generate() supports to the pipeline as well.
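
As a rough sketch (again with an example checkpoint), the two snippets below should do roughly the same thing, the first through the pipeline and the second through generate() directly:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

checkpoint = "facebook/bart-large-cnn"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

text = "Some long article to summarize ..."

# 1) Pipeline: generation kwargs such as num_beams and max_new_tokens are forwarded to generate()
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
print(summarizer(text, num_beams=4, max_new_tokens=60)[0]["summary_text"])

# 2) The same thing done manually: tokenize, call generate(), then decode
inputs = tokenizer(text, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=60)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])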

Thank you for this response, nielsr. This was what I wanted to know.

Hello,

So I tested both recently and found some very peculiar behavior under similar parameter values. This was using Galactica’s 1.3B variant:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
import torch

checkpoint = "facebook/galactica-1.3b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left") 
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.to('cuda')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0)

#With pipeline
set_seed(42)
generator(['Is this', 'What is the matter'], renormalize_logits=True, do_sample=True, use_cache=True, max_new_tokens=10)

#With model.generate()
device=torch.device('cuda',0)
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token = '<pad>'

tokenized_prompts = tokenizer(['Is this', 'What is the matter'], padding=True, return_tensors='pt')
set_seed(42)
model_op = model.generate(input_ids=tokenized_prompts['input_ids'].to(device),
                          attention_mask=tokenized_prompts['attention_mask'].to(device),
                          renormalize_logits=False, do_sample=True,
                          use_cache=True, max_new_tokens=10)
tokenizer.batch_decode(model_op, skip_special_tokens=True)

Here is the result with each:

# Pipeline output
[[{'generated_text': 'Is this method for dealing with multiple objects?\n\n\n'}],
 [{'generated_text': 'What is the matter density of a star whose radius is equal to '}]]

# model.generate() + batch_decode output
['Is this method for dealing with multiple objects?\n\n\n',
 'What is the matter of this, I know that it isn’t']

As we can see, the two methods produce different outputs even under the same settings. However, the generation for the first prompt seems to be the same across both methods, and I saw the same pattern for a bunch of other prompts. That being said, if we turn off sampling, i.e.

do_sample = False (greedy decoding)

then we get the same results. Thus, I believe this is related to the sampling method being employed, which is producing different results. Does anyone have any thoughts on this?

Hi,

Well, sampling is exactly what causes randomness :smiley: You can set a seed to get reproducible results even when using sampling:

from transformers import set_seed
set_seed(42)

Refer to the generate blog post for more details.

Do you mind sharing a concrete example of what you mean by pre- and post-processing in this context? @nielsr

Thank you in advance.

By pre-processing, I mean turning a sentence into tokens, then turning those tokens into numbers (indices in the vocabulary of a Transformer model). The tokenizer can be used for this purpose; it automatically turns text into so-called input_ids. The pipeline uses a tokenizer behind the scenes.

As for post-processing, one needs to decode the generated IDs back into text. The tokenizer can also be used for this, via its decode or batch_decode methods. The pipeline makes use of these methods as well to present the result as text.
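
Putting the two together, a minimal end-to-end sketch (using gpt2 purely as a small example checkpoint) could look like this:

from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "gpt2"  # small example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Pre-processing: text -> input_ids (and attention_mask)
inputs = tokenizer("The difference between a pipeline and generate() is", return_tensors="pt")

# Generation on the token ids
output_ids = model.generate(**inputs, max_new_tokens=20)

# Post-processing: ids -> text
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])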

Thank you for your response earlier. I have a question regarding the generate_kwargs argument needed to make .generate() perform equivalently to the pipeline.

Currently, I am using the Meta-Llama-3.1-8B-Instruct-bnb-4bit model. When I use .generate(), the output begins by repeating the input prompt before generating the desired response. Since my prompt is quite lengthy, I can only see a truncated version of it in the output.

However, when I use the pipeline, it outputs the desired response directly without repeating the prompt. I suspect the difference might be due to .generate() using greedy search for decoding, while the pipeline applies additional configuration, such as penalty terms, to avoid regenerating the prompt.

I understand from your response that this might be the case, but I am unsure how to inspect the configuration used by the pipeline and how to apply similar settings to model.generation_config. Could you provide an example code snippet illustrating how to achieve this?
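
To make it concrete, this is roughly what I am imagining, though I am not sure whether these are the right attributes to look at (pipe, model and tokenizer are just placeholder names from my own code):

# Is this the right way to see which generation settings the pipeline ends up using?
print(pipe.model.generation_config)  # pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# And is this what I should compare it against / update before calling .generate()?
print(model.generation_config)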

Thank you for your help!

@nielsr sorry, forgot to @
