Pipeline vs model.generate()

I want to know what the difference is between using the pipeline() function to generate a result vs. using the model.generate() function. Which one is faster? Which one is more accurate? Which one more consistently gives good responses? And what is the main difference between them? Sorry if this sounds like a dumb question; I'm just wondering which method I should use to generate ML predictions for summarization, and I want to know the pros/cons of each.

Thanks in advance


Hi,

The pipeline() API is created mostly for people who don’t care too much about the details of the underlying process and just want to use a machine learning model without having to implement details like pre- and postprocessing themselves. It gives you an easy-to-use abstraction over any ML model, which is great for inference. The SummarizationPipeline, for instance, uses generate() behind the scenes.
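
For example, a minimal summarization pipeline might look like this (the checkpoint, input text and generation parameters below are just illustrative):

from transformers import pipeline

# The pipeline handles tokenization, generation and decoding for you
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

article = ("The Transformers library provides thousands of pretrained models to perform "
           "tasks on text such as classification, question answering and summarization.")
print(summarizer(article, max_length=40, min_length=5, do_sample=False)[0]["summary_text"])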

On the other hand, if you do care about the details, then it’s recommended to call generate() directly and implement the pre- and postprocessing yourself.
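
A rough sketch of what that looks like for summarization (same illustrative checkpoint and text as above; the exact pre- and postprocessing is up to you):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "sshleifer/distilbart-cnn-12-6"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

article = ("The Transformers library provides thousands of pretrained models to perform "
           "tasks on text such as classification, question answering and summarization.")

# Preprocessing: tokenize the input text yourself
inputs = tokenizer(article, truncation=True, return_tensors="pt")

# Generation: you control the decoding strategy directly
summary_ids = model.generate(**inputs, num_beams=4, max_length=40, min_length=5)

# Postprocessing: decode the generated token ids back to text
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])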

Also note that any text generation pipeline supports generate_kwargs, which means you can technically forward any of the keyword arguments that generate() supports to the pipeline as well.
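
As a rough sketch (the checkpoint and values are just placeholders), forwarding generation arguments through the pipeline call might look like this:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative checkpoint

# These keyword arguments are forwarded to model.generate() under the hood
print(generator("The quick brown fox", do_sample=True, top_k=50, max_new_tokens=20))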


Thank you for this response, nielsr. This was exactly what I wanted to know.

Hello,

So I tested both recently and found a very peculiar behavior under similar parameter values. This was using Galactica’s 1.3B variant:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
import torch

checkpoint = "facebook/galactica-1.3b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left") 
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.to('cuda')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0)

# With pipeline
set_seed(42)
generator(['Is this', 'What is the matter'], renormalize_logits=True, do_sample=True, use_cache=True, max_new_tokens=10)

# With model.generate()
device=torch.device('cuda',0)
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token = '<pad>'

tokenized_prompts = tokenizer(['Is this', 'What is the matter'], padding=True, return_tensors='pt')
set_seed(42)
model_op = model.generate(input_ids=tokenized_prompts['input_ids'].to(device),
                          attention_mask=tokenized_prompts['attention_mask'].to(device),
                          renormalize_logits=False, do_sample=True,
                          use_cache=True, max_new_tokens=10)
tokenizer.batch_decode(model_op, skip_special_tokens=True)

Here are the results with each. With the pipeline:

[[{'generated_text': 'Is this method for dealing with multiple objects?\n\n\n'}],
 [{'generated_text': 'What is the matter density of a star whose radius is equal to '}]]

And with model.generate():

['Is this method for dealing with multiple objects?\n\n\n',
 'What is the matter of this, I know that it isn’t']

As we can see, the two methods produce different outputs even under the same settings. However, the first generation is the same for both methods, and I tried this with a bunch of other prompts. That being said, if we turn off sampling, i.e.

do_sample=False (greedy decoding)

then we get the same results. So I believe this is related to the sampling method being employed, which is producing different results. Does anyone have any thoughts on this?
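
For reference, a rough sketch of the greedy variant of the two calls, reusing the setup (generator, model, tokenizer, tokenized_prompts, device) from the snippet above:

# Greedy decoding with the pipeline
generator(['Is this', 'What is the matter'], do_sample=False, max_new_tokens=10)

# Greedy decoding with model.generate()
model_op = model.generate(input_ids=tokenized_prompts['input_ids'].to(device),
                          attention_mask=tokenized_prompts['attention_mask'].to(device),
                          do_sample=False, max_new_tokens=10)
tokenizer.batch_decode(model_op, skip_special_tokens=True)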

Hi,

Well, sampling is exactly what causes the randomness :smiley: You can set a seed to get reproducible results even when using sampling:

from transformers import set_seed
set_seed(42)

Refer to the generate blog post for more details.