Is there any difference in latency between the general pipeline class and a more task-specific implementation? For example, is there a speed difference between the two code blocks below? And does batching change anything?
from transformers import pipeline
generator = pipeline(model="gpt2")
generator("I can't believe you did such a ", do_sample=False)
# These parameters return several suggestions and only the newly generated text, which makes it easier to use the completions as prompting suggestions.
outputs = generator("My tart needs some", num_return_sequences=4, return_full_text=False)
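To make the batching part of the question concrete, this is roughly what I had in mind. It is only a sketch, and I'm assuming the pipeline's batch_size argument and list inputs behave the way I think they do:

prompts = ["My tart needs some", "I can't believe you did such a "] * 8
# GPT-2 has no padding token by default, so I set one before batching (my understanding, not verified)
generator.tokenizer.pad_token_id = generator.tokenizer.eos_token_id
# Passing a list of prompts plus batch_size should let the pipeline batch the forward passes
batched_outputs = generator(prompts, batch_size=8, do_sample=False)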
The other method would be:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT2LMHeadModel keeps the language-modeling head, so it actually generates text like the pipeline above
model = GPT2LMHeadModel.from_pretrained('gpt2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
# Greedy decoding, comparable to do_sample=False in the pipeline call
output_ids = model.generate(**encoded_input, do_sample=False)
output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
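For what it's worth, this is how I have been trying to measure the difference so far; it's a rough wall-clock sketch rather than a proper benchmark (no warm-up runs or averaging):

import time

start = time.perf_counter()
generator("My tart needs some", do_sample=False)
pipeline_seconds = time.perf_counter() - start

start = time.perf_counter()
model.generate(**tokenizer("My tart needs some", return_tensors='pt'), do_sample=False)
direct_seconds = time.perf_counter() - start

# A single run like this is noisy; I'd average over many prompts before trusting the numbers
print(f"pipeline: {pipeline_seconds:.3f}s, direct: {direct_seconds:.3f}s")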
Moreover, are there any benchmarks on the fastest way to run HF models in general? Should they be exported to another format or runtime for optimization?
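To make the export part concrete, this is the sort of thing I was wondering about. It's a sketch assuming the optimum package and its ORTModelForCausalLM export flow work the way I think they do; I haven't verified this is the recommended path:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Export gpt2 to ONNX and run it through ONNX Runtime instead of plain PyTorch
ort_model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)
ort_tokenizer = AutoTokenizer.from_pretrained("gpt2")
ort_generator = pipeline("text-generation", model=ort_model, tokenizer=ort_tokenizer)
ort_generator("My tart needs some", do_sample=False)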