Is there any difference in latency between the general pipeline class and a more task-specific implementation? For example, is there a speed difference between the two code blocks below? And does batching change anything?
from transformers import pipeline
generator = pipeline(model="gpt2")
generator("I can't believe you did such a ", do_sample=False)
# These parameters return several suggestions and only the newly generated text, which makes it easier to use the completions as prompting suggestions.
outputs = generator("My tart needs some", num_return_sequences=4, return_full_text=False)
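To make the batching part of the question concrete, this is roughly what I had in mind. It is only a sketch, and I'm assuming the pipeline's batch_size argument and list inputs behave the way I think they do:

prompts = ["My tart needs some", "I can't believe you did such a "] * 8
# GPT-2 has no padding token by default, so I set one before batching (my understanding, not verified)
generator.tokenizer.pad_token_id = generator.tokenizer.eos_token_id
# Passing a list of prompts plus batch_size should let the pipeline batch the forward passes
batched_outputs = generator(prompts, batch_size=8, do_sample=False)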
The other method would be:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT2LMHeadModel keeps the language-modeling head, so it actually generates text like the pipeline above
model = GPT2LMHeadModel.from_pretrained('gpt2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
# Greedy decoding, comparable to do_sample=False in the pipeline call
output_ids = model.generate(**encoded_input, do_sample=False)
output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
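For what it's worth, this is how I have been trying to measure the difference so far; it's a rough wall-clock sketch rather than a proper benchmark (no warm-up runs or averaging):

import time

start = time.perf_counter()
generator("My tart needs some", do_sample=False)
pipeline_seconds = time.perf_counter() - start

start = time.perf_counter()
model.generate(**tokenizer("My tart needs some", return_tensors='pt'), do_sample=False)
direct_seconds = time.perf_counter() - start

# A single run like this is noisy; I'd average over many prompts before trusting the numbers
print(f"pipeline: {pipeline_seconds:.3f}s, direct: {direct_seconds:.3f}s")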
Moreover, are there any benchmarks on the fastest way to run HF models in general? Should they be exported to another format or runtime for optimization?
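To make the export part concrete, this is the sort of thing I was wondering about. It's a sketch assuming the optimum package and its ORTModelForCausalLM export flow work the way I think they do; I haven't verified this is the recommended path:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Export gpt2 to ONNX and run it through ONNX Runtime instead of plain PyTorch
ort_model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)
ort_tokenizer = AutoTokenizer.from_pretrained("gpt2")
ort_generator = pipeline("text-generation", model=ort_model, tokenizer=ort_tokenizer)
ort_generator("My tart needs some", do_sample=False)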