I am running inference with a fine-tuned distilgpt2 model. When I use the text-generation pipeline with accelerator="bettertransformer", generation is actually slower than with the plain model.
Here is the relevant code I am using:
import time

import torch
from optimum.pipelines import pipeline
from transformers import AutoModelForCausalLM, GPT2TokenizerFast

model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    device_map="auto",
)
tokenizer = GPT2TokenizerFast.from_pretrained(args.tokenizer_path)
tokenizer.add_special_tokens({"eos_token": "[EOS]"})
inputs = tokenizer(args.prompt, return_tensors="pt").to("cuda")  # not used below; the pipeline tokenizes the prompt itself

if args.total_fingerprints is None:
    args.total_fingerprints = args.num_sequences
total_iterations = args.total_fingerprints // args.num_sequences

start = time.time()
n_tokens = 0
print("Generating...")
output_list = []
for _ in range(total_iterations):
    with torch.inference_mode():
        # Restrict SDPA to the math backend so both runs use the same kernel.
        with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
            generator = pipeline("text-generation", model=model, tokenizer=tokenizer, accelerator="bettertransformer")
            outputs = generator(
                args.prompt,
                do_sample=True,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.eos_token_id,
                bad_words_ids=bad_words,  # defined earlier in the script
                num_return_sequences=args.num_sequences,
                max_length=args.max_length,
                top_k=args.top_k,
                top_p=args.top_p,
            )
            # outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    for output in outputs:
        output = output["generated_text"]
        output_list.append(output)
        n_tokens += len(tokenizer.encode(output))
        print(output)
        print("-" * 100)

duration = (time.time() - start) * 1000
print(f"Generated {n_tokens} tokens")
print(f"Took {duration / n_tokens}ms per token")
print(f"Generated {len(output_list)} sequences")
print(f"Took {duration / len(output_list)}ms per sequence")
With accelerate="bettertransformer"
:
Generated 67861 tokens
Took 0.5158231439013002ms per token
Generated 100 sequences
Took 350.0427436828613ms per sequence
And without BetterTransformer:
Generated 67844 tokens
Took 0.46539602537747227ms per token
Generated 100 sequences
Took 315.7432794570923ms per sequence
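For reference, here is a sketch of how I could time model.generate() directly, taking the pipeline out of the comparison entirely (the token accounting ignores padding, so it is approximate):

import time
import torch

inputs = tokenizer(args.prompt, return_tensors="pt").to(model.device)

torch.cuda.synchronize()  # flush pending GPU work before starting the clock
start = time.time()
with torch.inference_mode():
    out = model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=args.num_sequences,
        max_length=args.max_length,
        pad_token_id=tokenizer.eos_token_id,
    )
torch.cuda.synchronize()  # wait until generation actually finishes
elapsed_ms = (time.time() - start) * 1000

# Count only newly generated positions (prompt tokens are repeated per sequence).
n_new = out.numel() - inputs["input_ids"].numel() * args.num_sequences
print(f"{elapsed_ms / n_new:.3f} ms per new token")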
Am I doing something wrong, or is a speedup not always expected with BetterTransformer?