Using BetterTransformer is slower than not using it

I am using a fine-tuned distilgpt2 model for inference. When I use the text-generation pipeline with accelerator="bettertransformer", generation is actually slower than without it.
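For context, my understanding is that accelerator="bettertransformer" on the pipeline does roughly the same thing as converting the model directly with optimum, along these lines (a minimal sketch, in case the conversion step itself matters):

from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
# Swap supported modules for their BetterTransformer equivalents
model = BetterTransformer.transform(model, keep_original_model=False)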

Here is the relevant code I am using:

import time

import torch
from transformers import AutoModelForCausalLM, GPT2TokenizerFast

model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    device_map="auto",
)

tokenizer = GPT2TokenizerFast.from_pretrained(args.tokenizer_path)
tokenizer.add_special_tokens({"eos_token": "[EOS]"})

# Not actually used below; the pipeline tokenizes args.prompt itself
inputs = tokenizer(args.prompt, return_tensors="pt").to("cuda")

from optimum.pipelines import pipeline

if args.total_fingerprints is None:
    args.total_fingerprints = args.num_sequences

total_iterations = args.total_fingerprints // args.num_sequences

start = time.time()
n_tokens = 0
print("Generating...")
output_list = []
for _ in range(total_iterations):
    with torch.inference_mode():
        # Restrict scaled-dot-product attention to the math kernel
        with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
            # Note: the pipeline (and hence the BetterTransformer conversion) is rebuilt on every iteration
            generator = pipeline("text-generation", model=model, tokenizer=tokenizer, accelerator="bettertransformer")

            outputs = generator(
                args.prompt,
                do_sample=True,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.eos_token_id,
                bad_words_ids=bad_words,  # defined earlier, omitted here
                num_return_sequences=args.num_sequences,
                max_length=args.max_length,
                top_k=args.top_k,
                top_p=args.top_p,
            )

            for output in outputs:
                text = output["generated_text"]
                output_list.append(text)
                n_tokens += len(tokenizer.encode(text))
                print(text)
                print("-" * 100)

duration = (time.time() - start) * 1000
print(f"Generated {n_tokens} tokens")
print(f"Took {duration / n_tokens}ms per token")
print(f"Generated {len(output_list)} sequnces")
print(f"Took {duration / len(output_list)}ms per sequence")

With accelerator="bettertransformer":

Generated 67861 tokens
Took 0.5158231439013002ms per token
Generated 100 sequences
Took 350.0427436828613ms per sequence

And without BetterTransformer:

Generated 67844 tokens
Took 0.46539602537747227ms per token
Generated 100 sequences
Took 315.7432794570923ms per sequence
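One detail I'm second-guessing: my sdp_kernel context manager only allows the math kernel. If BetterTransformer's gains come from the fused SDPA kernels, maybe I should be allowing those instead, e.g.:

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=True):
    outputs = generator(...)  # same generator call as above

I left that part of my code unchanged between the two runs, so both timings use the math kernel.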

Am I doing something wrong, or is a speedup not always expected?