Fine-tuned model takes about twice as long at inference

I am fine-tuning XGLM-7.5B as a translation model using transformers 4.26.0 and deepspeed 0.8.0. Generating output for 1000 sentences with the original XGLM model downloaded from Hugging Face takes about 8 minutes. With the fine-tuned model, however, the same generation takes about 15 minutes. Could anyone tell me how to fix this problem?
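Wall-clock time alone cannot show whether each decoding step got slower or whether the fine-tuned model simply produces longer outputs. A minimal sketch for telling the two apart is below. The helpers `count_new_tokens` and `timed_generation` are hypothetical names, not part of transformers, and the sketch assumes left padding, so every prompt in a batch ends at the same index.

```python
import time


def count_new_tokens(sequences, prompt_len, pad_token_id):
    """Count non-padding tokens produced after the prompt.

    sequences: list of token-id lists, e.g. generated_ids.tolist()
    prompt_len: padded prompt length, e.g. input_ids.shape[1]
    """
    return sum(
        sum(1 for tok in seq[prompt_len:] if tok != pad_token_id)
        for seq in sequences
    )


def timed_generation(generate_fn, prompt_len, pad_token_id):
    """Call generate_fn() once and report seconds, new tokens, and tokens/second."""
    start = time.perf_counter()
    sequences = generate_fn()  # must return plain lists of token ids
    elapsed = time.perf_counter() - start
    new_tokens = count_new_tokens(sequences, prompt_len, pad_token_id)
    rate = new_tokens / elapsed if elapsed > 0 else float("inf")
    return {"seconds": elapsed, "new_tokens": new_tokens, "tokens_per_second": rate}
```

One possible way to use it is to pass `lambda: model.generate(**inputs).tolist()` as `generate_fn` for each model on the same batch, where `inputs` stands for the tokenized batch. If tokens per second is similar but `new_tokens` is roughly doubled, the fine-tuned model is generating longer outputs. If tokens per second itself has dropped, it may be worth comparing `model.config.use_cache` between the two checkpoints, since a disabled cache is one possible cause of slower decoding.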

Code snippets that may be related:
Loading the model (either pretrained or fine-tuned):

tokenizer.padding_side = 'left'
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path).cuda().half()


with torch.no_grad():
    generated_ids = model.generate(