I am finetuning XGLM-7.5B as a translation model using transformers 4.26.0 and DeepSpeed 0.8.0. Generating translations for 1000 sentences with the original XGLM checkpoint downloaded from Hugging Face takes about 8 minutes. However, the finetuned model takes about 15 minutes for the same set. Could anyone tell me how to fix this problem?
Code snippets that may be related:
Loading the model (either pretrained or finetuned):

```python
tokenizer.padding_side = 'left'
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path).cuda().half()
```
Generation:

```python
with torch.no_grad():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=100,
        num_beams=4,
        early_stopping=True,
    )
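For reference, here is a simplified sketch of the timing loop I use (the `generation_stats` helper and its batching are illustrative, not my exact script). It also reports the average number of newly generated tokens per sentence, since the finetuned model producing longer outputs before emitting EOS would by itself explain most of the time difference with `num_beams=4`:

```python
import time

import torch


def generation_stats(model, tokenizer, sentences, batch_size=8, max_new_tokens=100):
    """Return (avg new tokens per sentence, total wall-clock seconds).

    If the finetuned model rarely emits EOS, every beam runs to the full
    max_new_tokens budget, which alone can roughly double generation time
    relative to the pretrained checkpoint.
    """
    total_new = 0
    start = time.time()
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        encoding = tokenizer(batch, return_tensors="pt", padding=True)
        encoding = {k: v.to(model.device) for k, v in encoding.items()}
        with torch.no_grad():
            out = model.generate(
                **encoding,
                max_new_tokens=max_new_tokens,
                num_beams=4,
                early_stopping=True,
            )
        # With left padding, output width = prompt width + generated width,
        # so the per-sample count of new tokens is the width difference.
        total_new += (out.shape[1] - encoding["input_ids"].shape[1]) * len(batch)
    return total_new / len(sentences), time.time() - start
```

If the finetuned model's average output length is much higher, the slowdown is a decoding-behavior issue (EOS handling) rather than a model-loading or precision issue.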