I am finetuning XGLM-7.5B as a translation model using transformers 4.26.0 and DeepSpeed 0.8.0. Generating translations for 1000 sentences with the original XGLM checkpoint downloaded from Hugging Face takes about 8 minutes. However, the finetuned model takes about 15 minutes for the same set. Could anyone tell me how to fix this problem?
Code snippets that may be related:
Loading the model (either pretrained or finetuned):

```python
tokenizer.padding_side = 'left'
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path).cuda().half()
```
Generation:

```python
with torch.no_grad():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=100,
        num_beams=4,
        early_stopping=True,
    )
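For reference, here is a simplified sketch of the timing loop I use (the `generation_stats` helper and its batching are illustrative, not my exact script). It also reports the average number of newly generated tokens per sentence, since the finetuned model producing longer outputs before emitting EOS would by itself explain most of the time difference with `num_beams=4`:

```python
import time

import torch


def generation_stats(model, tokenizer, sentences, batch_size=8, max_new_tokens=100):
    """Return (avg new tokens per sentence, total wall-clock seconds).

    If the finetuned model rarely emits EOS, every beam runs to the full
    max_new_tokens budget, which alone can roughly double generation time
    relative to the pretrained checkpoint.
    """
    total_new = 0
    start = time.time()
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        encoding = tokenizer(batch, return_tensors="pt", padding=True)
        encoding = {k: v.to(model.device) for k, v in encoding.items()}
        with torch.no_grad():
            out = model.generate(
                **encoding,
                max_new_tokens=max_new_tokens,
                num_beams=4,
                early_stopping=True,
            )
        # With left padding, output width = prompt width + generated width,
        # so the per-sample count of new tokens is the width difference.
        total_new += (out.shape[1] - encoding["input_ids"].shape[1]) * len(batch)
    return total_new / len(sentences), time.time() - start
```

If the finetuned model's average output length is much higher, the slowdown is a decoding-behavior issue (EOS handling) rather than a model-loading or precision issue.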