I am finetuning XGLM-7.5B as a translation model using transformers 4.26.0 and DeepSpeed 0.8.0. Generating translations for 1000 sentences with the original XGLM checkpoint downloaded from Hugging Face takes about 8 minutes. However, the finetuned model takes about 15 minutes for the same set. Could anyone tell me how to fix this problem?
Code snippets that may be related:
Loading the model (either pretrained or finetuned):

```python
tokenizer.padding_side = 'left'
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path).cuda().half()
```
Generation:

```python
with torch.no_grad():
    generated_ids = model.generate(
        **encoding,
        max_new_tokens=100,
        num_beams=4,
        early_stopping=True,
    )
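For reference, here is a simplified sketch of the timing loop I use (the `generation_stats` helper and its batching are illustrative, not my exact script). It also reports the average number of newly generated tokens per sentence, since the finetuned model producing longer outputs before emitting EOS would by itself explain most of the time difference with `num_beams=4`:

```python
import time

import torch


def generation_stats(model, tokenizer, sentences, batch_size=8, max_new_tokens=100):
    """Return (avg new tokens per sentence, total wall-clock seconds).

    If the finetuned model rarely emits EOS, every beam runs to the full
    max_new_tokens budget, which alone can roughly double generation time
    relative to the pretrained checkpoint.
    """
    total_new = 0
    start = time.time()
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        encoding = tokenizer(batch, return_tensors="pt", padding=True)
        encoding = {k: v.to(model.device) for k, v in encoding.items()}
        with torch.no_grad():
            out = model.generate(
                **encoding,
                max_new_tokens=max_new_tokens,
                num_beams=4,
                early_stopping=True,
            )
        # With left padding, output width = prompt width + generated width,
        # so the per-sample count of new tokens is the width difference.
        total_new += (out.shape[1] - encoding["input_ids"].shape[1]) * len(batch)
    return total_new / len(sentences), time.time() - start
```

If the finetuned model's average output length is much higher, the slowdown is a decoding-behavior issue (EOS handling) rather than a model-loading or precision issue.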