My finetuned Donut model takes 4 minutes 37 seconds for a single inference on my local Windows laptop (16 GB RAM, 4 cores). However, on a Google Colab CPU machine with 32 GB RAM, inference takes under 5 seconds, and on a Colab GPU it takes under a second.
Why is it so slow on my local Windows machine? This doesn't seem like normal behavior. Could someone help me figure out what could be wrong here?
I am using Transformers version 4.28.1 on Colab, and the same version on my Windows machine.
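In case it helps with diagnosing this, here's the kind of environment check I can run on both machines (a quick sketch; the set_num_threads line is just something I could try, not something I've confirmed helps):

import torch

print(torch.__version__)                  # PyTorch build on each machine
print(torch.get_num_threads())            # intra-op CPU threads PyTorch will use
print(torch.backends.mkl.is_available())  # whether the build can use MKL on CPU
# torch.set_num_threads(4)  # optionally pin threads to the physical core count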
Below is the prediction function I am using; it's the model.generate call that takes all the time.
import re

def run_prediction(image):
    # Preprocess the image into the pixel tensor the Donut encoder expects
    pixel_values = processor(image, return_tensors="pt").pixel_values
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,  # greedy decoding, no beam search
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )
    # Decode the generated token ids and strip the special tokens
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
    return processor.token2json(sequence)
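For context, this is roughly how model, processor, decoder_input_ids, and device are initialized beforehand (a minimal sketch; the checkpoint path and task prompt below are placeholders, not my actual values):

import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# "path/to/finetuned-donut" is a placeholder for my finetuned checkpoint
processor = DonutProcessor.from_pretrained("path/to/finetuned-donut")
model = VisionEncoderDecoderModel.from_pretrained("path/to/finetuned-donut")
model.to(device)
model.eval()

# "<s_cord-v2>" is a placeholder; the real task start token depends on the finetuning task
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids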