Finetuned Donut model taking too much time for inference on local machine, around 5 minutes

My finetuned Donut model is taking 4 minutes 37 seconds for inference on my local Windows laptop, which has 16GB RAM and 4 cores. However, inference time is under 5 seconds on a Google Colab CPU machine with 32GB RAM. On a Colab GPU, inference time is under a second.

Why is it taking so much time on my local Windows machine? This doesn’t seem like normal behavior. Could someone help and guide me on what could be wrong here?

I am using Transformers version 4.28.1, and it’s the same version on my Windows machine as well.

Also, below is the prediction function I am using, and it’s the model.generate method that is taking the time.

import re  # needed for the regex cleanup below

def run_prediction(image):
    # Preprocess the input image into pixel values for the Donut encoder
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # Autoregressive decoding; this is the call that takes ~4.5 minutes locally
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True)

    # Decode the generated token ids, strip special tokens, and remove the first task start token
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
    return processor.token2json(sequence)
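
For context, processor, model, device, and decoder_input_ids are set up earlier in my script, roughly as follows (the checkpoint path and task prompt below are placeholders; the actual values come from my fine-tuning setup):

import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Placeholder path to my fine-tuned Donut checkpoint
processor = DonutProcessor.from_pretrained("path/to/finetuned-donut")
model = VisionEncoderDecoderModel.from_pretrained("path/to/finetuned-donut")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Placeholder task start token; the real one comes from the fine-tuning setup
task_prompt = "<s_my-task>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids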

I’d first debug whether it’s due to the generate() method or the token2json() method (you can leverage Python’s time module for that).
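
For example, something along these lines (assuming the same processor, model, device, and decoder_input_ids as in your run_prediction above) would show where the time is going:

import time

def run_prediction_timed(image):
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # Time the generation step
    t0 = time.perf_counter()
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True)
    print(f"generate(): {time.perf_counter() - t0:.2f}s")

    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()

    # Time the JSON conversion step
    t0 = time.perf_counter()
    result = processor.token2json(sequence)
    print(f"token2json(): {time.perf_counter() - t0:.2f}s")
    return result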

I have checked; it’s the generate() method that is taking the time.

Hi @shubh1608: did you find the cause/fix for this problem?