Running out of memory at inference

Hi,

I finetuned a paraphraser model, based on https://huggingface.co/tuner007/pegasus_paraphrase

However, when I try to make predictions, I get a CUDA out-of-memory error. I’m using the same server I used to train the model, so the hardware itself should be OK.

I’m trying to make predictions on a list of about 150 sentences, and the whole list is passed to the model in one go. So I’m assuming it is running out of memory because the input needs batching?

Is there a way to specify batch size at inference?

I’m using this function to generate output:

def get_responses(input_text_list, num_return_sequences):
    # tokenize the whole input list and move the tensors to the GPU
    batch = tokenizer(
        input_text_list,
        truncation=True,
        padding='longest',
        max_length=60,
        return_tensors="pt"
    ).to(device)

    # num_beams is set to num_return_sequences for variety
    translated = model.generate(
        **batch,
        max_length=60,
        num_beams=num_return_sequences,
        num_return_sequences=num_return_sequences,
        temperature=1.5)

    # decode the full output list of all results
    output_list = tokenizer.batch_decode(translated, skip_special_tokens=True)

    return output_list

Is there a proper way to split the list into smaller batches? Or do I have to write separate code to break the list of incoming values into batches myself?
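If I do have to roll my own, this is roughly what I had in mind (just a rough sketch, not tested; it reuses my get_responses function above, and the chunk_size of 16 is a guess I’d tune to whatever fits in GPU memory):

    def get_responses_batched(input_text_list, num_return_sequences, chunk_size=16):
        # run get_responses on slices of the input so only chunk_size
        # sentences are on the GPU at a time
        output_list = []
        for start in range(0, len(input_text_list), chunk_size):
            chunk = input_text_list[start:start + chunk_size]
            output_list.extend(get_responses(chunk, num_return_sequences))
        return output_list

But if there is a built-in way to specify a batch size at inference, I’d rather use that than write my own wrapper.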