Hi,
I fine-tuned a paraphraser model based on https://huggingface.co/tuner007/pegasus_paraphrase
However, when I try to make predictions, I get a CUDA out-of-memory error. I’m using the same server I used to train the model, so the hardware itself should be fine.
I’m trying to make predictions on a list of about 150 sentences, so I’m assuming it runs out of memory because the whole list is processed at once and needs to be batched.
Is there a way to specify a batch size at inference?
I’m using this function to generate output:
def get_responses(input_text_list, num_return_sequences):
    # tokenize the whole input list at once and move the tensors to the GPU
    batch = tokenizer(
        input_text_list,
        truncation=True,
        padding='longest',
        max_length=60,
        return_tensors="pt",
    ).to(device)
    translated = model.generate(
        **batch,
        max_length=60,
        num_beams=num_return_sequences,  # use same as number of sequences for variety
        num_return_sequences=num_return_sequences,
        temperature=1.5,
    )
    # decode the generated ids into the full output list of all results
    output_list = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return output_list
Is there a proper way to split the list into smaller batches, or do I have to write separate code to chunk the incoming list myself?
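For reference, this is the kind of manual chunking I have in mind; just a sketch that wraps the function above, and the batch_size of 16 is an arbitrary guess:

def get_responses_batched(input_text_list, num_return_sequences, batch_size=16):
    # call get_responses() on fixed-size chunks so only batch_size
    # sentences are tokenized and generated at a time
    output_list = []
    for start in range(0, len(input_text_list), batch_size):
        chunk = input_text_list[start:start + batch_size]
        output_list.extend(get_responses(chunk, num_return_sequences))
    return output_list

Is that the recommended approach, or is there a built-in option I’m missing?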