GPU usage increasing every loop when running inference

I have been trying to run inference with Llama 2-13B on Colab using the following code.

with torch.no_grad():
    for context, multi, single, second in zip(all_items, multi_hop_items, single_hop_items, second_hop_items):
        # prompt_multi is built from the loop items earlier in the loop body (omitted here)
        inputs_multi = tokenizer(prompt_multi, return_tensors="pt").to("cuda")
        generated_ids_multi = model.generate(**inputs_multi, max_length=4096)
        outputs_multi = tokenizer.batch_decode(generated_ids_multi, skip_special_tokens=True)
        answer_multi = outputs_multi[0]

GPU RAM increases on pretty much every loop iteration. And if I interrupt the cell mid-run, the allocated GPU memory stays at that level, so I run out of memory very quickly. I have also tried running a pipeline over a dataset, following the Hugging Face documentation, and I see a very similar issue there as well.
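For reference, here is a sketch of the same loop with explicit cleanup between iterations, in case that helps pinpoint where memory is being held. The prompt construction is an assumption; the cleanup steps are just dropping the CUDA tensor references, then calling gc.collect() and torch.cuda.empty_cache().

import gc
import torch

with torch.no_grad():
    for context, multi, single, second in zip(all_items, multi_hop_items, single_hop_items, second_hop_items):
        prompt_multi = f"{context}\n{multi}"  # assumption: prompt built from the loop items
        inputs_multi = tokenizer(prompt_multi, return_tensors="pt").to("cuda")
        generated_ids_multi = model.generate(**inputs_multi, max_length=4096)

        # Keep only the decoded Python string, not the CUDA tensors
        answer_multi = tokenizer.batch_decode(generated_ids_multi, skip_special_tokens=True)[0]

        # Drop the tensor references so the allocator can reuse the memory
        del inputs_multi, generated_ids_multi
        gc.collect()
        torch.cuda.empty_cache()

Even with this, nvidia-smi can still show high "memory used" because PyTorch's caching allocator holds on to freed blocks; the question is whether it keeps growing across iterations.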


I am having a similar issue and do not have a solution.
I use torch.inference_mode() together with torch.no_grad().
Using the Hugging Face Transformers library, I start a thread that calls model.generate for a transformer LLM.
After inference is complete, my GPU monitor shows utilization staying above 98% for the next 4-10 seconds before returning to 0%.
Also, consecutive inferences severely throttle generation: each one takes roughly 10x the usual generation time.
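Roughly, my setup looks like the sketch below. The TextIteratorStreamer and the generation arguments are placeholders; the relevant parts are the background thread and inference_mode.

import threading
import torch
from transformers import TextIteratorStreamer

def generate_in_thread(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Streamer lets the main thread consume tokens while generate runs in the background
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    def _run():
        # Gradients disabled for the whole generation call
        with torch.inference_mode():
            model.generate(**inputs, streamer=streamer, max_new_tokens=256)

    thread = threading.Thread(target=_run)
    thread.start()
    text = "".join(streamer)  # blocks until generation finishes
    thread.join()
    return text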

same problem