I have been trying to run inference with Llama 2-13B on Colab using the following code, and I keep running into GPU memory problems.
# model and tokenizer are loaded earlier and the model is on the GPU
with torch.no_grad():
    for context, multi, single, second in zip(all_items, multi_hop_items, single_hop_items, second_hop_items):
        # prompt_multi is built from the items above (construction omitted here)
        inputs_multi = tokenizer(prompt_multi, return_tensors="pt").to("cuda")
        generated_ids_multi = model.generate(**inputs_multi, max_length=4096)
        outputs_multi = tokenizer.batch_decode(generated_ids_multi, skip_special_tokens=True)
        answer_multi = outputs_multi[0]
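
To see what is happening, I check the allocated and reserved GPU memory at the end of every iteration, roughly like this (just a sketch; torch.cuda.memory_allocated and torch.cuda.memory_reserved are standard PyTorch calls, and the exact placement is only illustrative):

        # at the end of each loop iteration above
        alloc_gib = torch.cuda.memory_allocated() / 1024**3
        reserved_gib = torch.cuda.memory_reserved() / 1024**3
        print(f"allocated: {alloc_gib:.2f} GiB, reserved: {reserved_gib:.2f} GiB")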
GPU RAM usage increases on pretty much every iteration, and once it has grown, even if I interrupt the cell mid-run, it stays at that level, so I run out of memory very quickly. I have also tried running a pipeline over a dataset, following the Hugging Face docs, and I seem to hit a very similar issue there as well.
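
For reference, that pipeline attempt looked roughly like this (a sketch rather than my exact script: the model id, the toy dataset, and the "prompt" column name are placeholders, and KeyDataset comes from transformers.pipelines.pt_utils as in the docs):

import torch
from datasets import Dataset
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

# Toy dataset with one prompt per row (placeholder contents)
ds = Dataset.from_dict({"prompt": ["question 1 ...", "question 2 ..."]})

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-13b-hf",  # placeholder model id
    device_map="auto",
    torch_dtype=torch.float16,
)

# Iterating the pipeline over a KeyDataset streams the prompts one by one
for out in pipe(KeyDataset(ds, "prompt"), max_new_tokens=256, batch_size=1):
    answer = out[0]["generated_text"]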