I have been trying to run inference with Llama 2-13B on Colab using the following code, and I keep running into GPU memory problems.
# model and tokenizer are loaded earlier and the model is on the GPU
with torch.no_grad():
    for context, multi, single, second in zip(all_items, multi_hop_items, single_hop_items, second_hop_items):
        # prompt_multi is built from the items above (construction omitted here)
        inputs_multi = tokenizer(prompt_multi, return_tensors="pt").to("cuda")
        generated_ids_multi = model.generate(**inputs_multi, max_length=4096)
        outputs_multi = tokenizer.batch_decode(generated_ids_multi, skip_special_tokens=True)
        answer_multi = outputs_multi[0]
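
To see what is happening, I check the allocated and reserved GPU memory at the end of every iteration, roughly like this (just a sketch; torch.cuda.memory_allocated and torch.cuda.memory_reserved are standard PyTorch calls, and the exact placement is only illustrative):

        # at the end of each loop iteration above
        alloc_gib = torch.cuda.memory_allocated() / 1024**3
        reserved_gib = torch.cuda.memory_reserved() / 1024**3
        print(f"allocated: {alloc_gib:.2f} GiB, reserved: {reserved_gib:.2f} GiB")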
GPU RAM usage increases on pretty much every iteration, and once it has grown, even if I interrupt the cell mid-run, it stays at that level, so I run out of memory very quickly. I have also tried running a pipeline over a dataset, following the Hugging Face docs, and I seem to hit a very similar issue there as well.
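
For reference, that pipeline attempt looked roughly like this (a sketch rather than my exact script: the model id, the toy dataset, and the "prompt" column name are placeholders, and KeyDataset comes from transformers.pipelines.pt_utils as in the docs):

import torch
from datasets import Dataset
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

# Toy dataset with one prompt per row (placeholder contents)
ds = Dataset.from_dict({"prompt": ["question 1 ...", "question 2 ..."]})

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-13b-hf",  # placeholder model id
    device_map="auto",
    torch_dtype=torch.float16,
)

# Iterating the pipeline over a KeyDataset streams the prompts one by one
for out in pipe(KeyDataset(ds, "prompt"), max_new_tokens=256, batch_size=1):
    answer = out[0]["generated_text"]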