Unable to free all GPU memory even after ``del var; gc.collect(); empty_cache()``

I am running inference on decoder-based models. The issue is that I am deleting all tensors and models, but GPU memory is still not freed. I suspect something inside the .generate() method is holding on to the memory. Here is the inference loop; tensors are not created anywhere else (except for model initialization):

import gc
import torch
from torch.cuda import empty_cache
from tqdm import tqdm

# data, batch_size, tokenizer, model, device, failed, outputs (and idx) are defined earlier
for iteration_idx in tqdm(range(len(data)//batch_size), total=len(data)//batch_size):

    batch = get_batch(data)
    encodeds = tokenizer.apply_chat_template(batch, return_tensors="pt", add_generation_prompt=True, padding=True)
    model_inputs = encodeds.to(device)

    try:
        with torch.no_grad():
            generated_ids = model.generate(model_inputs, max_new_tokens=5000, do_sample=False, temperature=0, use_cache=False)
    except Exception as e2:
        # usually an OOM from generate; clean up and move on to the next batch
        print(e2)
        failed.append([e2, iteration_idx, idx])
        model_inputs = model_inputs.cpu().detach()
        del model_inputs
        gc.collect()
        empty_cache()
        continue

    model_inputs = model_inputs.cpu().detach()
    generated_ids = generated_ids.cpu().detach()
    # print(model_inputs.shape)
    decoded = tokenizer.batch_decode(generated_ids[:, model_inputs.shape[1]:], skip_special_tokens=True)

    outputs += decoded
    del generated_ids, model_inputs
    gc.collect()
    empty_cache()

After every iteration the GPU memory grows by ~2 GB and is not cleared by these calls.

EDIT: Every iteration after the first OOM adds ~2 GB of overhead that is not cleared, even after using everything shown in the code.

EDIT2: I tried deleting the model inside the except block after moving it back to .cpu(); more memory is cleared, but it still does not go down to zero.
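A minimal sketch for watching the per-iteration growth, using PyTorch's standard counters (assumes a single GPU at device index 0):

import torch

def log_cuda_memory(tag: str, device: int = 0):
    # memory_allocated: bytes held by live tensors
    # memory_reserved: bytes kept by the caching allocator (can stay high even after del)
    alloc = torch.cuda.memory_allocated(device) / 1024**3
    reserved = torch.cuda.memory_reserved(device) / 1024**3
    print(f"[{tag}] allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

# e.g. call log_cuda_memory("before cleanup") and log_cuda_memory("after cleanup")
# around the del / gc.collect() / empty_cache() lines in each iteration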


You’re properly offloading to the CPU, deleting the model before gc, and paying attention to tensor behaviour. The only way I can think of to do slightly better than this is to get the state_dict and pop everything out of it.
If there’s anything suspicious in the code, it’s tqdm. I think there was a bug where it would hold on to memory, or prevent it from being freed, under certain conditions.

If it’s not the code itself, it’s the libraries, settings, or hardware…
Those issues usually have nothing to do with freeing memory, though. For example, some BIOS settings can make multi-GPU setups malfunction, but that has nothing to do with freeing VRAM.

The OS is also a bit suspect. It’s possible that the OS keeps RAM reserved because it will be used again later, like page files.

Thanks for the response. Appreciated.

I tried the following:

  1. Removed tqdm, but nothing changed.

  2. Loaded the model on the CPU and checked the state_dict; everything was offloaded onto the CPU, as expected. I didn’t get the meaning of popping from the state_dict. I am in eval/no_grad mode and only performing generation, and as written above, after an OOM in generation the memory increases by ~2 GB per iteration.

EDIT: I re-initialized the model in the same except block after deleting it in the loop as shown, and tried to delete it again to see whether the rogue memory is held by the model variable or by some other tensor. But as soon as I hit run after model = model.cpu(), it runs for a few seconds and then the process gets killed automatically.

EDIT2: I replaced the old model variable by transferring it to .cpu(), reassigning it, and moving it back to the GPU. When I then tried to move it back to the CPU, the process got killed again.
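Roughly what the two edits above attempted inside the except block (a sketch; the loading call and model_name are placeholders, not the exact code):

from transformers import AutoModelForCausalLM  # assuming a transformers model; adjust to the real loader

# inside the except block of the loop above; model, device and model_name come from the outer scope
model = model.cpu()      # the process tends to get killed right after this line
del model
gc.collect()
torch.cuda.empty_cache()

# re-initialize on the CPU, then move back to the GPU
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)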

Torch tensors, as you know, are very persistent and stubborn: if even a small reference to one is left anywhere, the tensor won’t go away. But I don’t think there are any remaining references in your code.

I didn’t get the meaning of popping from state_dict.

Since a state_dict can literally be treated as a Python dictionary, each tensor will eventually disappear if you .pop() its key and discard the returned value. I don’t know how the internal implementation works either, but I use this method when I want to shave off every last bit of RAM consumption as quickly as possible. RAM usage actually drops immediately.
However, this is usually completely unnecessary unless you are in a RAM-starved environment… but in this case it may be some kind of bug, so the operation may make sense.

Here is a version where I tried to erase as much as I could. Also, since the assignment statements might have been creating new instances, I removed them:

for iteration_idx in tqdm(range(len(data)//batch_size), total=len(data)//batch_size):
    
    batch = get_batch(data)
    encodeds = tokenizer.apply_chat_template(batch, return_tensors="pt", add_generation_prompt=True, padding=True)
    model_inputs = encodeds.to(device)

    try:
        with torch.no_grad():
            generated_ids = model.generate(model_inputs, max_new_tokens=5000, do_sample=False, temperature=0, use_cache=False)
    except Exception as e2:
        print(e2)
        failed.append([e2, iteration_idx, idx])
        model_inputs = model_inputs.cpu().detach()
        del model_inputs
        gc.collect()
        empty_cache()
        continue

    encodeds.cpu()        # .cpu() calls deliberately left un-assigned (see note above)
    model_inputs.cpu()
    generated_ids.cpu()
    # print(model_inputs.shape)
    decoded = tokenizer.batch_decode(generated_ids[:, model_inputs.shape[1]:], skip_special_tokens=True)
    
    outputs += decoded
    del generated_ids, model_inputs, encodeds
    gc.collect()
    empty_cache()

Thanks man!! I’ve tried this code but no luck. Shall I post this on their GitHub issues?

One more query: I am aware of a model’s state_dict, but what should I pop? I haven’t really performed any operations on the state_dict. Can you share some code showing how you freed every bit of RAM?

UPDATE: I created a dummy script where I load the same model plus a very big tensor (to force OOM). I move both to the GPU and then move the model back to the CPU (not deleting it, just calling empty_cache() afterwards), and do the same for the big tensor.

In that case everything works fine and all of the memory is freed, but not with the code above in this thread.
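A sketch of what the dummy script does (the model name and tensor size are placeholders, not the real values):

import gc
import torch
from transformers import AutoModelForCausalLM

device = "cuda"

model = AutoModelForCausalLM.from_pretrained("some-decoder-model").to(device)  # placeholder name
big_tensor = torch.empty(8, 4096, 4096, device=device)  # large dummy allocation

print("after load:", torch.cuda.memory_allocated() / 1024**3, "GiB")

# move both back to the CPU without deleting them, then release cached blocks
model = model.cpu()
big_tensor = big_tensor.cpu()
gc.collect()
torch.cuda.empty_cache()

print("after offload:", torch.cuda.memory_allocated() / 1024**3, "GiB")  # close to zero here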

Shall I post this on their GitHub issues?

I don’t have a GitHub account, so if the original code is out there somewhere, I’d appreciate you letting the author know.
If their environment has plenty of memory, or is virtualized, it’s hard for them to notice memory-release trouble.

Can you share some code of how you freed every bit of RAM?

I’m not an expert in Python or AI either, so don’t put too much trust in this; I put it together while searching around.
The way I obtain the state_dict from the model may be wrong, or the function may differ, so be careful there. Usually this is done for state dicts loaded with the pytorch or safetensors libraries.

import gc
import torch

def clear_sd(sd: dict):
    # pop every key so each tensor loses its last reference
    for k in list(sd.keys()):
        sd.pop(k)
    del sd
    torch.cuda.empty_cache()
    gc.collect()

state_dict = model.state_dict()  # available on any torch.nn.Module and its subclasses
clear_sd(state_dict)

Thanks man for the help, I need to figure this thing out. What I am experiencing is that with a single batch there is no OOM, which makes me wonder whether something inside generate needs to be un-cached.
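A small sketch that might help narrow this down: list every CUDA tensor that is still alive after the cleanup calls (plain gc introspection, nothing specific to generate):

import gc
import torch

# run right after del + gc.collect() + empty_cache();
# anything printed here is what is keeping memory allocated
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) and obj.is_cuda:
            print(type(obj).__name__, tuple(obj.shape), obj.dtype)
    except Exception:
        pass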

Appreciate your help

As for state_dict(): those are the model weights, and by deleting them you’re essentially deleting the weights, which is almost equivalent to del model. Since you are on the CPU, you can see the complete effect and there is no blockage, AFAIK.

If I find a solution to this, I’ll update.


It might be some kind of big bug, so good luck.
I think there’s something weird in the design; if you were a library author, for example, you wouldn’t want this kind of spec.

When I use CUDA, I also offload to the CPU and then delete the model. So why not try putting that in that section?
I have no idea why, but even when a tensor won’t disappear with just del, it disappears relatively quickly this way.
This is just a bad bit of know-how I picked up from parts of HF’s Diffusers library.

The fact that such bad know-how exists means there is, or used to be, a problem somewhere that can’t be solved by regular means.
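A minimal sketch of that offload-then-delete order (assuming model is a torch module currently on the GPU; not the exact Diffusers code):

import gc
import torch

model = model.cpu()          # offload the weights to the CPU first
del model                    # then drop the last reference

gc.collect()
torch.cuda.empty_cache()     # finally release the cached GPU blocks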
