GPU memory not being freed between batches

Hi,
While training a BERT model for masked language modelling, I am getting a CUDA OOM error. I cannot train for more than one batch, even with a batch size of 1. I used torch.cuda.memory_summary() to monitor GPU usage between batches. Below are my code and the memory output.
I want to know why, despite deleting the variables, the current memory usage is still so high. Are there other variables still connected to the computational graph? How can I free that memory?
Code:

import torch
from tqdm import tqdm

# model, optim, loader, device, epochs and apply_mask are defined elsewhere
for epoch in range(epochs):
    loop = tqdm(loader, leave=True)
    for batch in loop:  # iterate over the tqdm wrapper so the progress bar updates
        print("Memory before:")
        print(torch.cuda.memory_summary(0))

        optim.zero_grad()
        # Apply masking first, then move each tensor to the GPU
        ids = apply_mask(torch.stack(batch.input_ids).t())
        input_ids = ids.to(device)
        attention_mask = torch.stack(batch.attention_mask).t().to(device)
        labels = batch.labels.to(device)
        print("data loaded")

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        print("output")

        loss = outputs.loss
        logits = outputs.logits
        loss.backward(retain_graph=False)
        print("backprop")
        optim.step()

        # Drop references to this batch's tensors before the next iteration
        del loss, logits, outputs, input_ids, attention_mask, labels, ids

Memory output:

Memory before:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    1479 MB |    1479 MB |    1479 MB |       0 B  |
|       from large pool |    1477 MB |    1477 MB |    1477 MB |       0 B  |
|       from small pool |       2 MB |       2 MB |       2 MB |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |    1479 MB |    1479 MB |    1479 MB |       0 B  |
|       from large pool |    1477 MB |    1477 MB |    1477 MB |       0 B  |
|       from small pool |       2 MB |       2 MB |       2 MB |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    1492 MB |    1492 MB |    1492 MB |       0 B  |
|       from large pool |    1488 MB |    1488 MB |    1488 MB |       0 B  |
|       from small pool |       4 MB |       4 MB |       4 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |   12591 KB |   22137 KB |   43513 KB |   30922 KB |
|       from large pool |   10978 KB |   21346 KB |   39490 KB |   28512 KB |
|       from small pool |    1612 KB |    2021 KB |    4023 KB |    2410 KB |
|---------------------------------------------------------------------------|
| Allocations           |     204    |     204    |     204    |       0    |
|       from large pool |      26    |      26    |      26    |       0    |
|       from small pool |     178    |     178    |     178    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |     204    |     204    |     204    |       0    |
|       from large pool |      26    |      26    |      26    |       0    |
|       from small pool |     178    |     178    |     178    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       6    |       6    |       6    |       0    |
|       from large pool |       4    |       4    |       4    |       0    |
|       from small pool |       2    |       2    |       2    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       5    |       5    |       5    |       0    |
|       from large pool |       3    |       3    |       3    |       0    |
|       from small pool |       2    |       2    |       2    |       0    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|
data loaded
output
backprop
Memory before:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    5916 MB |    8107 MB |   13730 MB |    7814 MB |
|       from large pool |    5906 MB |    8102 MB |   13649 MB |    7742 MB |
|       from small pool |       9 MB |      29 MB |      81 MB |      71 MB |
|---------------------------------------------------------------------------|
| Active memory         |    5916 MB |    8107 MB |   13730 MB |    7814 MB |
|       from large pool |    5906 MB |    8102 MB |   13649 MB |    7742 MB |
|       from small pool |       9 MB |      29 MB |      81 MB |      71 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   10654 MB |   10654 MB |   10654 MB |       0 B  |
|       from large pool |   10624 MB |   10624 MB |   10624 MB |       0 B  |
|       from small pool |      30 MB |      30 MB |      30 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  767367 KB |     850 MB |    1021 MB |  278738 KB |
|       from large pool |  750674 KB |     848 MB |     925 MB |  197055 KB |
|       from small pool |   16693 KB |      21 MB |      96 MB |   81683 KB |
|---------------------------------------------------------------------------|
| Allocations           |     810    |     817    |    1671    |     861    |
|       from large pool |     104    |     106    |     136    |      32    |
|       from small pool |     706    |     712    |    1535    |     829    |
|---------------------------------------------------------------------------|
| Active allocs         |     810    |     817    |    1671    |     861    |
|       from large pool |     104    |     106    |     136    |      32    |
|       from small pool |     706    |     712    |    1535    |     829    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      27    |      27    |      27    |       0    |
|       from large pool |      12    |      12    |      12    |       0    |
|       from small pool |      15    |      15    |      15    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      26    |      48    |     404    |     378    |
|       from large pool |       7    |       8    |      10    |       3    |
|       from small pool |      19    |      42    |     394    |     375    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|
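
A rough sanity check on the numbers above: allocated memory grows from 1479 MB before the first step to 5916 MB before the second, which is exactly 4 x 1479 MB. That ratio would be consistent with the weights, their gradients, and two per-parameter optimizer buffers all staying resident after the first optim.step(), as happens with Adam-style optimizers. This is only an assumption: the post does not show how optim was created, and it also assumes that the first 1479 MB is the model weights and that gradients and optimizer buffers use the same dtype as the weights. The sketch below just does the arithmetic in plain Python; estimate_resident_mb and its parameters are hypothetical names for illustration, not part of PyTorch.

```python
def estimate_resident_mb(param_mb, optimizer_buffers_per_param=2, keep_grads=True):
    """Estimate memory (in MB) that stays allocated between training steps.

    Assumption: gradients and optimizer buffers have the same size and
    dtype as the weights. The default of 2 buffers per parameter is an
    assumption matching Adam-style optimizers (first and second moments).
    """
    grad_mb = param_mb if keep_grads else 0
    optimizer_mb = optimizer_buffers_per_param * param_mb
    return param_mb + grad_mb + optimizer_mb


# "Allocated memory" in the first summary, assumed here to be weights only
weights_mb = 1479

estimate = estimate_resident_mb(weights_mb)
print(f"Estimated resident memory: {estimate} MB")  # Estimated resident memory: 5916 MB
```

If this reading is right, the 5916 MB would be persistent training state rather than leaked batch tensors, so del on the batch variables would not be expected to lower it. The 8107 MB peak would then be the activations and temporary buffers of one forward/backward pass on top of that state.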