Memory continuously increasing during `compute_loss()`

When I define a customized Trainer with a compute_loss() override, memory keeps increasing during training:

Please note that I have commented out the @torch.no_grad() decorator and added torch.set_grad_enabled(True) inside generate(); a sketch of that change follows the trainer code below.

import torch
from torch import nn
from transformers import Trainer


class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")

        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # compute custom loss (suppose one has 3 labels with different weights)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
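
For context, the change to generate() mentioned above looks roughly like the sketch below. It only illustrates the edit (the surrounding generation code is elided), so it is not a standalone snippet:

import torch

# @torch.no_grad()                # <-- original decorator, now commented out
def generate(self, *args, **kwargs):
    torch.set_grad_enabled(True)  # <-- added line: turns gradient tracking on globally
    ...                           # the rest of the generation logic is unchanged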

Memory usage keeps increasing until a CUDA out-of-memory error occurs. These are the warnings and the final error:

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/usr4/ec523/brucejia/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 3 has a total capacty of 44.35 GiB of which 5.38 MiB is free. Including non-PyTorch memory, this process has 44.26 GiB memory in use. Of the allocated memory 43.02 GiB is allocated by PyTorch, and 933.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
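
For completeness, the trainer itself is launched roughly like this; the model, dataset, and argument values are placeholders rather than my actual configuration, but gradient checkpointing is enabled (here via TrainingArguments, though it could equally be model.gradient_checkpointing_enable()), which is where the use_cache warning above comes from:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                 # placeholder
    per_device_train_batch_size=1,    # placeholder
    gradient_checkpointing=True,      # enabled, hence the use_cache warning above
)

trainer = CustomTrainer(
    model=model,                      # the model whose generate() is modified above
    args=training_args,
    train_dataset=train_dataset,      # placeholder dataset
)
trainer.train()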

Thank you very much in advance!