Memory continuously increasing during `compute_loss()`

When I define a customized Trainer with a compute_loss() override, memory keeps increasing during training:

Please note that I have commented out the @torch.no_grad() decorator and added torch.set_grad_enabled(True) inside generate(); a sketch of that change follows the trainer code below.

import torch
from torch import nn
from transformers import Trainer


class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")

        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # compute custom loss (suppose one has 3 labels with different weights)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
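
For context, the change to generate() mentioned above looks roughly like the sketch below. It only illustrates the edit (the surrounding generation code is elided), so it is not a standalone snippet:

import torch

# @torch.no_grad()                # <-- original decorator, now commented out
def generate(self, *args, **kwargs):
    torch.set_grad_enabled(True)  # <-- added line: turns gradient tracking on globally
    ...                           # the rest of the generation logic is unchanged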

Memory usage keeps increasing until a CUDA out-of-memory error occurs. These are the warnings and the final error:

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/usr4/ec523/brucejia/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 3 has a total capacty of 44.35 GiB of which 5.38 MiB is free. Including non-PyTorch memory, this process has 44.26 GiB memory in use. Of the allocated memory 43.02 GiB is allocated by PyTorch, and 933.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
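
For completeness, the trainer itself is launched roughly like this; the model, dataset, and argument values are placeholders rather than my actual configuration, but gradient checkpointing is enabled (here via TrainingArguments, though it could equally be model.gradient_checkpointing_enable()), which is where the use_cache warning above comes from:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                 # placeholder
    per_device_train_batch_size=1,    # placeholder
    gradient_checkpointing=True,      # enabled, hence the use_cache warning above
)

trainer = CustomTrainer(
    model=model,                      # the model whose generate() is modified above
    args=training_args,
    train_dataset=train_dataset,      # placeholder dataset
)
trainer.train()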

Thank you very much in advance!