Trainer leaked memory?

sp4912 · May 5, 2023, 5:29am

I am trying to train Llama-7B with a batch size of 1 using DeepSpeed and the Hugging Face Trainer. My GPU has 48GB of memory. Training works, but afterwards 27GB of memory remains in use that was not there just before trainer.train(). Based on nvidia-smi calls inside my script, that memory is allocated exactly when trainer.train() is called.

Deleting the model, deleting the trainer, and calling torch.cuda.empty_cache() all fail to release that memory. How can I free this memory so that I can continue with the rest of my script?

vergilus · October 15, 2024, 11:31am

Did you use DeepSpeed? Try the Trainer without DeepSpeed.
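
To narrow down where the memory is held, it may help to compare what PyTorch's caching allocator reports before and after cleanup, with and without DeepSpeed. The following is only a rough sketch, not a confirmed fix: the helper names `gpu_memory_gib` and `release_cached_memory` are made up for illustration, and `trainer` and `model` in the commented usage are assumed to be the objects from the original post.

```python
import gc

try:
    import torch
except ImportError:  # allows the sketch to load on a machine without PyTorch
    torch = None


def gpu_memory_gib(device=0):
    """Return (allocated, reserved) in GiB as seen by PyTorch's caching allocator."""
    if torch is None or not torch.cuda.is_available():
        return (0.0, 0.0)
    gib = 1024 ** 3
    return (
        torch.cuda.memory_allocated(device) / gib,  # memory held by live tensors
        torch.cuda.memory_reserved(device) / gib,   # live tensors plus cached blocks
    )


def release_cached_memory(device=0):
    """Collect unreachable Python objects, then hand cached blocks back to the driver."""
    gc.collect()  # breaks reference cycles that `del` alone leaves behind
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # only frees blocks that no tensor is using
    return gpu_memory_gib(device)


# Hypothetical usage after trainer.train():
# before = gpu_memory_gib()
# del trainer, model
# after = release_cached_memory()
# print(f"allocated: {before[0]:.1f} -> {after[0]:.1f} GiB")
# print(f"reserved:  {before[1]:.1f} -> {after[1]:.1f} GiB")
```

If the allocated figure stays high after the `del` and `gc.collect()`, some object is presumably still holding references to the tensors (optimizer state or the DeepSpeed engine are possible candidates, though that is a guess). If allocated drops close to zero while nvidia-smi still shows usage, the remainder is likely cached blocks or the CUDA context, which nvidia-smi counts but which is not tensor memory.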
