Mistral-7B-v0.1 finetuning results in Out-of-Memory after some iterations

I am trying to LoRA-finetune mistralai/Mistral-7B-v0.1 on the C4 dataset using an NVIDIA A40 GPU with 40GB of memory. The problem is that PyTorch raises an out-of-memory error, but only after 3 successful training iterations, even though nothing else happens in between - no additional evaluation. I would expect memory usage to stay constant across iterations. Am I doing something wrong, or is this a potential memory leak?

Regarding my setup:
I am using batch size 1, AdamW, gradient checkpointing, and gradient accumulation steps of 1. A simplified sketch of my loop is below.
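Roughly, my setup is equivalent to this sketch (LoRA rank, target modules, sequence length, and learning rate here are illustrative placeholders, not my exact values):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

# LoRA adapters (rank/targets are placeholder values)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

model.gradient_checkpointing_enable()
model.enable_input_require_grads()  # so checkpointed activations get grads while the base weights stay frozen

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4
)

# Stream C4 so the dataset is not loaded into memory up front
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

model.train()
for step, example in enumerate(stream):
    enc = tokenizer(example["text"], return_tensors="pt", truncation=True, max_length=512)
    enc = {k: v.to("cuda") for k, v in enc.items()}

    outputs = model(**enc, labels=enc["input_ids"])
    loss = outputs.loss

    loss.backward()  # accumulation steps = 1, so optimizer step every iteration
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```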

Thanks!

Could you share your full training code here so we can better understand the problem?

I suspect you may be keeping references to each batch (or to the loss/output tensors) on the GPU, and these build up over a few iterations. Check your dataloader and training loop and make sure you release used batches and detach anything you accumulate between steps.
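For example, a pattern like the one below would produce exactly this symptom. I'm only guessing at your loop structure, so the names are illustrative:

```python
# model, optimizer, train_loader assumed to be defined as in your setup
running_loss = 0.0
step_losses = []

for step, batch in enumerate(train_loader):
    batch = {k: v.to("cuda") for k, v in batch.items()}
    outputs = model(**batch, labels=batch["input_ids"])
    loss = outputs.loss

    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    # BAD: `loss` is a CUDA tensor that still references its autograd graph,
    # and `outputs.logits` is a large CUDA tensor; keeping either of them
    # around across iterations makes GPU memory grow until it runs out.
    # step_losses.append(loss)
    # step_losses.append(outputs)

    # GOOD: pull the value out as a plain Python float and drop the GPU
    # references so the allocator can reuse that memory on the next step.
    running_loss += loss.item()
    del outputs, loss
```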