load_in_8bit vs. loading an 8-bit quantized model

As additional context, GPU memory usage takes a sudden jump at some point during training.

[image (2): GPU memory usage during the first training steps]

For the first few steps, GPU memory usage was stable at around 50 GB (using load_in_4bit), but it soon jumped to nearly 80 GB.
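For reference, here's a minimal sketch of the kind of 4-bit loading setup I mean. The model id and config values are placeholders, not necessarily what this run used:

```python
# Sketch only: a common transformers + bitsandbytes 4-bit loading setup.
# The model id and quantization settings below are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",                # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,    # dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-70b-model",                # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```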

[image (3): WandB memory allocation graph for part of this run]

I’m not sure what causes this or at exactly which step it occurs. The WandB memory allocation graph above covers part of this run, in case it’s helpful.
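To pin down the exact step where the jump happens, one option would be a small callback that logs allocator stats after each step. This sketch assumes the HF Trainer; with a custom training loop the same torch.cuda calls can be made directly after each optimizer step:

```python
# Sketch: log per-step CUDA memory to find where usage jumps.
# Assumes an HF Trainer-based loop.
import torch
from transformers import TrainerCallback

class MemoryLoggingCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        # memory_allocated() counts live tensors; memory_reserved()
        # includes the caching allocator's pool, which tracks closer
        # to what nvidia-smi (and WandB's system metrics) report.
        alloc = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"step {state.global_step}: "
              f"allocated={alloc:.1f} GiB, reserved={reserved:.1f} GiB")

# usage: trainer.add_callback(MemoryLoggingCallback())
```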