I have been trying to fine-tune a ~10.7B model on our company’s A100 80GB GPU, but I keep hitting memory problems. I’ve tried to reduce memory usage by quantizing the model with BitsAndBytesConfig (load_in_8bit=True) and dropping the batch size to 1, but I still get a CUDA OutOfMemoryError.
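For reference, this is a minimal sketch of my current setup (the model name, output path, and dataset are placeholders; the real script has a few more Trainer options):

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)

model_name = "our-10.7b-model"  # placeholder for the actual checkpoint

# 8-bit quantization on load, as described above
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

training_args = TrainingArguments(
    output_dir="out",               # placeholder
    per_device_train_batch_size=1,  # already down to 1 and still OOM
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # my tokenized fine-tuning dataset (omitted here)
)
```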
Based on the memory-consumption estimate I get from this formula, I should be using only about 50GB of memory, but that hasn’t been the case.
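I can’t reproduce the formula inline here, but my back-of-the-envelope arithmetic was along these lines (the choice of terms is my own approximation, so if something is missing here, that may be exactly the problem):

```python
params = 10.7e9  # ~10.7B parameters

weights_int8 = params * 1      # 8-bit quantized weights          ~10.7 GB
grads_fp16   = params * 2      # fp16 gradients                   ~21.4 GB
adam_8bit    = params * 2 * 1  # two 8-bit Adam optimizer states  ~21.4 GB

# Activations and CUDA/framework overhead are not included in this estimate.
total_gb = (weights_int8 + grads_fp16 + adam_8bit) / 1e9
print(f"~{total_gb:.0f} GB")  # ~54 GB, in the same ballpark as my ~50GB estimate
```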
I’m wondering whether there is a significant memory difference between loading an already-quantized model directly and loading the base model with on-the-fly BitsAndBytes quantization. If anyone knows of any resources or directions I could look into, I would appreciate the help.
Or is the difference in memory consumption caused by some other factor I haven’t considered, like the fine-tuning process itself (dataset, trainer, etc.)?
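In case it helps narrow things down, this is how I could instrument the run to see how much the training step adds on top of the loaded weights (just a sketch; `model` and `trainer` are the objects from the setup above):

```python
import torch

# Right after loading the quantized model, before any training step:
print(f"model footprint : {model.get_memory_footprint() / 1e9:.2f} GB")
print(f"cuda allocated  : {torch.cuda.memory_allocated() / 1e9:.2f} GB")

torch.cuda.reset_peak_memory_stats()
try:
    trainer.train()
except torch.cuda.OutOfMemoryError:
    # Peak usage up to the point of the OOM, i.e. how much the fine-tuning
    # step itself (gradients, optimizer states, activations) added on top.
    print(f"peak before OOM : {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
    raise
```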