`load_in_8bit` vs. loading an 8-bit quantized model

You could try `load_in_4bit`; it may take the edge off the VRAM usage.
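For reference, a minimal sketch of 4-bit loading with transformers + bitsandbytes (the model ID here is just a placeholder, swap in your own):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: replace with your model

# NF4 quantization with fp16 compute; weights are quantized on the fly at load time
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available devices
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```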

If you are able to share a code snippet, we may be able to see if there are problems leading to unnecessary GPU memory usage.