I have been trying to fine-tune a ~10.7B model on our company’s A100 80GB GPU, but I keep hitting memory problems. I’ve tried to reduce memory usage by quantizing the model with BitsAndBytesConfig (load_in_8bit=True) and dropping the batch size to 1, but I still get a CUDA OutOfMemoryError.
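For reference, this is a minimal sketch of my current setup (the model name, output path, and dataset are placeholders; the real script has a few more Trainer options):

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)

model_name = "our-10.7b-model"  # placeholder for the actual checkpoint

# 8-bit quantization on load, as described above
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

training_args = TrainingArguments(
    output_dir="out",               # placeholder
    per_device_train_batch_size=1,  # already down to 1 and still OOM
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # my tokenized fine-tuning dataset (omitted here)
)
```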
Based on the memory-consumption estimate I get from this formula, I should be using only about 50GB of memory, but that hasn’t been the case.
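I can’t reproduce the formula inline here, but my back-of-the-envelope arithmetic was along these lines (the choice of terms is my own approximation, so if something is missing here, that may be exactly the problem):

```python
params = 10.7e9  # ~10.7B parameters

weights_int8 = params * 1      # 8-bit quantized weights          ~10.7 GB
grads_fp16   = params * 2      # fp16 gradients                   ~21.4 GB
adam_8bit    = params * 2 * 1  # two 8-bit Adam optimizer states  ~21.4 GB

# Activations and CUDA/framework overhead are not included in this estimate.
total_gb = (weights_int8 + grads_fp16 + adam_8bit) / 1e9
print(f"~{total_gb:.0f} GB")  # ~54 GB, in the same ballpark as my ~50GB estimate
```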
I’m wondering whether there is a significant memory difference between loading an already-quantized model directly and loading the base model with on-the-fly BitsAndBytes quantization. If anyone knows of any resources or directions I could look into, I would appreciate the help.
Or is the difference in memory consumption caused by some other factor I haven’t considered, like the fine-tuning process itself (dataset, trainer, etc.)?
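In case it helps narrow things down, this is how I could instrument the run to see how much the training step adds on top of the loaded weights (just a sketch; `model` and `trainer` are the objects from the setup above):

```python
import torch

# Right after loading the quantized model, before any training step:
print(f"model footprint : {model.get_memory_footprint() / 1e9:.2f} GB")
print(f"cuda allocated  : {torch.cuda.memory_allocated() / 1e9:.2f} GB")

torch.cuda.reset_peak_memory_stats()
try:
    trainer.train()
except torch.cuda.OutOfMemoryError:
    # Peak usage up to the point of the OOM, i.e. how much the fine-tuning
    # step itself (gradients, optimizer states, activations) added on top.
    print(f"peak before OOM : {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
    raise
```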