Load_in_8bit vs. loading 8-bit quantized model

I have been trying to fine-tune a ~10.7B model on our company’s A100 80GB GPU, but I keep running into memory problems. I’ve tried to reduce memory usage by quantizing the model with BitsAndBytesConfig (load_in_8bit=True) and lowering the batch size to 1, but I’m still hitting an OutOfMemoryError.

Based on memory estimates derived from this formula, I should only be using about 50GB of memory, but that hasn’t been the case.

I’m wondering whether there is a significant difference between loading a quantized model directly and loading a base model with BitsAndBytes quantization. If anyone knows of any resources or directions I can look into, I would appreciate the help.

Or is the difference in memory consumption caused by some other factor I haven’t considered, like the fine-tuning process itself (dataset, trainer, etc.)?

You could try load_in_4bit; it may take the edge off the VRAM usage.

If you are able to share a code snippet, we may be able to see if there are problems leading to unnecessary GPU usage.

I want to avoid using 4-bit quantization because I’ve read it can lead to significant performance degradation. This is the part of the source code that loads the model.
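Reduced to the essentials, it looks something like this (a rough sketch, not my exact code; the model ID, dtype, and device_map are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-10.7b-model"  # placeholder for the actual checkpoint

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,  # dtype of the non-quantized modules (assumption)
    device_map={"": 0},         # whole model on one GPU; with accelerate DDP this is the local process index
)
```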

For additional context, I use accelerate to enable distributed data parallel, and the dataset I use for fine-tuning has approximately 1k entries. I use the datasets.map() function along with the model tokenizer to generate data in the form of { "input_ids" : List[int], "attention_mask" : List[int], "labels" : List[int] } to feed to the transformers Trainer.
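The preprocessing step is roughly this (sketch; the file path and the "text" column name are placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/some-10.7b-model")  # placeholder ID

def tokenize(example):
    # "text" stands in for whatever column holds the full prompt + response string
    tokens = tokenizer(example["text"])
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: labels mirror input_ids
    return tokens

dataset = load_dataset("json", data_files="train.json", split="train")  # ~1k entries
tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
```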

Also worth noting: GPU memory usage seems to take a leap at some point during training.

[Screenshot: GPU memory usage during training]

For the first few steps, GPU memory usage seemed stable at around 50GB (using load_in_4bit), but it soon jumped to nearly 80GB.

[Screenshot: WandB GPU memory allocation graph]

I’m not sure what causes this or at which step exactly it occurs. The screenshot above is the WandB memory allocation graph for part of this run, just in case it is helpful.

Use LoRA; it works with as little as a T4 (15 GB), since you only fine-tune the small adapter and then merge it with the big boi.
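Rough sketch with peft (rank and target modules are just example values; `model` is the quantized base model you already loaded):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` = the 8-bit/4-bit base model loaded earlier with BitsAndBytesConfig
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                      # adapter rank (example value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # varies by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable

# after training: model.merge_and_unload() folds the adapter back into the base
# weights (usually after reloading the base model in fp16/bf16 first)
```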

I’ve been using QLoRA, and was wondering whether there was an issue with the quantization part.
It turns out the problem was the dataset I was using: the token count for some entries was too large, which caused the OOM error.
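For anyone hitting the same thing, capping the token count in the tokenization step is one way to guard against it (the 2048 below is just an example limit; `tokenizer` and `dataset` are the same objects as in the map() snippet above):

```python
max_len = 2048  # example cap; pick whatever fits the GPU budget

def tokenize(example):
    # truncation keeps any over-long entry from blowing up activation memory
    tokens = tokenizer(example["text"], truncation=True, max_length=max_len)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# or, instead of truncating, drop the over-long rows entirely
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) <= max_len)
```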

But I’m still left wondering about the difference between loading an already-quantized model from the Hugging Face Hub and loading a model with a BitsAndBytes quantization configuration. If anyone knows of any resources, I would very much appreciate the help.
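For concreteness, the two loading paths I mean look roughly like this (repo IDs are placeholders, and I’m assuming the pre-quantized repo ships its own quantization config, e.g. a GPTQ or serialized 8-bit export):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# (a) base checkpoint, quantized on the fly at load time by bitsandbytes
model_a = AutoModelForCausalLM.from_pretrained(
    "some-org/base-10.7b-model",  # placeholder
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# (b) checkpoint that was already quantized and uploaded to the Hub; its stored
#     quantization config is picked up automatically by from_pretrained
model_b = AutoModelForCausalLM.from_pretrained(
    "some-org/base-10.7b-model-8bit",  # placeholder
    device_map="auto",
)
```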