This is the part of the source code that loads the model. I want to avoid 4-bit quantization because I've read it can significantly degrade model quality.
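In other words, I load the unquantized weights directly, along these lines (simplified sketch; the checkpoint name is a placeholder for my actual model, and I use bfloat16 rather than a 4-bit BitsAndBytesConfig):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name for illustration.
model_name = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load in bfloat16 instead of passing a 4-bit quantization config,
# so the weights stay unquantized.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
)
```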
For additional context, I use accelerate to enable distributed data parallel (DDP) training, and my fine-tuning dataset has approximately 1k entries. I preprocess it with datasets.map() and the model's tokenizer to produce examples of the form { "input_ids": List[int], "attention_mask": List[int], "labels": List[int] }, which I then feed to the transformers Trainer (see the sketch below).
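The preprocessing step looks roughly like this (a sketch of my setup; the "text" column, checkpoint name, data file path, and max_length are placeholders for my actual values):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder checkpoint name for illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def tokenize(example):
    # Tokenize the raw text; for causal-LM fine-tuning the labels
    # are a copy of input_ids (the model shifts them internally).
    tokens = tokenizer(
        example["text"],
        truncation=True,
        max_length=1024,
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Placeholder data file; my real dataset has ~1k entries.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
```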