Low bf16 performance on TPU, int4 vs int8 quantization

Hi, I am trying to finetune Llama 3 with LoRA using the most recent versions of peft, accelerate, torch and bitsandbytes, and I am struggling with the following:

  1. In the most recent version, bitsandbytes is missing

bnb_8bit_compute_dtype=torch.bfloat16

in the BitsAndBytesConfig for 8-bit; the compute dtype setting is present only for 4-bit, and I could not find out whether the 4-bit and 8-bit options can be used interchangeably.
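
For context, here is roughly the kind of config I mean (just a sketch; the model id is only an example, and the missing 8-bit compute dtype kwarg is exactly what I cannot find in the current API):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit: the compute dtype is configurable here
bnb_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# 8-bit: no bnb_8bit_compute_dtype equivalent that I can find
bnb_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # example model id
    quantization_config=bnb_4bit,
    device_map="auto",
)
```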

  2. Is there any comparison of finetuning performance between loading the model in 4-bit and in 8-bit?

  3. For some reason, setting the compute dtype to bfloat16 in the config makes everything incredibly slow on TPU with the most recent version of the libraries (on Kaggle). It is 10 times slower than doing exactly the same thing with float16!

bnb_4bit_compute_dtype=torch.bfloat16
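
To show the kind of gap I mean, here is a rough microbenchmark sketch (sizes and iteration counts are arbitrary, and it assumes a GPU; on the Kaggle TPU the device would come from torch_xla's xla_device() instead of "cuda"):

```python
import time
import torch

def avg_matmul_time(dtype, device, n=4096, iters=50):
    """Rough average wall-clock time of an n x n matmul in the given dtype."""
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    for _ in range(5):  # warmup
        _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters

device = "cuda"  # assumed GPU; on TPU you would use an XLA device and force execution between timings
print("bf16:", avg_matmul_time(torch.bfloat16, device))
print("fp16:", avg_matmul_time(torch.float16, device))
```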

  4. If I finetune a model with LoRA on GPU using compute dtype bfloat16, and then for inference load it in float16 on TPU (because, again, bfloat16 is slow as hell), the quality will degrade, right? Since LoRA also adapts to the rounding errors introduced during quantization?
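
Concretely, this is roughly how I would load it for inference (model id and adapter path are placeholders; whether merging the adapter like this after bfloat16 training loses quality in float16 is exactly what I am asking about):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B"   # example base model
adapter_dir = "./llama3-lora-adapter"    # placeholder path to the trained LoRA adapter

# load the base model in float16 for inference, even though training used bfloat16 compute dtype
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir)
model = model.merge_and_unload()  # fold the LoRA weights into the base weights
model.eval()
```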

Thanks in advance!