When loading an LLM using int8 quantization as described in LLM.int8(), how are the fp16 weights handled?
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
from bitsandbytes.nn.modules import Linear8bitLt

# Request LLM.int8() quantization, keep fp16 weights, and use a tiny outlier
# threshold so that effectively every feature dimension counts as an outlier.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_has_fp16_weight=True,
    llm_int8_threshold=0.0000000001,
)
print(quantization_config)

model = AutoModelForCausalLM.from_pretrained(
    "cache/bigscience/bloom-1b7",
    quantization_config=quantization_config,
)

# Print has_fp16_weights for every quantized linear layer.
for k, v in model.named_modules():
    if isinstance(v, Linear8bitLt):
        print("===")
        print(k)
        print(v.weight.has_fp16_weights)
Part of the output is:
transformer.h.23.self_attention.dense
False
===
transformer.h.23.mlp.dense_h_to_4h
False
===
transformer.h.23.mlp.dense_4h_to_h
False
Even though I specified llm_int8_has_fp16_weight as True, the has_fp16_weights attribute of every Linear8bitLt module's weight is printed as False.
Does it internally hold fp16 weights (for the outliers)? How can I access these values?
I'm so confused.
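For reference, here is how I tried to peek at the quantized state. From reading the bitsandbytes source I am guessing that the int8 data and its row-wise scales live in weight.CB / weight.SCB and that each module keeps a MatmulLtState in .state; these attribute names are my assumption and may differ between bitsandbytes versions.

# Sketch, assuming weight.CB / weight.SCB / module.state exist in this
# bitsandbytes version (attribute names are my guess, not confirmed).
for k, v in model.named_modules():
    if isinstance(v, Linear8bitLt):
        print("===", k)
        print("storage dtype:", v.weight.dtype)  # int8 or fp16?
        print("has CB (int8 data):", getattr(v.weight, "CB", None) is not None)
        print("has SCB (row-wise scales):", getattr(v.weight, "SCB", None) is not None)
        print("state.has_fp16_weights:", getattr(v.state, "has_fp16_weights", None))
        break  # one layer is enough to see what is actually stored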