Differences between GPTQ and NF4 with bitsandbytes

What are the key differences between GPTQ and NF4 quantisation with bitsandbytes? Are there reasons to expect one to have advantages over the other?

I’ve been benchmarking GPTQ against bitsandbytes with NF4. See below for some data.

Perplexity, Memory, and Speed Results

fLlama-7B (2 GB shards), bitsandbytes NF4 quantisation:

  • PPL: 8.8, GPU memory: 4.7 GB, speed: 12.2 tokens/s

Llama-7B-GPTQ-4bit-128:

  • PPL: 9.3, GPU memory: 4.8 GB, speed: 21.4 tokens/s

fLlama-13B (4 GB shards), bitsandbytes NF4 quantisation:

  • PPL: 8.0, GPU memory: 8.2 GB, speed: 7.9 tokens/s

Llama-13B-GPTQ-4bit-128:

  • PPL: 7.8, GPU memory: 8.5 GB, speed: 15.0 tokens/s
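
For comparison, here’s roughly how the GPTQ side can be loaded (via AutoGPTQ) and how a tokens/s figure can be taken. This is only a sketch: the repo name, prompt, and generation settings below are placeholders, not necessarily the exact ones I used.

import time
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Placeholder repo name; substitute the GPTQ checkpoint you want to test.
gptq_id = "TheBloke/LLaMa-7B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(gptq_id)
gptq_model = AutoGPTQForCausalLM.from_quantized(
    gptq_id,
    device="cuda:0",
    use_safetensors=True,
)

# Rough speed check: time one greedy generation and divide new tokens by seconds.
inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda:0")
torch.cuda.synchronize()
start = time.time()
out = gptq_model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
new_tokens = out.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens / (time.time() - start):.1f} tokens/s")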

And here is my bitsandbytes (bnb) config:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # quantises the quantisation constants, saving roughly 0.4 bits/param with minimal quality loss
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# model_id and cache_dir are defined elsewhere in my script
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # for inference use "auto"; for training use device_map={"": 0}
    trust_remote_code=True,
    cache_dir=cache_dir,
)
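
For reference, here’s a minimal sketch of how perplexity can be measured with a sliding window over a held-out corpus, using the model loaded above. The dataset (WikiText-2), context length, and stride below are illustrative choices, not necessarily the ones behind the numbers above; absolute PPL shifts with these settings, so comparisons only hold when they match.

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed evaluation corpus; any held-out text works, but PPL values
# are only comparable when measured on the same data.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
tokenizer = AutoTokenizer.from_pretrained(model_id)
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 2048  # Llama context window
stride = 512       # assumed stride; smaller is more accurate but slower
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # tokens actually scored in this window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask context tokens out of the loss

    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss)

    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
print(f"PPL: {ppl.item():.2f}")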