Quantized model with LoRA takes much more GPU memory than the un-quantized model with LoRA (E-5-Large embedding transformer)

I am trying to use QLoRA, which in theory should use less memory than LoRA alone.

I am working through this tutorial from the PEFT library.

The tutorial code works well on a 16 GB NVIDIA Tesla GPU with a batch size of 96.

But with the same batch size, loading the model in 4-bit gives me an out-of-memory error.

Why is this happening?

        import torch
        from transformers import AutoModel, BitsAndBytesConfig

        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )

        # device_map is an argument of from_pretrained, not of BitsAndBytesConfig
        self.model = AutoModel.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map="auto",
            trust_remote_code=True,
        )
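
For completeness, QLoRA then attaches LoRA adapters on top of this quantized base. A minimal sketch with PEFT, assuming a BERT-style encoder like E-5-Large (the r, lora_alpha, and target_modules values here are illustrative, not necessarily the tutorial's exact settings):

        from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

        # Freezes the quantized base weights and prepares the model for k-bit training
        model = prepare_model_for_kbit_training(self.model)

        lora_config = LoraConfig(
            r=8,                                # illustrative rank
            lora_alpha=16,
            lora_dropout=0.1,
            target_modules=["query", "value"],  # BERT-style attention projections
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()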

What model are you trying to use?
What LoRA r are you using?
Do you have other models loaded in your GPU memory?

As given in the tutorial, I am using E-5-Large. The LoRA config is the same as in the tutorial.

I double-checked the GPU consumption, and the quantized model takes more memory than the normal model with LoRA.
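
One way to make that comparison apples-to-apples is to use torch.cuda's peak-memory counters rather than nvidia-smi, since nvidia-smi also counts the CUDA context and the allocator's cached-but-unused blocks. A minimal sketch (the training step in the middle is a placeholder):

        import torch

        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

        # ... run one training step (forward + backward) at the same batch size ...

        # Peak memory actually allocated by tensors, excluding the allocator cache
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        print(f"Peak allocated: {peak_gib:.2f} GiB")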

I tried both 8-bit and 4-bit, and it is still the same. Any reason for this?
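
(An 8-bit load would be configured along these lines; a sketch, since the thread doesn't show the exact 8-bit code:)

        bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
        self.model = AutoModel.from_pretrained(
            model_name,
            quantization_config=bnb_config_8bit,
            device_map="auto",
            trust_remote_code=True,
        )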

Did it train faster, though?

And did you find the answer to your problem anywhere else?
Thank you.

The same question is also answered here: Results are inconsistent and is not reliable enough · Issue #1 · RahulSChand/gpu_poor · GitHub