I have been running inference on a 7B LLM using TGI on RunPod. Until now I have been loading the model directly, without quantisation.
Recently I tried quantising the model to 4-bit and 8-bit using TGI on the same GPU configuration. After quantisation, the model takes more inference time in both the 8-bit and 4-bit cases.
I was using bitsandbytes for quantisation.
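For reference, my launch command looks roughly like the sketch below (the model id, volume path, and port are placeholders, not my exact setup); for 4-bit I switch `--quantize bitsandbytes` to `--quantize bitsandbytes-nf4`:

```bash
# Rough sketch of the TGI launch; model id, volume, and port are placeholders.
model=some-org/some-7b-model    # placeholder for the 7B model I am serving
volume=$PWD/data                # placeholder cache directory mounted into the container

docker run --gpus all -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model \
    --quantize bitsandbytes     # 8-bit run; 4-bit run uses --quantize bitsandbytes-nf4
```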
Is this expected, or am I missing something? Shouldn't quantisation also reduce the model's inference time?