4-bit quantization on an Inference Endpoint

Is there a way to quantize a model to 4-bit and run it on an Inference Endpoint? Currently it defaults to 8-bit quantization. One more thing: can I do my own custom quantization in the handler.py file, for example along the lines of the sketch below?
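
Here is a minimal sketch of what I have in mind for handler.py, assuming the endpoint image ships with transformers, torch, and bitsandbytes, and using 4-bit NF4 loading via BitsAndBytesConfig. The EndpointHandler class with `__init__(self, path)` and `__call__(self, data)` is the standard custom-handler interface; the rest (generation settings, input/output keys) is just illustrative:

```python
# handler.py -- minimal sketch of a custom 4-bit handler for an Inference Endpoint
# Assumes transformers, torch, and bitsandbytes are installed in the endpoint image.
from typing import Any, Dict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


class EndpointHandler:
    def __init__(self, path: str = ""):
        # 4-bit NF4 quantization via bitsandbytes, instead of the 8-bit default
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(
            path,
            quantization_config=bnb_config,
            device_map="auto",
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Endpoint requests arrive as a dict; "inputs" is the conventional key
        prompt = data.get("inputs", "")
        tokens = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**tokens, max_new_tokens=256)
        text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return {"generated_text": text}
```

Would something like this work on an Inference Endpoint, or does the platform override the quantization settings?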