I'm trying to quantize a CodeLlama model to int8 using SmoothQuant. I have a circular problem: I need to calibrate the model in fp16, but the reason I want to quantize it in the first place is that it won't fit in GPU memory.
I'm using this code to quantize my model, and with device_map="auto" it's taking advantage of accelerate to offload parts of the model to host memory. The model calibrates just fine, but I don't know what to do later, when the quantized weights are computed. On line 78, which runs this code: scales[layer_name_qkv]["x"] = scales[layer_name_q]["x"] / smoother
I get this error: RuntimeError: Tensor on device meta is not on the expected device cuda:0!
That makes sense, but my question is: how do I move the tensor from "meta" to "cuda:0"? The to() call doesn't work because meta tensors have no data. It seems that accelerate/transformers automatically handles moving the data around with module hooks, but how can I achieve what I want here when I'm working with the raw tensors?
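For reference, here is a minimal sketch of the behaviour I mean, in plain PyTorch without accelerate. The tensor names and shapes are made up for illustration and are not taken from the quantization script. It shows that a meta tensor only carries shape and dtype, that to() cannot copy it out, and that allocating a tensor of the same shape on a real device gives storage but not the original values.

```python
import torch

# Hypothetical stand-in for an offloaded weight: a tensor on the "meta"
# device has a shape and dtype but no underlying storage.
placeholder = torch.empty(4, 8, dtype=torch.float32, device="meta")

# Trying to copy it to a real device fails, because there is no data to copy.
try:
    placeholder.to("cpu")
    copy_error = None
except (NotImplementedError, RuntimeError) as exc:
    copy_error = type(exc).__name__

# Allocating a same-shaped tensor on a real device does work, but its
# contents are uninitialized; the real values still have to be loaded
# from wherever they were offloaded to.
materialized = torch.empty_like(placeholder, device="cpu")

print(placeholder.is_meta, copy_error, materialized.shape, materialized.device)
```

So the shape information survives on the meta tensor, but the actual weight values have to come from somewhere else, which is the part I don't know how to do by hand.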