I’m trying to quantize a CodeLlama model to int8 using SmoothQuant. I’ve run into a circular problem: calibration has to run on the fp16 model, but the whole reason I want to quantize is that the fp16 model won’t fit in GPU memory.
I’m using this code to quantize my model, and with device_map="auto" it relies on Accelerate to offload parts of the model to host memory. The model calibrates just fine, but I don’t know what to do afterwards, when the quantized weights are computed. On line 78, which runs this code:
scales[layer_name_qkv]["x"] = scales[layer_name_q]["x"] / smoother
I get this error:
RuntimeError: Tensor on device meta is not on the expected device cuda:0!
That makes sense, but my question is: how do I move the tensor from 'meta' to 'cuda:0'? Calling .to() doesn’t work because meta tensors have no data. Accelerate/Transformers seems to handle moving the data around automatically with module hooks, but how can I get the real values onto the GPU when I’m working with the raw tensors directly?
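For reference, here is a minimal sketch of what I mean by meta tensors having no data. It is not taken from the quantization script: the name meta_weight and the 4x4 shape are made up for illustration, and I use 'cpu' instead of 'cuda:0' as the target so it runs on a machine without a GPU.

```python
import torch

# Hypothetical stand-in for an offloaded weight: it has a shape and a dtype,
# but no underlying storage.
meta_weight = torch.empty(4, 4, dtype=torch.float16, device="meta")

print(meta_weight.is_meta)                    # metadata is still available
print(meta_weight.shape, meta_weight.dtype)

# Copying out of the meta device fails because there are no values to copy.
copy_failed = False
try:
    meta_weight.to("cpu")
except (NotImplementedError, RuntimeError) as err:
    copy_failed = True
    print(f"{type(err).__name__}: {err}")

# Allocating a same-shaped tensor on a real device does work, but its contents
# are uninitialized. The actual weights still have to be loaded from wherever
# they were offloaded to.
placeholder = torch.empty_like(meta_weight, device="cpu")
print(placeholder.device, placeholder.shape)
```

So allocating memory on the target device is not the hard part; what I’m missing is how to get the real values into it.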