Load quantized model in memory

Hey guys,

I’m looking for some guidance on a particular issue.
I’m not using HF’s Trainer, but my own PyTorch training loop. The workflow looks like this:

from copy import deepcopy

best_score = 0
for epoch in epochs:
    # Training
    # ...
    if metric > best_score:
        best_score = metric  # keep track of the best score so far
        best_model = deepcopy(model.state_dict())  # note: state_dict() must be called

And then, to evaluate the best model:

model.load_state_dict(best_model)

This approach works perfectly when the model is created with AutoModelForSequenceClassification.from_pretrained(modelcp, num_labels=labels). However, when I create it with AutoModelForSequenceClassification.from_pretrained(modelcp, num_labels=labels, load_in_8bit=True), I encounter the following error:

RuntimeError: Loading a quantized checkpoint into non-quantized Linear8bitLt is not supported. Please call module.cuda() before module.load_state_dict()
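
For what it’s worth, the message itself suggests moving the module onto the GPU before restoring the weights. Here is a minimal, untested sketch of what I think it is asking for (single-GPU setup assumed; modelcp, labels, and best_model are the same variables as in my snippets above):

from transformers import AutoModelForSequenceClassification

# Untested sketch: re-create the quantized model, move it to CUDA first,
# and only then restore the best weights, as the error message suggests.
model = AutoModelForSequenceClassification.from_pretrained(
    modelcp, num_labels=labels, load_in_8bit=True
)
model.cuda()  # the step the error message asks for
model.load_state_dict(best_model)

I haven’t been able to confirm that this works, which is partly why I’m asking.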

Does anyone have any insights on how to resolve this (or whether the sketch above is on the right track)? Cheers!

I have the same issue. Could anyone help?