I am working in a Google Colab session with a HuggingFace DistilBERT model which I have fine-tuned against some data.
I am getting the following error when I try to evaluate a restored copy of my model:-
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument index in method wrapper_index_select)
I run the following piece of code TWICE. Once just after fitting the model, and then once after saving and restoring the model.
metric = load_metric("accuracy")
model.eval()
for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
metric.compute()
If I run the evaluation straight after training there is no problem:-
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:10: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
# Remove the CWD from sys.path while we load stuff.
{'accuracy': 0.6692307692307692}
If I run the above code after saving and restoring the model then I get the error quoted above, the full traceback for which is:-
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:10: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
# Remove the CWD from sys.path while we load stuff.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-102-33bf1579632a> in <module>()
4 batch = {k: v.to(device) for k, v in batch.items()}
5 with torch.no_grad():
----> 6 outputs = model(**batch)
7
8 logits = outputs.logits
8 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2041 # remove once script supports set_grad_enabled
2042 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2044
2045
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument index in method wrapper_index_select)
The steps I take for saving and restoring are as follows:-
1. Write the model to the Colab session's local disc.
2. Copy from the local disc (of the Colab session) to Google Drive.
3. Copy back from Google Drive to the Colab session's local disc.
4. Use the copy on the local disc to load the model.
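Steps 2 and 3 are plain directory copies; a minimal sketch of what I do (the Drive paths here are placeholders, not my exact ones, and they assume Drive is already mounted at /content/drive):

```python
import shutil

def copy_dir(src: str, dst: str) -> str:
    """Recursively copy the directory `src` to a new directory `dst`."""
    shutil.copytree(src, dst)  # dst must not already exist
    return dst

# Step 2: local disc -> Google Drive
# copy_dir('./a_local_copy/', '/content/drive/MyDrive/a_drive_copy/')
# Step 3: Google Drive -> local disc
# copy_dir('/content/drive/MyDrive/a_drive_copy/', './a_restored_copy/')
```

So the round trip through Drive should preserve the saved files byte-for-byte.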
The code for step 1 has been adapted from that at run_glue.py and is as follows:-
# Saving best-practices: if you use default names for the model, you can reload it using from_pretrained()
output_dir = './a_local_copy/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# logger.info("Saving model checkpoint to %s", args.output_dir)
print("Saving model checkpoint to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Good practice: save your training arguments together with the trained model
# torch.save(args, os.path.join(output_dir, 'training_args.bin'))
Step 4 is straightforward:-
model = AutoModelForSequenceClassification.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)
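One thing I have noticed while reading around: from_pretrained() appears to load the weights onto the CPU, whereas my batches are moved to device, so I suspect the restored model may need an explicit model.to(device). A minimal sketch of the check I mean (using a stand-in nn.Linear rather than the real DistilBERT, so it runs on any machine):

```python
import torch
from torch import nn

def move_to(model: nn.Module, device: torch.device) -> str:
    """Move a model's parameters to `device` and report where they now live."""
    model.to(device)
    return str(next(model.parameters()).device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 2)        # stand-in for the restored model
print(move_to(model, device))  # 'cuda:0' on a GPU runtime, 'cpu' otherwise
```

Is that the right direction, or is something else going on?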
I am happy to post further code if you can give me some guidance as to what would be useful.