I am encountering an issue when using the Dinov2ForImageClassification model from the Hugging Face Transformers library, as outlined in the documentation here. Despite following the provided code example and using the latest Transformers version, the resulting model is performing binary classification instead of the expected ImageNet 1000-way classification. Specifically, the length of the logits returned by the model (logits) is 2, whereas it should be 1000 for ImageNet classification.
Here is my code:
from transformers import AutoImageProcessor, Dinov2ForImageClassification
import torch
from datasets import load_dataset
# Load a sample image dataset (in this case, "huggingface/cats-image")
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]
# Load the image processor and the Dinov2ForImageClassification model
image_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = Dinov2ForImageClassification.from_pretrained("facebook/dinov2-base")
# Prepare the input and obtain logits
inputs = image_processor(image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# The expected number of labels for ImageNet classification should be 1000
predicted_label = logits.argmax(-1).item()
However, I get the following warning when loading the model:
Some weights of Dinov2ForImageClassification were not initialized from the model checkpoint at facebook/dinov2-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Additionally, the shape of logits is torch.Size([1, 2]), indicating that the model has only 2 labels instead of the expected 1000 as specified by model.num_labels.
I’m seeking guidance on how to correctly use Dinov2ForImageClassification for ImageNet 1000-way classification as mentioned in the documentation.