DINOv2 for classification has the wrong number of labels

I am encountering an issue when using the Dinov2ForImageClassification model from the Hugging Face Transformers library, as outlined in the documentation here. Although I followed the provided code example and am using the latest Transformers version, the resulting model performs binary classification instead of the expected ImageNet 1000-way classification. Specifically, the logits returned by the model have 2 entries per image, whereas ImageNet classification requires 1000.

Here is my code:

from transformers import AutoImageProcessor, Dinov2ForImageClassification
import torch
from datasets import load_dataset

# Load a sample image dataset (in this case, "huggingface/cats-image")
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

# Load the image processor and the Dinov2ForImageClassification model
image_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = Dinov2ForImageClassification.from_pretrained("facebook/dinov2-base")

# Prepare the input and obtain logits
inputs = image_processor(image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The expected number of labels for ImageNet classification should be 1000
predicted_label = logits.argmax(-1).item()

However, I get the following warning:

Some weights of Dinov2ForImageClassification were not initialized from the model checkpoint at facebook/dinov2-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Additionally, the shape of logits is torch.Size([1, 2]) and model.num_labels is 2, so the model has only 2 labels instead of the expected 1000.

I’m seeking guidance on how to correctly use Dinov2ForImageClassification for ImageNet 1000-way classification as mentioned in the documentation.

If I instead load the model with num_labels set explicitly:

model = Dinov2ForImageClassification.from_pretrained("facebook/dinov2-base", num_labels=1000)

the label dimension is corrected, but the classifier weights are still newly initialized rather than loaded from the checkpoint. I want to use the model for classification without any additional training while still benefiting from the pretrained weights.

Solved here: