Can't Load ViT Model for Fine Tuning

Just trying to load some of the Google ViT models for fine-tuning. My code is as follows:

from transformers import ViTFeatureExtractor

model_name_or_path = 'google/vit-base-patch16-224-in21k'
feature_extractor = ViTFeatureExtractor.from_pretrained(model_name_or_path)

from transformers import ViTForImageClassification, TFViTForImageClassification

labels = ['Background', 'Pedestrian', 'Sign', 'TrafficLight', 'Vehicle'] #ds['train'].features['labels'].names

model = TFViTForImageClassification.from_pretrained(
    model_name_or_path,
    num_labels=len(labels),
    id2label={str(i): c for i, c in enumerate(labels)},
    label2id={c: str(i) for i, c in enumerate(labels)}
)

I can only use the in21k models for some reason. When I change model_name_or_path to any other Google ViT checkpoint (google/vit-base-patch16-224, google/vit-large-patch16-224, etc.), I get the following error at the from_pretrained step:

ValueError: cannot reshape array of size 768000 into shape (768,5)

Or a different number based on the model size.

Anyone know how to load these for fine tuning?

Hi,

The reason this works for google/vit-base-patch16-224-in21k but not for checkpoints like google/vit-base-patch16-224 is that the latter include a fine-tuned classification head on top (namely, a head with 1000 output neurons, since these checkpoints were fine-tuned on ImageNet-1k). That head's weight matrix has shape (768, 1000), i.e. 768,000 values, which is exactly the array the error says can't be reshaped into (768, 5).

However, since you'd like to use this model but change the number of output neurons to 5, you need to pass the additional ignore_mismatched_sizes=True argument. This ensures that the fine-tuned head with 1000 output neurons is replaced by a randomly initialized head with 5 output neurons:

model = TFViTForImageClassification.from_pretrained(
    model_name_or_path,
    num_labels=len(labels),
    id2label={str(i): c for i, c in enumerate(labels)},
    label2id={c: str(i) for i, c in enumerate(labels)},
    ignore_mismatched_sizes=True, # add this to replace the head
)
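
As a quick sanity check (a minimal sketch, assuming the model above loaded without errors and that TensorFlow is available), you can run a dummy batch through the model and confirm the new head returns one logit per label:

import tensorflow as tf

# Dummy batch of pixel values in channels-first format (batch, channels, height, width),
# matching the 224x224 inputs the ViT image processor would produce.
dummy_pixel_values = tf.random.uniform((1, 3, 224, 224))

outputs = model(pixel_values=dummy_pixel_values)
print(outputs.logits.shape)  # expected: (1, 5), one logit per label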

Gotcha, I assumed there would be a setting to just remove the head.

Thank you!