I have used Google’s pretrained ‘vit-base-patch16-224-in21k’ ViT model on my image dataset.
I have sequences of images (about 1,315 images in total, so not a large dataset) that I am trying to classify as ‘human present’ vs. ‘no human present’, i.e. binary classification.
Model: “model_6”
_________________________________________________________________
Layer (type)                 Output Shape               Param #
=================================================================
input_7 (InputLayer)         [(None, 3, 224, 224)]      0
vit (TFViTMainLayer)         TFBaseModelOutputWithPool  86389248
global_average_pooling1d (Gl (None, 768)                0
dense_12 (Dense)             (None, 256)                196864
dropout_40 (Dropout)         (None, 256)                0
outputs (Dense)              (None, 1)                  257
=================================================================
Total params: 86,586,369
Trainable params: 197,121
Non-trainable params: 86,389,248
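For reference, the trainable head on top of the frozen ViT backbone can be sketched like this. It takes the backbone's sequence of 768-dim token embeddings as input (197 tokens for a 224×224 image with 16×16 patches: 196 patches plus the [CLS] token). The relu activation and 0.4 dropout rate are assumptions, since the summary only shows layer shapes, but the trainable parameter count matches:

```python
import tensorflow as tf

# Stand-in for the frozen ViT backbone's output: (batch, 197, 768) features.
features = tf.keras.Input(shape=(197, 768))
x = tf.keras.layers.GlobalAveragePooling1D()(features)
# Dense(256) on a 768-dim input: 768*256 weights + 256 biases = 196,864 params.
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dropout(0.4)(x)  # dropout rate is an assumption
# Output Dense(1): 256 weights + 1 bias = 257 params.
outputs = tf.keras.layers.Dense(1, activation="sigmoid", name="outputs")(x)
head = tf.keras.Model(features, outputs)
print(head.count_params())  # 197121, matching "Trainable params" above
```

So only the ~197k-parameter head is being trained; the 86M-parameter ViT stays frozen, exactly as the summary's non-trainable count shows.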
After training, I get 100% accuracy on the training, validation, and test sets!
My accuracy and loss curves look like this.
How is this possible?