I successfully trained a ViT classifier based on google/vit-large-patch16-224
by following the HF tutorial on classifying cancer cells.
I have about 45,000 images per class (3 classes). I tested the model on a test set and the results are pretty bad: ~60% precision, while I get around 91% with a CNN (YOLOv8). Both models were trained on the same dataset; the only difference is that the CNN used 640x640 images while the ViT resized them to 224x224.
For reference, my model classifies specific car parts into 3 classes, and each class contains possibly hundreds of variations, which is made even worse by the parts being in various states of rust or damage.
Is there a way to use bigger pictures (I don't mind the extra training time)? I have a feeling that once resized to 224x224, the images are too small for the model to learn to differentiate between the classes.
Or do I just not have enough samples?
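In case it helps frame the first question: assuming the Hugging Face `transformers` ViT implementation, the model can accept inputs larger than the pretraining resolution by interpolating the position embeddings at runtime (`interpolate_pos_encoding=True`), rather than resizing images down to 224x224. The sketch below uses a tiny random-weight config just to show the mechanics; in practice one would load google/vit-large-patch16-224 the same way.

```python
import torch
from transformers import ViTConfig, ViTModel

# Tiny config so the sketch runs quickly; image_size=224 mimics the
# resolution the real checkpoint was pretrained at.
config = ViTConfig(
    image_size=224,
    patch_size=16,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = ViTModel(config)

# 448x448 input: interpolate_pos_encoding=True rescales the 224-sized
# position-embedding grid to the new 28x28 patch grid on the fly.
pixels = torch.randn(1, 3, 448, 448)
out = model(pixels, interpolate_pos_encoding=True)

# (448 / 16)^2 patches + 1 [CLS] token = 785 positions
print(out.last_hidden_state.shape)  # torch.Size([1, 785, 32])
```

An alternative, if it applies, would be a checkpoint already pretrained at a higher resolution (e.g. google/vit-large-patch16-384), which avoids the interpolation step entirely.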