HuggingFace ViT 10x Slower than Native Tensorflow (Not Fully Using GPU?)

Any ideas what could be causing this?

I had been using another implementation of ViT. After switching to the Transformers library with ‘google/vit-base-patch16-224-in21k’, training is about 10x slower. It also takes almost 10 minutes to move on from one epoch to the next.

Old implementation: CPU usage is around 15-25% and GPU usage is close to 100% for most of training.

HF ViT: CPU usage is similar, but GPU usage is low and only sporadically spikes to 25%, 50%, or 100%.

I tried to run both with as many of the same settings as possible. A larger batch size seems to result in lower GPU usage for the HF ViT.

I think this might be because the image data loader (from this tutorial) does not load images as fast as the native Tensorflow ones, so the GPU cannot be fully utilized.
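One way to check this hypothesis would be to time each data loader on its own, with no model in the loop. Below is a minimal sketch, assuming the dataset can be iterated as a plain Python iterable of batches. The name `measure_throughput` and its parameters are hypothetical helpers for illustration, not part of Transformers or Tensorflow.

```python
import itertools
import time


def measure_throughput(batches, num_batches=50, warmup=5):
    """Return how many batches per second `batches` yields, with no model attached.

    `batches` is any iterable of batches (hypothetical stand-in for the real dataset).
    """
    iterator = iter(batches)
    # Consume a few batches first so one-time setup cost is not counted.
    for _ in itertools.islice(iterator, warmup):
        pass
    start = time.perf_counter()
    # Pull up to `num_batches` batches and count how many were actually produced.
    count = sum(1 for _ in itertools.islice(iterator, num_batches))
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


# Usage (assumed names): compare the two pipelines under the same batch size.
# old_rate = measure_throughput(old_train_ds)
# hf_rate = measure_throughput(hf_train_ds)
```

If the HF pipeline yields far fewer batches per second than the old one, the loader rather than the model is the likely bottleneck.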

Any help would be much appreciated.