Does "ViT-B/16" use batch or group normalization?

Hello, I’m trying to train ViT-B/16, but the batch size I need exceeds my GPU memory, so I’m using gradient accumulation. However, according to this Artificial Intelligence Stack Exchange post (“What is the relationship between gradient accumulation and batch size?”), gradient accumulation may not be compatible with batch normalization. As a beginner with transformers, I’d like to know whether ViT-B/16 uses batch normalization or group normalization.
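For context, here is a minimal sketch of the gradient accumulation loop I have in mind (the tiny linear model and random data are just placeholders, not my actual ViT-B/16 setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and data; my real setup uses ViT-B/16, this only shows the loop.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4  # effective batch size = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(8, 10)            # micro-batch of 8 samples
    y = torch.randint(0, 2, (8,))
    # Divide by accum_steps so the summed gradients match the mean over the
    # larger effective batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one update per accum_steps micro-batches
        optimizer.zero_grad()
```

My understanding of the concern in the linked post is that batch-norm statistics would be computed per micro-batch rather than per effective batch, and this loop does nothing to fix that, which is why the normalization type matters to me.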