Does "ViT-B/16" use batch or group normalization?

Hello, I’m trying to train ViT-B/16, but the batch size I need exceeds my GPU memory, so I’m using gradient accumulation. However, according to this Artificial Intelligence Stack Exchange post (“What is the relationship between gradient accumulation and batch size?”), gradient accumulation may not be compatible with batch normalization. As a beginner with transformers, I’d like to know whether ViT-B/16 uses batch normalization or group normalization.
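For context, here is a minimal sketch of the gradient accumulation loop I have in mind (the tiny linear model and random data are just placeholders, not my actual ViT-B/16 setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and data; my real setup uses ViT-B/16, this only shows the loop.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4  # effective batch size = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(8, 10)            # micro-batch of 8 samples
    y = torch.randint(0, 2, (8,))
    # Divide by accum_steps so the summed gradients match the mean over the
    # larger effective batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one update per accum_steps micro-batches
        optimizer.zero_grad()
```

My understanding of the concern in the linked post is that batch-norm statistics would be computed per micro-batch rather than per effective batch, and this loop does nothing to fix that, which is why the normalization type matters to me.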