BigBird-RoBERTa batch size

Hello, I’m wondering what batch size I should expect to fit with this model. When I run BigBird-RoBERTa with fp16, a sequence length of 2048, and sharded DDP, I can only fit a batch size of 1 per GPU. I’m running on 8 32GB GPUs. My train set isn’t that large (about 200k documents), but could the problem be that the whole train set is being loaded onto the GPU at once? According to the paper, the authors were able to pretrain with a significantly larger batch size. Is there anything I can do beyond my current setup? Thanks for your help.
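
For reference, here is roughly what my setup looks like. This is a simplified sketch using the Hugging Face `Trainer` with `sharded_ddp="simple"`; the masked-LM head, the toy dataset, and the output directory name are just placeholders standing in for my real data pipeline:

```python
# Rough sketch of my setup (the toy corpus stands in for my real ~200k-document train set)
from datasets import Dataset
from transformers import (
    BigBirdForMaskedLM,
    BigBirdTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForMaskedLM.from_pretrained("google/bigbird-roberta-base")

# Placeholder corpus; my real documents are tokenized the same way to length 2048
raw = Dataset.from_dict({"text": ["some long document text ..."] * 32})
train_dataset = raw.map(
    lambda ex: tokenizer(
        ex["text"], truncation=True, padding="max_length", max_length=2048
    ),
    batched=True,
    remove_columns=["text"],
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

training_args = TrainingArguments(
    output_dir="bigbird-mlm",
    fp16=True,                      # mixed precision
    per_device_train_batch_size=1,  # anything larger runs out of memory for me
    sharded_ddp="simple",           # fairscale sharded DDP
    max_steps=1_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```

I launch it across the 8 GPUs with something like `python -m torch.distributed.launch --nproc_per_node=8 train.py`.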