Hello, I’ve been reading the mBART paper (https://arxiv.org/pdf/2001.08210.pdf) and came across Section 2.2 (Optimization), where the authors claim a total batch size of 128K tokens per 32GB GPU. I got my hands on one of those, but I only managed to fit about 16K tokens (or 32K if they count generator tokens too): max_seq_len of 512, batch_size of 4, and grad_acc of 8. That’s still at least 4 times less. I am using fp16.
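For reference, here is the arithmetic behind my numbers (just a quick sketch; the variable names are my own labels for the settings above, not anything from the paper):

```python
# Back-of-the-envelope check of effective tokens per optimizer step
# with my settings (these are my values, not the paper's).
max_seq_len = 512  # max source length per example
batch_size = 4     # examples per forward pass
grad_acc = 8       # gradient accumulation steps

tokens_per_step = max_seq_len * batch_size * grad_acc
print(tokens_per_step)      # 16384 -> ~16K source tokens
print(tokens_per_step * 2)  # 32768 -> ~32K if target tokens count too

# Even counting both sides, that's 4x short of the paper's 128K:
print((128 * 1024) // (tokens_per_step * 2))  # 4
```

So even in the most generous counting I'm a factor of 4 below what the paper reports on the same hardware.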
So, my question is: what is the difference between the HF optimization and the fairseq optimization? Or, what is the difference between the fairseq model and the HF model?
Actually, one more question came up while writing this: why are there 1024 position embeddings, when the paper’s authors describe pre-training with 512? Are the extra ones randomly initialized, or is it something different?
P.S. I’ve been using facebook/mbart-large-cc25.