Difference in memory efficiency between HF and fairseq

Hello, I’ve been reading the mBART paper (https://arxiv.org/pdf/2001.08210.pdf) and came across Section 2.2 (Optimization), where the authors report a total batch size of 128K tokens per 32GB GPU. I got my hands on one of those GPUs, but I only managed to fit about 16K tokens (or 32K if they count generator tokens too), with max_seq_len of 512, batch_size of 4, and grad_acc of 8 — still at least 4 times less. I am using fp16.
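Just to show how I’m counting, here is my back-of-the-envelope arithmetic (a sketch; the variable names are my own labels, not script arguments):

```python
# Rough token count for my fine-tuning setup, using the numbers above.
max_seq_len = 512      # tokens per sequence
batch_size = 4         # sequences per forward/backward pass
grad_acc = 8           # gradient accumulation steps

tokens_per_step = max_seq_len * batch_size        # 2,048 tokens in memory at once
tokens_per_update = tokens_per_step * grad_acc    # 16,384 tokens per optimizer update

print(tokens_per_step, tokens_per_update)         # 2048 16384 -- far from the paper's 128K
```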

So, my question is: what is the difference between HF optimization and fairseq optimization? Or what is the difference between the fairseq model and the HF model?
Thanks :hugs:

Actually, I have one more question while writing this: why are there 1024 pos_embeddings when the paper’s authors write about pre-training with 512? Are they randomly initialised, or is it something different?

P.S. I’ve been using facebook/mbart-large-cc25
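For reference, this is roughly how I checked the positional embedding size (a sketch; the class and attribute paths may differ across transformers versions):

```python
# Inspect the positional embeddings of the released checkpoint.
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
print(model.config.max_position_embeddings)              # 1024
print(model.model.encoder.embed_positions.weight.shape)  # learned positional embedding table
```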

@patrickvonplaten maybe you can help me understand this.
Thanks!

I think @sshleifer and @valhalla are better equipped to answer your question :slight_smile:

@Zhylkaaa That’s a good question; I don’t fully know the answer. There are a lot of discrepancies between the paper and the fairseq code. You could try the linked command and see how large a batch you can fit with that. If it’s different, you can ask on fairseq.

Otherwise, could you just do grad_acc=32?
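For example, something along these lines (just a sketch using TrainingArguments names; adapt it to whatever script you’re running):

```python
# Same per-device batch, more accumulation steps to get closer to the paper's token budget.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="mbart-finetune",        # hypothetical output path
    per_device_train_batch_size=4,      # 4 sequences x 512 tokens = 2,048 tokens per step
    gradient_accumulation_steps=32,     # 32 x 2,048 = 65,536 tokens per optimizer update
    fp16=True,
)
```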

why are there 1024 pos_embeddings when the paper’s authors write about pre-training with 512? Are they randomly initialised, or is it something different?

This command has --max_tokens=1024; 128 or 64 work better in my experience.

The state dict for mBART had 1024 trained positional embeddings, so we ported all of them.