Difference in memory efficiency in HF and fairseq

Zhylkaaa · October 23, 2020, 6:13pm

Hello, I’ve been reading the mBART paper (https://arxiv.org/pdf/2001.08210.pdf) and came across Section 2.2 (Optimization), where the authors claim a total batch size of 128K tokens per 32GB GPU. I got my hands on one of those GPUs, but I only managed to fit about 16K tokens (or 32K if they count generator tokens too). I had max_seq_len of 512, batch_size of 4 and grad_acc of 8, but that is still at least 4 times less. I am using fp16.

So, my question is: what is the difference between the HF optimization and the fairseq optimization? Or what is the difference between the fairseq model and the HF model?
Thanks

Actually, I have one more question that came up while writing this: why are there 1024 pos_embeddings when the paper's authors write about pre-training with 512? Are they randomly initialised, or is it something different?

P.S. I’ve been using facebook/mbart-large-cc25

Zhylkaaa · October 27, 2020, 3:44am

@patrickvonplaten maybe you can help me understand this.
Thanks!

patrickvonplaten · November 2, 2020, 1:08pm

I think @sshleifer and @valhalla are better equipped to answer your question

sshleifer · November 3, 2020, 7:10pm

@Zhylkaaa That’s a good question; I don’t know the answer fully. There are a lot of discrepancies between the paper and the fairseq code. You could try the linked command and see how large a batch you can fit with it. If it’s different, you can ask on fairseq.

Otherwise, could you just do grad_acc=32?
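To make the arithmetic behind that suggestion concrete, here is a rough sketch of how the token counts in this thread add up. It assumes every sequence is padded to max_seq_len (an upper bound), that "128K" means 128 * 1024, and that the paper's figure may or may not include decoder-side tokens; the function names are made up for illustration and are not part of transformers or fairseq.

```python
def tokens_per_update(max_seq_len, batch_size, grad_acc, n_gpus=1, count_decoder=False):
    """Upper bound on tokens that contribute to one optimizer step.

    Assumes every sequence is padded to max_seq_len. If count_decoder is
    True, decoder-side tokens are counted too (assumed same length).
    """
    tokens = max_seq_len * batch_size * grad_acc * n_gpus
    return tokens * 2 if count_decoder else tokens


def grad_acc_for_target(target_tokens, max_seq_len, batch_size, n_gpus=1, count_decoder=False):
    """Smallest grad_acc that reaches at least target_tokens per optimizer step."""
    per_micro_batch = tokens_per_update(max_seq_len, batch_size, 1, n_gpus, count_decoder)
    return -(-target_tokens // per_micro_batch)  # ceiling division


TARGET = 128 * 1024  # assumption: "128K" read as 131072 tokens

# Settings from the original post: max_seq_len=512, batch_size=4, grad_acc=8
encoder_only = tokens_per_update(512, 4, 8)
both_sides = tokens_per_update(512, 4, 8, count_decoder=True)

# grad_acc needed to reach the target under each counting convention
needed_encoder_only = grad_acc_for_target(TARGET, 512, 4)
needed_both_sides = grad_acc_for_target(TARGET, 512, 4, count_decoder=True)

print(encoder_only, both_sides, needed_encoder_only, needed_both_sides)
```

Under these assumptions, grad_acc=32 reaches 128K tokens per update only if decoder tokens are counted as well; counting encoder tokens alone, it would take grad_acc=64. Note that gradient accumulation raises the effective batch size without raising peak memory, so it does not explain a per-GPU memory gap by itself.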

why there are 1024 pos_embeddings, when paper authors write about pre-training 512? are they randomly initialised or is it something different?

This command has --max_tokens=1024; 128 or 64 work better in my experience.

The state dict for mbart had 1024 trained positional embeddings, so we ported all of them.
