Pre-training with the LAMB optimizer

Hello everyone,

Has anyone experimented with the LAMB optimizer in HF? I tried using it, but I was only marginally able to increase the batch size, and the training loss curve was rather flat. If you've used LAMB, would you please share some tips? How did you initialize it? I am not sure what to use in the optimizer_grouped_parameters list of dictionaries that wraps the model parameters. Also, I've seen some other people use a different LR scheduler with LAMB.
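For context, here is roughly what I tried for the grouped parameters: the usual AdamW-style split that exempts biases and LayerNorm weights from weight decay. The toy nn.Sequential model below is just a stand-in for BERT so the snippet runs; I don't know whether LAMB actually wants a different split, which is part of my question.

```python
import torch.nn as nn

# Toy model as a stand-in for BERT, just so the snippet is runnable.
model = nn.Sequential(nn.Linear(4, 4), nn.LayerNorm(4))

# Usual AdamW-style split: no weight decay on biases or LayerNorm weights.
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters()
                   if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,
    },
    {
        "params": [p for n, p in model.named_parameters()
                   if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
```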

Thanks in advance.

Hi vblagoje,

I am new to transformers. I have been playing with the Hugging Face models for several months, and I am thinking of making some small changes to the BERT model and pretraining it from scratch. I saw you discussing the pretraining process on another post several days ago. I was wondering if you know about the pretraining repository made by NVIDIA?

I think they implemented the LAMB optimizer and the NSP objective, and wrote code to better utilize multiple GPUs during distributed training. I haven't used it yet because I am having some trouble installing Docker on the remote machine I am working on. I was just wondering if you have already seen this repository or tried it, or if you have any advice on pretraining BERT from scratch?

Hey @zeyuyun1,

Yes, I am aware of the NVIDIA repo; however, I haven't used their scripts. I would like to use the HF library to train BERT from scratch, using the HF Trainer class, the HF datasets project, and helper classes like DataCollatorForNextSentencePrediction. The NVIDIA scripts are excellent but noisy, with lots of engineering details explicitly mixed in with the BERT specifics. These engineering details should be hidden; using the classes and projects above is a step in the right direction toward minimizing them.

And yes, you are right; they use FusedLamb from the apex optimizers package. I was able to integrate FusedLamb as well. I am currently tuning multi-node, multi-GPU distributed training, and once I am done, I'll share the script. But yes, so far on a single instance I can train BERT-tiny or BERT-mini without any major issues.
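In case it helps, the integration itself mostly comes down to constructing the optimizer yourself and handing it to Trainer via its `optimizers` argument. A rough sketch (the toy nn.Linear is a placeholder for the real model, and AdamW is only a fallback so the snippet runs without apex installed):

```python
import torch
import torch.nn as nn

try:
    # The optimizer I actually use, from NVIDIA apex.
    from apex.optimizers import FusedLamb as OptimCls
except ImportError:
    # Fallback stand-in so this sketch runs without apex installed.
    from torch.optim import AdamW as OptimCls

model = nn.Linear(8, 2)  # placeholder for the real BERT model
optimizer = OptimCls(model.parameters(), lr=1e-3, weight_decay=0.01)

# With HF Trainer, the optimizer (and an LR scheduler) can then be passed in:
# trainer = Trainer(model=model, args=training_args, train_dataset=dataset,
#                   optimizers=(optimizer, lr_scheduler))
```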

Hope this answers some of your questions. I'll share the scripts I am working on once I have them training BERT-base in a multi-node, multi-GPU distributed training setup.




Thank you so much! I’ll look into the training process using HF Trainer too.


I have tried the same repo, and the same thing happened: the loss curve went flat after a few iterations. Were you able to get your hands on any other implementations?

Hey guys, I am using apex.optimizers' FusedLamb and it's working well. I'll publish my work in about a week or two. I can now train BERT-mini on a Lambda Labs 8x Tesla V100 single machine in about 3 hours and 40 minutes. The above-mentioned NVIDIA training trains the same model in about 2 hours and 30 minutes. My goal right now is to match the performance of the equivalent Google/NVIDIA pre-trained models on various LM benchmarks (GLUE etc.), and then I'll focus on closing the training speed gap.
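If anyone wants to sanity-check what FusedLamb is doing against a non-fused reference, the LAMB update itself is simple enough to sketch in plain PyTorch. This is my own reading of the algorithm (Adam-style moments plus a per-layer trust ratio), not apex's implementation, and it skips details like bias correction and gradient clipping:

```python
import torch


class LambSketch(torch.optim.Optimizer):
    """Bare-bones LAMB sketch for sanity-checking only, not apex's FusedLamb."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
                 weight_decay=0.01):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                m, v = state["exp_avg"], state["exp_avg_sq"]
                # Adam-style first and second moment estimates.
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                # Update direction, with decoupled weight decay folded in.
                update = m / (v.sqrt() + group["eps"])
                update = update + group["weight_decay"] * p
                # Trust ratio: scale the step by ||w|| / ||update|| per layer.
                w_norm = p.norm()
                u_norm = update.norm()
                if w_norm > 0 and u_norm > 0:
                    trust = (w_norm / u_norm).item()
                else:
                    trust = 1.0
                p.add_(update, alpha=-group["lr"] * trust)
```

This is only meant for comparing a few steps on a toy model; the fused apex kernels are what give the real training speed.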


Hi Vladimir,

Would you mind sharing your training code? I still haven't figured out how to use FusedLamb.

Hey there, I'll share all the details in a week or so. Until I really wrap this up, note that I used this script to create sharded datasets for BERT training. After dataset preparation, I used this script to train BERT. There are still a few small bugs to iron out, but it works quite well. I can train BERT-base in about 8-9 hours on an 8-GPU machine using PyTorch distributed training.