Pre-training with Lamb optimizer

Hello everyone,

Has anyone experimented with the LAMB optimizer in HF? I tried using https://github.com/cybertronai/pytorch-lamb, but I was only marginally able to increase the batch size, and the training loss curve was rather flat. If you've used LAMB, would you please share some tips? How did you initialize it? I am not sure what to put in the optimizer_grouped_parameters list of dictionaries that wraps the model parameters. Also, I've seen other people use a different LR scheduler with LAMB.
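For concreteness, the kind of setup I mean is roughly the one below (a rough, untested sketch on my end; the grouping and hyperparameter values are just guesses):

```python
# Rough, untested sketch of what I mean; hyperparameter values are placeholders.
from pytorch_lamb import Lamb                      # from the repo linked above
from transformers import BertConfig, BertForPreTraining

model = BertForPreTraining(BertConfig())           # any nn.Module works the same way

# Standard BERT-style split: no weight decay for biases and LayerNorm weights.
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters()
                   if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,
    },
    {
        "params": [p for n, p in model.named_parameters()
                   if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = Lamb(optimizer_grouped_parameters, lr=1e-3, betas=(0.9, 0.999), eps=1e-6)
```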

Thanks in advance.

Hi vblagoje,

I am new to transformers. I have been playing with Hugging Face models for several months, and I am thinking of making some small changes to the BERT model and pre-training it from scratch. I saw you discussing the pre-training process in another post several days ago. I was wondering if you know about the pre-training repository made by NVIDIA?

I think they implemented the LAMB optimizer and the NSP objective, and wrote code to better utilize multiple GPUs during distributed training. I haven't used it yet because I am having some trouble installing Docker on the remote machine I am working on. I was just wondering if you have already seen this repository or tried it, or if you have any advice on pre-training BERT from scratch?

Hey @zeyuyun1,

Yes, I am aware of the NVIDIA repo; however, I haven't used their scripts. I would like to use the HF library to train BERT from scratch, using the HF Trainer class, the HF datasets project, and helper classes like DataCollatorForNextSentencePrediction. The NVIDIA scripts are excellent but noisy, with lots of engineering details mixed in with the BERT specifics. Those engineering details should be hidden; using the above classes and projects is a step in the right direction toward minimizing them.
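Roughly, the shape of what I have in mind is below (just a sketch, not the final script; the corpus path and hyperparameters are placeholders, and the NSP helper classes come from older transformers releases and may not exist in newer ones):

```python
# Sketch only, not the final script. File path and hyperparameters are placeholders.
from transformers import (
    BertConfig,
    BertForPreTraining,
    BertTokenizerFast,
    DataCollatorForNextSentencePrediction,
    TextDatasetForNextSentencePrediction,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining(BertConfig())          # randomly initialized, trained from scratch

# corpus.txt: one sentence per line, documents separated by blank lines.
train_dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="corpus.txt",
    block_size=128,
)
data_collator = DataCollatorForNextSentencePrediction(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="bert-from-scratch",
    per_device_train_batch_size=32,
    max_steps=100_000,
    learning_rate=1e-4,
    warmup_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
```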

And yes, you are right; they use FusedLamb from the apex optimizers package. I was able to integrate FusedLamb as well. I am currently tuning multi-node, multi-GPU distributed training, and once I am done, I'll share the script. But yes, so far on a single instance I can train BERT-tiny or BERT-mini without any major issues.
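The FusedLamb part itself is small. Assuming apex is installed (the class there is spelled FusedLAMB), it roughly boils down to handing the Trainer a pre-built optimizer/scheduler pair instead of its default AdamW; something like this sketch, with placeholder hyperparameters:

```python
# Sketch of plugging FusedLAMB into the Trainer; hyperparameters are placeholders.
from apex.optimizers import FusedLAMB
from transformers import (
    BertConfig,
    BertForPreTraining,
    Trainer,
    TrainingArguments,
    get_polynomial_decay_schedule_with_warmup,
)

model = BertForPreTraining(BertConfig())
max_steps = 100_000

# Same no-weight-decay grouping as with any BERT optimizer.
no_decay = ["bias", "LayerNorm.weight"]
grouped_parameters = [
    {"params": [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]

optimizer = FusedLAMB(grouped_parameters, lr=4e-3)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * max_steps),
    num_training_steps=max_steps,
)

training_args = TrainingArguments(
    output_dir="bert-lamb",
    per_device_train_batch_size=64,
    max_steps=max_steps,
)

# Passing (optimizer, scheduler) makes the Trainer skip its default AdamW/linear schedule.
trainer = Trainer(
    model=model,
    args=training_args,
    optimizers=(optimizer, scheduler),
    # train_dataset / data_collator set up as usual (omitted here for brevity)
)
```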

Hope this answers some of your questions. I'll share the scripts I am working on once I have them training BERT-base on a multi-node, multi-GPU distributed setup.

Cheers,
Vladimir


Thank you so much! I’ll look into the training process using HF Trainer too.


I have tried the same repo and the same thing happened: the loss curve went flat after a few iterations. Were you able to get your hands on any other implementations?

Hey guys, I am using FusedLamb from apex.optimizers and it's working well. I'll publish my work in about a week or two. I can now train bert-mini on a Lambda Labs 8x Tesla V100 single machine in about 3 hours and 40 minutes. The above-mentioned NVIDIA training trains the same model in about 2 hours and 30 minutes. My goal right now is to match the performance of the equivalent Google/NVIDIA pre-trained models on various LM benchmarks (GLUE, etc.), and then I'll focus on closing the training-speed gap.

Best,
Vladimir

Hi Vladimir,

Would you mind sharing your training code? I still haven't figured out how to integrate FusedLamb.

Hey there, I'll share all the details in a week or so. Until I really wrap this up, note that I used this script to create sharded datasets for BERT training. After dataset preparation, I used this script to train BERT. There are still a few small bugs to iron out, but it works quite well. I can train bert-base in about 8-9 hours on an 8-GPU machine using PyTorch distributed training.
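By PyTorch distributed training I just mean the standard torch.distributed data-parallel setup with one process per GPU; in essence it looks like this (a generic sketch, not the actual script):

```python
# Generic sketch of the distributed setup, not the actual training script.
# Launched once per GPU, e.g.: torchrun --nproc_per_node=8 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import BertConfig, BertForPreTraining

local_rank = int(os.environ["LOCAL_RANK"])     # set by the launcher
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")        # one process per GPU

model = BertForPreTraining(BertConfig()).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])    # gradients all-reduced across the 8 GPUs
# ...optimizer, DataLoader with a DistributedSampler, and the training loop go here.
# (The HF Trainer does this wrapping itself when it detects a distributed launch.)
```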

HTH,
Vladimir
