Hey @zeyuyun1,
Yes, I am aware of the NVIDIA repo; however, I haven't used their scripts. I would like to train BERT from scratch using the HF library: the Trainer class, the HF datasets project, and helper classes like DataCollatorForNextSentencePrediction. The NVIDIA scripts are excellent but noisy, with lots of engineering details mixed in with the BERT specifics. Those engineering details should be hidden, and using the classes and projects above is a step in the right direction toward minimizing them.
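For context, here is a minimal sketch of how I picture those pieces fitting together. It assumes an older transformers release (3.x / early 4.x) that still ships TextDatasetForNextSentencePrediction and DataCollatorForNextSentencePrediction (both were later deprecated, and the exact signatures varied across versions); the corpus path and hyperparameters are placeholders, not values from my runs:

```python
# Hedged sketch: BERT pretraining (MLM + NSP) with HF building blocks.
# Assumes transformers 3.x / early 4.x; signatures may differ by version.
from transformers import (
    BertConfig,
    BertForPreTraining,
    BertTokenizerFast,
    DataCollatorForNextSentencePrediction,
    TextDatasetForNextSentencePrediction,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# A BERT-mini sized config (4 layers, hidden size 256) -- small enough
# to sanity-check the pipeline on a single GPU before scaling up.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = BertForPreTraining(config)  # MLM + NSP heads

# "corpus.txt" is a placeholder: one sentence per line,
# with a blank line separating documents.
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer, file_path="corpus.txt", block_size=128
)
collator = DataCollatorForNextSentencePrediction(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="bert-mini-pretrain",
    per_device_train_batch_size=64,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```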
And yes, you are right; they use FusedLamb from the apex optimizers package. I was able to integrate FusedLamb as well. I am currently tuning multi-node, multi-GPU distributed training, and once I am done I'll share the script. But yes, so far on a single instance I can train BERT-tiny or BERT-mini without any major issues.
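The FusedLamb integration is straightforward, since Trainer accepts a custom optimizer/scheduler pair. A rough sketch (the class is spelled FusedLAMB in apex; `model`, `dataset`, and `collator` are the objects from the sketch above, and the hyperparameters are illustrative, not my tuned values):

```python
# Hedged sketch: plug apex's FusedLAMB into the HF Trainer via its
# `optimizers=(optimizer, lr_scheduler)` argument. Requires NVIDIA apex
# built with the CUDA extensions installed.
from apex.optimizers import FusedLAMB
from transformers import Trainer, TrainingArguments, get_linear_schedule_with_warmup

training_args = TrainingArguments(
    output_dir="bert-mini-lamb",
    per_device_train_batch_size=64,
    max_steps=100_000,
    warmup_steps=1_000,
)

optimizer = FusedLAMB(model.parameters(), lr=4e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=training_args.warmup_steps,
    num_training_steps=training_args.max_steps,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=dataset,
    optimizers=(optimizer, scheduler),  # overrides Trainer's default AdamW
)
trainer.train()
```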
Hope this answers some of your questions. I'll share the scripts I am working on once I have them training BERT-base in a multi-node, multi-GPU distributed training setup.
Cheers,
Vladimir