Yes, I am aware of the NVidia repo, however, I haven’t used their scripts. I would like to use the HF library to train BERT from scratch using HF Trainer class, HF datasets project, and helper classes like
DataCollatorForNextSentencePrediction. NVidia scripts are excellent but noisy, with lots of engineering details explicitly mixed with the BERT specifics. These engineering details should be hidden; using the above classes and projects is a step in the right direction to minimize the engineering details.
And yes you are right; they use FusedLamb from apex optimizers package. I was able to integrate FusedLamb as well. I am currently tuning the multi-node multi-GPU distributed training and once I am done, I’ll share the script. But yes, so far on a single instance I can train BERT tiny or BERT mini without any major issues.
Hope this answers some of your questions. I’ll share the scripts I am working on once I have them training BERT base on multi-node multi-GPU distributed training setup.