ELECTRA training reimplementation and discussion

This is no easy feat. I know it firsthand, as I am doing something similar with BERT pre-training from scratch. Is there any reason why you didn’t use the HF Trainer?