How to pretrain ELECTRA on a custom dataset?

I have a 1GB raw text dataset in a niche domain. I want to train an ELECTRA model, but I couldn’t find any tutorials or examples for doing so. Can anyone help? I tried the simpletransformers package, but it currently has memory issues: after a few epochs my Colab session crashes.

pinging @lysandre

Do you want to fine-tune or pre-train an ELECTRA model? If you want to fine-tune it, you can leverage the examples/run_language_modeling.py script.
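For reference, an invocation of that script might look roughly like the following. This is a sketch, not a verified command: it assumes the flags the legacy run_language_modeling.py example exposed (`--model_type`, `--mlm`, `--train_data_file`, etc.), and the dataset path and output directory are placeholders you would replace with your own.

```shell
# Sketch: fine-tune the small ELECTRA generator with the MLM objective.
# Paths and hyperparameters below are placeholders, not tested values.
python examples/run_language_modeling.py \
    --model_type electra \
    --model_name_or_path google/electra-small-generator \
    --train_data_file path/to/your_corpus.txt \
    --do_train \
    --mlm \
    --line_by_line \
    --output_dir ./electra-mlm-finetuned
```

Note that `--mlm` implies the generator checkpoint is the natural starting point here, since the discriminator was never trained with a masked-LM head.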

If you want to pre-train it, your best bet is to use the original implementation (in TF1) and then convert it to our library using our conversion script which is here.
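As a rough sketch of that workflow, assuming the conversion script shipped with the library (`convert_electra_original_tf_checkpoint_to_pytorch.py`) and placeholder paths for your TF1 checkpoint and config:

```shell
# Sketch: convert an original TF1 ELECTRA checkpoint to the library's
# PyTorch format. All paths below are placeholders for your own files.
python src/transformers/models/electra/convert_electra_original_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path path/to/electra_tf_checkpoint \
    --config_file path/to/config.json \
    --pytorch_dump_path ./electra-pytorch \
    --discriminator_or_generator discriminator
```

Run it once with `discriminator` and once with `generator` if you need both halves of the model.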

Here’s a PR for pre-training ELECTRA with our library, but it isn’t working yet and we don’t currently have the bandwidth to get back to it.

@lysandre I am planning to pretrain the model, but I would also like to give the examples/run_language_modeling.py script a shot. How can I fine-tune the base ELECTRA model on the MLM task with my data? The script is a bit ambiguous.

I’m not sure fine-tuning ELECTRA with MLM is a good idea, since the main idea behind ELECTRA was to train it as a discriminator rather than a generator, precisely to overcome the shortcomings of MLMs.
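To make that distinction concrete, here is a toy sketch of ELECTRA’s replaced-token-detection objective. The `generator_fill` callable is a hypothetical stand-in for the small generator model; the point is that the discriminator receives a label for every position (original vs. replaced), rather than predicting only the masked tokens as in MLM.

```python
import random

def make_rtd_example(tokens, generator_fill, mask_prob=0.15, seed=0):
    """Build a toy replaced-token-detection (RTD) training example.

    MLM trains a generator to predict masked-out tokens; ELECTRA instead
    trains a discriminator to classify EVERY position as original (0) or
    replaced (1). `generator_fill(i, tokens)` is a hypothetical helper
    standing in for the generator: it proposes a replacement for
    position i.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [0] * len(tokens)          # 0 = original, 1 = replaced
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            proposal = generator_fill(i, tokens)
            if proposal != tokens[i]:   # only count true replacements
                corrupted[i] = proposal
                labels[i] = 1
    return corrupted, labels

# Example: a trivial "generator" that always proposes the same token.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = make_rtd_example(tokens, lambda i, t: "dog", mask_prob=0.5)
```

This is why fine-tuning the discriminator checkpoint with an MLM head departs from its pretraining setup, while the generator checkpoint was actually trained with MLM.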

Any thoughts, @lysandre?
