Train from scratch: models and efficiency with 1 GPU

I need to train a Transformer model from scratch, and I only have 1 GPU available.

  • Which Transformer model is the most efficient to train on 1 GPU, with a good trade-off between accuracy and training speed?
    @sgugger, is there a comparison of training time vs. GLUE score for the different models supported in the Hugging Face library (when trained from scratch)?

For example:

  • ELECTRA-base (paper here) is shown on page 2 to reach a higher GLUE score than BERT-base for the same pre-training FLOPs. From another point of view, one could build an ELECTRA model (somewhere between ELECTRA-small and ELECTRA-base) that reaches the same GLUE score as BERT-base with fewer pre-training FLOPs, which would reduce pre-training time and resources. However, training ELECTRA from scratch does not seem to be supported in the Hugging Face library.
  • FNet (paper here) is shown to achieve 92% of BERT-base accuracy on the GLUE benchmark while training 80% faster on GPUs (page 9).
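To make the "somewhere between ELECTRA-small and ELECTRA-base" idea more concrete, here is a rough sketch of how I would compare candidate sizes before committing GPU time. It counts the parameters of a BERT-style encoder and applies the common "6 x parameters x tokens" rule of thumb for training FLOPs. Everything here is an assumption for illustration: the `EncoderConfig` class, the `MID` configuration, and the token budget are hypothetical, the small/base dimensions are what I understand the published configurations to be, and the FLOPs formula is only a coarse approximation that ignores ELECTRA's generator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for BERT-style encoder dimensions."""
    hidden_size: int
    num_layers: int
    intermediate_size: int
    embedding_size: int
    vocab_size: int = 30522
    max_positions: int = 512
    type_vocab_size: int = 2


def count_encoder_params(cfg: EncoderConfig) -> int:
    """Count weights and biases of a BERT-style encoder (no pooler, no LM head)."""
    e, h, i = cfg.embedding_size, cfg.hidden_size, cfg.intermediate_size
    # Token, position and segment embeddings, plus one LayerNorm (scale and bias).
    embeddings = (cfg.vocab_size + cfg.max_positions + cfg.type_vocab_size) * e + 2 * e
    # ELECTRA-small style projection when the embedding size differs from the hidden size.
    projection = (e * h + h) if e != h else 0
    # Q, K, V and output projections, plus the LayerNorm after attention.
    attention = 4 * (h * h + h) + 2 * h
    # Two dense layers of the feed-forward block, plus its LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + 2 * h
    return embeddings + projection + cfg.num_layers * (attention + feed_forward)


def approx_train_flops(num_params: int, num_tokens: int) -> int:
    """Rule-of-thumb estimate (assumption): about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens


SMALL = EncoderConfig(hidden_size=256, num_layers=12, intermediate_size=1024, embedding_size=128)
MID = EncoderConfig(hidden_size=512, num_layers=12, intermediate_size=2048, embedding_size=512)  # hypothetical
BASE = EncoderConfig(hidden_size=768, num_layers=12, intermediate_size=3072, embedding_size=768)

TOKEN_BUDGET = 10_000_000_000  # hypothetical number of training tokens

for name, cfg in [("small", SMALL), ("mid", MID), ("base", BASE)]:
    params = count_encoder_params(cfg)
    flops = approx_train_flops(params, TOKEN_BUDGET)
    print(f"{name}: {params / 1e6:.1f}M params, ~{flops:.2e} training FLOPs")
```

The absolute numbers should not be trusted too much, but the ratios between configurations give a first idea of how much pre-training compute an intermediate model would save compared with a base-sized one.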

Any suggestions on which model to choose, and which configuration (e.g. with DeepSpeed), to reach a BERT-base-level GLUE score with much less training time on 1 GPU?
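For context, this is the kind of single-GPU DeepSpeed setup I have in mind, written as a minimal sketch. The batch size and accumulation values are hypothetical placeholders, and I am assuming that mixed precision plus ZeRO stage 2 with optimizer offload to CPU is a reasonable starting point rather than the best configuration.

```python
import json

# Hypothetical values: tune to whatever fits in the GPU memory.
MICRO_BATCH_SIZE = 32   # sequences per forward/backward pass
GRAD_ACCUM_STEPS = 8    # passes accumulated before each optimizer step

# DeepSpeed configuration as a plain dictionary.
ds_config = {
    "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    # Mixed precision to reduce memory use and speed up training.
    "fp16": {"enabled": True},
    # ZeRO stage 2 with optimizer states offloaded to CPU memory.
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}


def effective_batch_size(config: dict, num_gpus: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * num_gpus
    )


print("effective batch size:", effective_batch_size(ds_config))
print(json.dumps(ds_config, indent=2))
```

The printed JSON could be saved to a file and passed to the Trainer through the `deepspeed` argument of `TrainingArguments`. Gradient accumulation is what lets 1 GPU approximate the large batch sizes used in the papers, at the cost of more passes per optimizer step.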
Thanks a lot!