Training from scratch: model choice and efficiency on 1 GPU

Hi,
I need to train a Transformer model from scratch, and the limitation is that I only have 1 GPU.

  • Which Transformer model is the most efficient to train on 1 GPU, with a good trade-off between accuracy and training speed?
    @sgugger is there a comparison available of training time vs. GLUE score for the different models that can be trained from scratch with the Hugging Face library?

For example:

  • ELECTRA-base (paper: https://arxiv.org/abs/2003.10555) is shown on page 2 to reach a higher GLUE score than BERT-base at the same pre-train FLOPs. Put another way, one could create an ELECTRA model (somewhere between ELECTRA-small and ELECTRA-base) that reaches the same GLUE score as BERT-base with fewer pre-train FLOPs, which optimizes pre-training time and resources. However, training ELECTRA from scratch is not supported in the Hugging Face library, so I would have to fall back on a standard setup like the sketch after this list.
  • FNet (paper: https://arxiv.org/abs/2105.03824) is shown to achieve 92% of BERT-base's accuracy on the GLUE benchmark while training 80% faster on GPUs (page 9).
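
For reference, this is the kind of single-GPU from-scratch setup I can already run with the Trainer API (a minimal sketch on my part; the small BERT config, the wikitext dataset, and the hyperparameters are placeholder choices, not tuned values):

```python
# Minimal sketch: pre-training a small masked-LM from scratch on 1 GPU.
# Model size, dataset, and hyperparameters are placeholders, not a recipe.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Reuse an existing vocabulary; only the model weights start from scratch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A deliberately small config so pre-training fits one GPU in reasonable time.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = BertForMaskedLM(config)  # randomly initialized: training from scratch

raw = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mlm-from-scratch",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,  # simulate a larger batch on a single GPU
    fp16=True,                      # mixed precision to speed up training
    num_train_epochs=1,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```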

Any suggestions on which model to choose, and on the best configuration (e.g., DeepSpeed, sketched below), to reach a BERT-base-level GLUE score with much less training time on 1 GPU?
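
To make the DeepSpeed part of the question concrete, here is a minimal sketch of the kind of ZeRO config I had in mind, passed straight to TrainingArguments (the stage and CPU-offload choices are assumptions, not benchmarked settings):

```python
# Minimal sketch: a DeepSpeed ZeRO config passed through TrainingArguments.
# Stage 2 with optimizer offload is a guess at what helps most on 1 GPU.
from transformers import TrainingArguments

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",  # "auto" lets HF fill in values
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # frees GPU memory for bigger batches
    },
}

args = TrainingArguments(
    output_dir="mlm-from-scratch-ds",
    per_device_train_batch_size=32,
    fp16=True,
    deepspeed=ds_config,  # a dict or a path to a JSON file both work
)
# Typically launched with: deepspeed --num_gpus=1 train.py
```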
Thanks a lot!
Riccardo