I have a question related to fine-tuning:
Is fine-tuning a pre-trained transformer model an ‘easier’ task than training a transformer from scratch (e.g. BERT, GPT-2) in terms of the number of GPUs needed and GPU memory usage?
To clarify further: I’ve read that training most transformer models requires multi-GPU training. However, is it possible to fine-tune some of these models on a single GPU?
If so, why is this the case? Is it because the computational graph during backpropagation is smaller when many of the parameters are frozen?
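
To make the question concrete, here is a rough back-of-the-envelope sketch of the memory I have in mind. It rests on several assumptions of mine: fp32 values (4 bytes each), an Adam-style optimizer that keeps two extra values per trainable parameter, and roughly 110M parameters (the commonly quoted size of BERT-base). The 1M trainable-parameter figure for the frozen case is purely hypothetical, and activation memory is ignored entirely, so this is only part of the picture.

```python
BYTES_PER_FP32 = 4  # assumption: everything is stored in 32-bit floats


def training_memory_bytes(total_params: int, trainable_params: int) -> int:
    """Rough estimate of parameter-related training memory.

    Ignores activations, buffers, and framework overhead, which can
    dominate in practice depending on batch size and sequence length.
    """
    weights = total_params * BYTES_PER_FP32          # all weights must be resident
    grads = trainable_params * BYTES_PER_FP32        # gradients only for trainable params
    adam_states = 2 * trainable_params * BYTES_PER_FP32  # two moment estimates per trainable param
    return weights + grads + adam_states


total = 110_000_000                 # approx. BERT-base parameter count
hypothetical_trainable = 1_000_000  # hypothetical: only a small head is unfrozen

full = training_memory_bytes(total, total)
frozen = training_memory_bytes(total, hypothetical_trainable)

print(f"all parameters trainable: {full / 1e9:.2f} GB")
print(f"mostly frozen:            {frozen / 1e9:.2f} GB")
```

If this reasoning is roughly right, freezing would mainly save gradient and optimizer-state memory, but I am not sure how much that matters compared with activations.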