I have a dataset with about 15,000 documents, containing about 7GB of uncompressed text, which I’m hoping to use for fine-tuning the Megatron-11b model, but before I get started, I’d like to get a rough estimate of the cost.
If I train in the cloud, I’ll probably use AWS P3 or G4 instances. Roughly how many instance-hours should I expect to need to complete the fine-tuning?
At a certain price point, I might just decide to buy my own hardware and do the training on-prem… What are the minimum hardware specs I’d need to perform this task without running out of memory? The pre-trained Megatron-11b model is 19GB of data… So does that mean I could perform model fine-tuning with only a single 24GB graphics card? Or would I need more than that?
Even though loading the whole model is possible on a single 24GB graphics card, fine-tuning also requires storing gradients, optimizer states, and activations. Even one batch probably won’t fit on the GPU. And even if you managed to make it fit with tricks like gradient checkpointing, it would be prohibitively slow, since you’d be processing only one batch at a time. To finish in a reasonable time you’d definitely need more cards, plus model parallelism, which in turn requires good inter-GPU communication. I’ve run experiments on smaller-scale fine-tuning without that level of parallelism, so I’m not sure how well they transfer here; but from my experience, a single pass over 7GB should take on the order of a week.
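To see why the weights fitting on one card isn’t enough, here is a rough back-of-envelope memory estimate. The assumptions are mine, not from the post: fp16 weights and gradients, plus fp32 Adam optimizer states (master weights, momentum, variance), which is the usual mixed-precision setup; activations are excluded entirely, so this is a lower bound.

```python
# Back-of-envelope GPU memory estimate for fine-tuning an ~11B-parameter
# model with Adam in mixed precision. Activation memory is NOT included.
PARAMS = 11e9  # approximate Megatron-11b parameter count

weights_gb   = PARAMS * 2 / 1e9            # fp16 weights: 2 bytes/param
gradients_gb = PARAMS * 2 / 1e9            # fp16 gradients: 2 bytes/param
optimizer_gb = PARAMS * (4 + 4 + 4) / 1e9  # fp32 master copy + Adam m and v

total_gb = weights_gb + gradients_gb + optimizer_gb
print(f"weights:   {weights_gb:.0f} GB")
print(f"gradients: {gradients_gb:.0f} GB")
print(f"optimizer: {optimizer_gb:.0f} GB")
print(f"total (before activations): {total_gb:.0f} GB")
```

Under these assumptions the total is well over 150GB before a single activation is stored, which is why a lone 24GB card is out of the question for standard fine-tuning.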
If you want a more precise estimate, it is quite cheap to rent an AWS instance (P3 rather than G4; V100s are just much faster) for 10 minutes, measure how long it takes to go through 0.1% of the dataset, and extrapolate. I would be very surprised if buying GPUs turned out to be better than renting, unless you plan on doing this kind of large-scale training recurrently.
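The extrapolation itself is trivial arithmetic; a sketch, where the sample time, hourly rate, and epoch count are hypothetical placeholders you would replace with your own measurements:

```python
# Extrapolate full fine-tuning time and cost from a short timed run
# on a small fraction of the dataset. All inputs below are examples.

def estimate_cost(sample_seconds: float, sample_fraction: float,
                  hourly_rate: float, epochs: int = 1) -> tuple[float, float]:
    """Return (instance-hours, dollar cost) for `epochs` full passes,
    scaled up from a run over `sample_fraction` of the data."""
    hours = sample_seconds / sample_fraction / 3600 * epochs
    return hours, hours * hourly_rate

# e.g. if 0.1% of the 7GB corpus took 300 seconds, at a hypothetical
# $12/hour for the instance, a single epoch would be roughly:
hours, cost = estimate_cost(sample_seconds=300, sample_fraction=0.001,
                            hourly_rate=12.0)
print(f"~{hours:.0f} instance-hours, ~${cost:.0f}")
```

With numbers like that in hand, the rent-vs-buy question becomes a direct comparison against the sticker price of an equivalent multi-GPU box.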