How long does it take to train the Falcon 7B model on an RTX 4090 GPU?

This is my first question here, so sorry if it has been asked before or seems very basic. I'm looking for a simple formula to estimate how long it would take to train the Falcon 7B model (or any other model, such as a 40B model or GPT-3) on a single 4090 GPU.

I know it may be insane to do such a thing, but as an AI student/hobbyist on a tight budget, I'm always interested in knowing which models we could train on single, dual, or quad GPU setups at home.

Is there a simple formula where we can plug in our GPU memory and speed, along with the model's parameter count or size, to get a rough estimate of the training time?

One rule of thumb from my experience (and I've only used MPT models) is that training takes about 12x as much GPU memory as the size of the model. There are some tricks to reduce that, but I've not tried them.

As for time, that's not a great question. You can train a model in an hour, but it's not going to be very good. You keep training the model and testing checkpoints until it's good enough. I have been training for several months on a 3090 GPU (coded as a screen saver, so it only runs when I'm not using the computer). The result is not terrible, but MPT-7B is still better for most of my use cases. I've not used the Falcon 7B model, but I'd hazard a guess that it's probably better, too.
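For a rough sense of scale, a commonly used approximation puts the compute for training a dense transformer at about 6 FLOPs per parameter per training token. The sketch below turns that into wall-clock time. Everything numeric in it is an assumption to replace with your own figures: the token count (Falcon 7B's model card reports roughly 1.5 trillion pretraining tokens, as far as I recall), the peak throughput of the GPU, and the fraction of that peak you actually sustain. It also ignores whether the model fits in memory at all.

```python
# Rough training-time estimate from the "6 * parameters * tokens" rule of thumb.
# All hardware numbers below are assumptions, not measurements.

SECONDS_PER_DAY = 86_400


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total FLOPs for a forward and backward pass over n_tokens."""
    return 6.0 * n_params * n_tokens


def training_days(n_params: float, n_tokens: float,
                  peak_flops: float, utilization: float) -> float:
    """Estimated wall-clock days at a sustained fraction of peak throughput."""
    sustained = peak_flops * utilization  # FLOP/s actually achieved
    return training_flops(n_params, n_tokens) / sustained / SECONDS_PER_DAY


if __name__ == "__main__":
    n_params = 7e9        # 7B parameters
    peak = 165e12         # assumed peak half-precision FLOP/s for one GPU
    util = 0.4            # assumed utilization; real runs are often lower

    # Pretraining-scale run (assumed 1.5 trillion tokens).
    print(f"pretraining: {training_days(n_params, 1.5e12, peak, util):,.0f} days")

    # Small fine-tuning run (hypothetical 100 million tokens).
    print(f"fine-tuning: {training_days(n_params, 1e8, peak, util):.2f} days")
```

With these assumed numbers, pretraining from scratch comes out at around 30 years on one GPU, while a pass over 100 million tokens takes under a day. That gap is why the practical question at home is usually about fine-tuning rather than pretraining.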

Good luck and have fun!

@sgtflame For one 4090 with 24 GB of VRAM, does your 12x rule mean you can train at most a 2 GB model? Since MPT-7B looks like it is about 10 GB, does that mean it is impossible to train on a build even with two 4090s, even if you train the model for something like 6 months?
Where can I read more about how model size relates to the resources needed for training?


I recommend reading Methods and tools for efficient training on a single GPU, which includes many tips and tricks to train your models efficiently on a single GPU. Memory usage is explained in detail at Model training anatomy.
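To make the memory side concrete, here is a minimal sketch of the per-parameter accounting, using the figures for mixed-precision training with AdamW as I recall them from the Model training anatomy guide: 6 bytes for weights, 4 bytes for gradients, and 8 bytes for optimizer state. Activations and temporary buffers are left out, so treat the result as a lower bound. The function names are made up for illustration.

```python
# Per-parameter memory for mixed-precision training with AdamW.
# Activations are NOT included, so real usage will be higher.

BYTES_WEIGHTS = 6    # fp32 master copy (4) + half-precision working copy (2)
BYTES_GRADIENTS = 4  # gradients kept in fp32
BYTES_ADAMW = 8      # two fp32 moment estimates per parameter
BYTES_PER_PARAM = BYTES_WEIGHTS + BYTES_GRADIENTS + BYTES_ADAMW  # 18


def full_training_memory_gb(n_params: float) -> float:
    """Approximate GB needed for weights, gradients, and optimizer state."""
    return n_params * BYTES_PER_PARAM / 1e9


def max_trainable_params(vram_gb: float) -> float:
    """Largest parameter count whose training state fits in vram_gb."""
    return vram_gb * 1e9 / BYTES_PER_PARAM


if __name__ == "__main__":
    print(f"7B model: {full_training_memory_gb(7e9):.0f} GB")
    print(f"24 GB card: {max_trainable_params(24) / 1e9:.2f}B parameters")
```

Under these assumptions a 7B model needs about 126 GB before activations, and a 24 GB card tops out at a bit over 1B parameters for full training. That is in the same ballpark as the 12x rule of thumb mentioned earlier in the thread.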

Regarding training (fine-tuning) a 7B model on a single RTX 4090 GPU (which has 24 GB of VRAM), that is only possible using either LoRA or QLoRA, which freeze the base model (either in half precision or in 4 bits) and train adapters on top of it. I have a notebook on that here.
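As a back-of-the-envelope illustration of why freezing the base model helps, the sketch below estimates the memory of the frozen weights at 16 and 4 bits, and counts the trainable parameters a LoRA adapter adds to one weight matrix. The 4096 x 4096 matrix and rank 16 are hypothetical values picked for the example, not Falcon's actual configuration, and quantization overhead and activations are ignored.

```python
# Why LoRA / QLoRA fit on a 24 GB card: the base weights are frozen (no
# gradients or optimizer state), and only small adapter matrices are trained.


def frozen_base_gb(n_params: float, bits_per_weight: int) -> float:
    """GB needed to hold frozen weights at the given precision."""
    return n_params * bits_per_weight / 8 / 1e9


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    n_params = 7e9
    print(f"half precision base: {frozen_base_gb(n_params, 16):.1f} GB")
    print(f"4-bit base:          {frozen_base_gb(n_params, 4):.1f} GB")

    # Hypothetical projection matrix and adapter rank, for illustration only.
    d, rank = 4096, 16
    full = d * d
    adapter = lora_trainable_params(d, d, rank)
    print(f"adapter trains {adapter:,} of {full:,} weights "
          f"({100 * adapter / full:.2f}%)")
```

The frozen 7B base comes to about 14 GB in half precision or about 3.5 GB in 4 bits, and the adapter for the example matrix trains under 1% of its weights, so the gradient and optimizer memory shrinks accordingly.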