How does one do full fine-tuning on Falcon 180B?

alvations · September 8, 2023, 12:13pm

The blog post "Spread Your Wings: Falcon 180B is here" includes a breakdown of the instances and memory needed for full fine-tuning.

Is there a guide on how to do that on SageMaker?

The questions in parts:

Does running a simple train_clm.py script work for full fine-tuning?
- There’s an example at https://nbviewer.org/github/huggingface/notebooks/blob/main/sagemaker/28_train_llms_with_qlora/scripts/run_clm.py, but it runs on a single instance and uses QLoRA for fine-tuning
- If not, are there examples of distributed full fine-tuning? Would this work? Jupyter Notebook Viewer
What configurations are available when using the “distribution” argument for full fine-tuning?
- This won’t work easily for 180B, right? "smdistributed": {"dataparallel": {"enabled": True}}
“8 x 8 x A100” would be 8 ml.p4d.24xlarge instances on SageMaker, is that correct?
Would 7,000,000 GPU hours on “8 x 8 x A100” equate to the following?
- 7,000,000 / 8 instance counts / 8 GPUs = 109,375 hours on “8 x 8 x A100”
- 109,375 / 24 hours a day / 365 days a year ~= 12 years on “8 x 8 x A100”
- so a full fine-tune using as much compute as the original training would take about 12 years on “8 x 8 x A100”?
- and to achieve the same amount of training in, let’s say, 1 year, we would need 96 ml.p4d.24xlarge instances?
  - and if we take $19.22 per instance per hour, with 96 instances running for a full year, the total cost to train the model would be around $19.22 * 24 hours a day * 365 days * 96 instances ~= US$16 million (if we’re budgeting for full fine-tuning of a similar model, would $16M be an appropriate number?)
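
To make the estimate above easy to re-check, here is a minimal Python sketch of the same arithmetic. It assumes perfectly linear scaling across GPUs and instances (which real distributed training will not reach) and uses the $19.22 hourly price quoted above; the function and variable names are only illustrative.

```python
# Back-of-the-envelope check of the numbers in the question.
# Assumption: compute scales perfectly linearly with GPU count.

GPU_HOURS = 7_000_000        # total GPU hours quoted for the original training
INSTANCES = 8                # "8 x 8 x A100": 8 instances ...
GPUS_PER_INSTANCE = 8        # ... with 8 GPUs each
PRICE_PER_INSTANCE_HOUR = 19.22  # USD, figure taken from the question
HOURS_PER_YEAR = 24 * 365


def wall_clock_hours(gpu_hours, instances, gpus_per_instance):
    """Hours of wall-clock time if the work is spread evenly over all GPUs."""
    return gpu_hours / (instances * gpus_per_instance)


def total_cost(instances, hours, price_per_instance_hour):
    """Cost in USD of running `instances` instances for `hours` hours."""
    return instances * hours * price_per_instance_hour


hours_on_8 = wall_clock_hours(GPU_HOURS, INSTANCES, GPUS_PER_INSTANCE)
years_on_8 = hours_on_8 / HOURS_PER_YEAR

# The question rounds to 12 years, giving 12 * 8 = 96 instances for one year.
cost_96_for_a_year = total_cost(96, HOURS_PER_YEAR, PRICE_PER_INSTANCE_HOUR)

# Without rounding: the instance-hours are fixed, so under linear scaling
# the cost does not depend on how many instances run in parallel.
instance_hours = GPU_HOURS / GPUS_PER_INSTANCE
cost_unrounded = instance_hours * PRICE_PER_INSTANCE_HOUR

print(f"{hours_on_8:,.0f} hours = {years_on_8:.2f} years on 8 instances")
print(f"96 instances for 1 year: ${cost_96_for_a_year:,.0f}")
print(f"Unrounded total:         ${cost_unrounded:,.0f}")
```

Under these assumptions the unrounded figure is 12.49 years rather than 12, so the one-year equivalent is closer to 100 instances and the unrounded total comes to roughly $16.8M instead of $16.2M.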

Thank you in advance for the information! Look forward to anyone with more information on how to do full fine-tuning on the 180B model.
