Fine-tuning Llama 2 7b hf takes 160 hours on an RTX 4070?

Hi,

I would like to fine-tune Llama 2 7b on a single RTX 4070 GPU with a small dataset by running the autotrain command locally:

autotrain llm --train --project-name my-llm --model meta-llama/Llama-2-7b-hf --data-path timdettmers/openassistant-guanaco --use-peft --quantization int4 --lr 2e-4 --batch 2 --epochs 3 --trainer sft

But the time estimate still looks like this after running for almost half a day:
1%|▌ | 47/6393 [11:58:31<1612:04:01, 914.50s/it]
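
At the reported rate, the remaining time works out to roughly:

(6393 − 47) steps × 914.5 s/step ≈ 5,803,000 s ≈ 1,612 hours ≈ 67 days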

Is there any approach or parameter change that could make it quicker?
Any suggestions or help would be highly appreciated.

K

Are you maxing out your available VRAM?
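
You can watch nvidia-smi while the run is going, or query it from inside the script. A minimal sketch using PyTorch's memory counters (device index 0 assumed):

import torch

# Rough check of how much of the card's VRAM the run is actually using
gpu = torch.cuda.get_device_properties(0)
print(f"allocated: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1e9:.1f} GB")
print(f"total:     {gpu.total_memory / 1e9:.1f} GB on {gpu.name}")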

The available parameters are listed here:
autotrain-advanced/src/autotrain/trainers/clm/params.py at main · huggingface/autotrain-advanced (github.com)

Mixed precision may increase throughput, though I am unsure whether it can be used concurrently with PEFT.

auto_find_batch_size may ensure that you get a batch size that fully utilises your 4070's VRAM. A sketch of both settings follows below.
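
For reference, here is a minimal sketch of the equivalent knobs in plain transformers, which autotrain wraps under the hood. The exact autotrain CLI flag names may differ, so check the params.py file linked above; the batch size and gradient_accumulation_steps values here are only illustrative, and the output_dir reuses your project name.

from transformers import TrainingArguments

# Sketch only: the settings discussed above, expressed as TrainingArguments
args = TrainingArguments(
    output_dir="my-llm",
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=8,   # start high; auto_find_batch_size backs off on CUDA OOM
    auto_find_batch_size=True,       # halves the batch size on OOM until the run fits in VRAM
    fp16=True,                       # mixed precision; bf16=True is another option on an RTX 4070
    gradient_accumulation_steps=4,   # illustrative value to keep the effective batch size up
)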
