I am attempting to run the Stanford Alpaca training that they describe here:
I have a workstation with 512 GB of RAM and 2x RTX 3090s (24 GB VRAM each). I have reached the point of training, but I keep getting out-of-memory errors. I know the authors used 4x A100s with 80 GB VRAM. I have tried setting the number of GPUs to 2 and reducing the batch sizes (see my torchrun command with args below). I have also tried using bitsandbytes to quantize down to 8 bits, but I am having trouble getting that to run.
torchrun --nproc_per_node=2 --master_port=13833 train.py \
    --fsdp "full_shard auto_wrap"
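For context, here is a fuller sketch of the kind of invocation I have been experimenting with. The batch size and gradient accumulation values are my own guesses for squeezing into 24 GB; the model and data paths are placeholders, and the remaining flags follow the Alpaca README's HuggingFace TrainingArguments:

```shell
# Sketch only: paths are placeholders, batch/accumulation values are my
# attempt to trade batch size for gradient accumulation on 2x 24 GB GPUs.
torchrun --nproc_per_node=2 --master_port=13833 train.py \
    --model_name_or_path <path_to_llama_weights> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --learning_rate 2e-5 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
```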