Stanford Alpaca on 2x3090

So I am attempting to do the Stanford Alpaca training that they describe here:

I have a workstation with 512GB of RAM and 2x 3090s with 24GB of VRAM each. I have reached the point where I am trying to train it, but I keep getting out-of-memory errors. I know they used 4x A100s with 80GB of VRAM. I tried changing the number of GPUs to 2 and reducing the batch sizes (see below for my torchrun command with args). I have also tried using bitsandbytes to quantize down to 8-bit, but I am having problems getting that to run (a rough sketch of what I was attempting is included after the command below).

Any suggestions?

torchrun --nproc_per_node=2 --master_port=13833 train.py \
--model_name_or_path decapoda-research/llama-7b-hf \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir /home/dsa/stanford_alpaca \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 32 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
--tf32 True
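
For reference, this is roughly the shape of the 8-bit load I was attempting with bitsandbytes (a minimal sketch, not my exact script; it assumes a transformers build that supports load_in_8bit and device_map, with bitsandbytes installed):

# Simplified sketch of the bitsandbytes 8-bit load, not my actual training script.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# load_in_8bit quantizes the linear layers via bitsandbytes at load time;
# device_map="auto" spreads the weights across both 3090s.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)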
