I am attempting to run the Stanford Alpaca training that they describe here:
I have a workstation with 512 GB of RAM and 2x RTX 3090s (24 GB VRAM each). I have reached the point of training, but I keep getting out-of-memory errors. I know the authors used 4x A100s with 80 GB VRAM. I have tried setting the number of GPUs to 2 and reducing the batch sizes (see my torchrun command with args below). I have also tried using bitsandbytes to quantize down to 8 bits, but I am having trouble getting that to run.
torchrun --nproc_per_node=2 --master_port=13833 train.py \
    --fsdp "full_shard auto_wrap"
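For context, here is a fuller sketch of the kind of invocation I have been experimenting with. The batch size and gradient accumulation values are my own guesses for squeezing into 24 GB; the model and data paths are placeholders, and the remaining flags follow the Alpaca README's HuggingFace TrainingArguments:

```shell
# Sketch only: paths are placeholders, batch/accumulation values are my
# attempt to trade batch size for gradient accumulation on 2x 24 GB GPUs.
torchrun --nproc_per_node=2 --master_port=13833 train.py \
    --model_name_or_path <path_to_llama_weights> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --learning_rate 2e-5 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
```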