Hi,
I am using Hugging Face's run_clm.py to train the GPT-J-6B model on 8 GPUs, but it does not use all of them and throws a CUDA out-of-memory error. I have tried setting the batch size to multiples of the GPU count. Is there anything I need to specify to use all GPUs?
python3 run_clm.py \
  --model_name_or_path EleutherAI/gpt-j-6B \
  --dataset_name glue \
  --dataset_config_name cola \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --do_train \
  --do_eval \
  --output_dir /tmp/gptneo20b_100 \
  --num_train_epochs=100 \
  --weight_decay=0.01 \
  --learning_rate=1e-5
This is what I am currently running.
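You need to launch with torchrun --nproc_per_node=NGPUS run_clm.py ...
in order to enable multi-GPU training. Launching with plain python3 starts a single process, so the script is not run in distributed mode. Another option is to use Accelerate's CLI launcher directly: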
accelerate launch --multi_gpu --num_processes=NGPUS run_clm.py...
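As a rough sketch, here is the torchrun launch with the arguments from the question filled in. It assumes a single node with 8 GPUs and no gradient accumulation. NGPUS, PER_DEVICE_BATCH, and EFFECTIVE_BATCH are illustrative shell variables, not options of run_clm.py. The script only prints the command (dry run) so you can check it before starting a long job.

```shell
#!/bin/sh
# Assumption: one node with 8 GPUs, as described in the question.
NGPUS=8
PER_DEVICE_BATCH=8

# Each launched process uses its own per-device batch, so the global batch
# per optimizer step is per-device batch * number of processes
# (assuming no gradient accumulation).
EFFECTIVE_BATCH=$((NGPUS * PER_DEVICE_BATCH))
echo "effective train batch size: ${EFFECTIVE_BATCH}"

# Dry run: print the launch command instead of executing it.
# Remove the leading "echo" to actually start training.
echo torchrun --nproc_per_node="${NGPUS}" run_clm.py \
  --model_name_or_path EleutherAI/gpt-j-6B \
  --dataset_name glue \
  --dataset_config_name cola \
  --per_device_train_batch_size "${PER_DEVICE_BATCH}" \
  --per_device_eval_batch_size "${PER_DEVICE_BATCH}" \
  --do_train \
  --do_eval \
  --output_dir /tmp/gptneo20b_100 \
  --num_train_epochs=100 \
  --weight_decay=0.01 \
  --learning_rate=1e-5
```

Note that --per_device_train_batch_size is applied per process, so with 8 processes the value 8 already gives a global batch of 64. If memory is the problem, lowering the per-device value is the first thing to try.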