Multi-GPU not working

I am using Hugging Face to train the GPT-J-6B model with 8 GPUs, but it is not using all of them and throws a CUDA out-of-memory error. I have tried changing the batch size to a multiple of the number of GPUs. Is there anything I have to specify to use all GPUs?
```
--model_name_or_path EleutherAI/gpt-j-6B
--dataset_name glue
--dataset_config_name cola
--per_device_train_batch_size 8
--per_device_eval_batch_size 8
--output_dir /tmp/gptneo20b_100
```

This is what I'm currently running.

You need to launch with `torchrun --nproc_per_node=NGPUS ...` in order to enable multi-GPU training. Another option is to use Accelerate's CLI launcher directly:

```
accelerate launch --multi_gpu --num_processes=NGPUS
```
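For example, a full invocation with the flags from the original post might look like the sketch below (`run_glue.py` is a placeholder for whatever training script you are actually launching):

```shell
# Sketch: launch the same training run across all 8 GPUs with Accelerate.
# "run_glue.py" is a hypothetical script name; substitute your own.
accelerate launch --multi_gpu --num_processes=8 run_glue.py \
  --model_name_or_path EleutherAI/gpt-j-6B \
  --dataset_name glue \
  --dataset_config_name cola \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --output_dir /tmp/gptneo20b_100
```

Note that plain data parallelism places a full copy of the model on every GPU, so it does not reduce per-device memory. If one copy of GPT-J-6B plus optimizer state does not fit on a single GPU, you would need a sharding approach such as DeepSpeed ZeRO or FSDP (both configurable through `accelerate config`) rather than simple multi-GPU data parallelism.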

Here, it shows the same out-of-memory error on all 8 GPUs. And why is PyTorch reserving memory on all 8 GPUs, and not just one?
(Using the accelerate launcher)