I am using Hugging Face's run_clm.py to train the gpt-j-6b model on a machine with 8 GPUs. But it is not using all the GPUs and is throwing a CUDA out-of-memory error. I have tried changing batch_size to a multiple of the number of GPUs. Is there anything I have to specify to use all the GPUs?
This is what I am currently running.
You need to launch with
torchrun --nproc_per_node=NGPUS run_clm.py ...
to enable multi-GPU training (note the flag is --nproc_per_node, not --n_procs_per_node). Another option is to use Accelerate's CLI launcher directly:
accelerate launch --multi_gpu --num_processes=NGPUS run_clm.py ...
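As a concrete sketch for your 8-GPU case (the model name, dataset, and memory-saving flags below are illustrative assumptions, not taken from your command; all are standard run_clm.py / Trainer arguments):

```shell
# Sketch: launch run_clm.py across all 8 GPUs with torchrun (DDP).
torchrun --nproc_per_node=8 run_clm.py \
  --model_name_or_path EleutherAI/gpt-j-6B \
  --dataset_name wikitext --dataset_config_name wikitext-103-raw-v1 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --gradient_checkpointing \
  --fp16 \
  --do_train \
  --output_dir ./gptj-clm
```

Note that --per_device_train_batch_size is per GPU, so the effective batch size here is 1 × 8 GPUs × 8 accumulation steps = 64. Also be aware that DDP replicates the full model on every GPU, so GPT-J-6B can still hit out-of-memory per device even with all GPUs in use; gradient checkpointing and fp16 reduce the footprint, and sharding the optimizer/model states (e.g. DeepSpeed ZeRO via the Trainer's --deepspeed flag) is the usual next step if it still does not fit.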