### System Info
```
python:3.12.4
transformers:4.45.2
trl:0.11.4
huggingface_hub:0.25.2
accelerate:1.0.1
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder
- [X] My own task or dataset (give details below)
### Reproduction
**About**
I am trying to fine-tune Llama on multiple GPUs using the `trl` library. During training, I noticed that `gpu:0` is actively computing while the other GPUs sit idle, even though their VRAM is consumed. This seems unexpected to me; I assumed all GPUs would be busy during training. I searched for an existing issue but found none that fits my case.
Here is the relevant code.
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format
from accelerate import Accelerator, PartialState  # used in the checks below
```
Checking whether torch sees all the devices, and it does.
```python
torch.cuda.device_count()
# 3
```
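For completeness, here is a slightly more detailed check (a minimal sketch; device names and memory sizes obviously depend on the hardware):
```python
import torch

# List every CUDA device torch can see, with its name and total memory.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} -> {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```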
Checking with `accelerate` (this is the part that looks suspicious: only one process is reported).
```python
accelerator = Accelerator()
print(f"Using {accelerator.num_processes} processes.")
print(f"Process index: {accelerator.process_index}")
# Using 1 processes.
# Process index: 0
```
```python
device_string = PartialState().process_index
device_string
# 0
```
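As far as I understand, `num_processes == 1` just reflects how the script was launched: plain `python` starts a single process, while data-parallel training is normally started with `accelerate launch` or `torchrun`, which spawn one process per GPU. A minimal sketch of how one can check the launch mode via the usual environment variables:
```python
import os

# Launchers such as `accelerate launch` and `torchrun` set these variables;
# with a plain `python` launch they are typically absent.
for var in ("WORLD_SIZE", "RANK", "LOCAL_RANK"):
    print(var, os.environ.get(var, "<not set>"))
```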
Defining the model and setting `device_map="auto"`.
```python
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
torch_dtype = torch.float16
attn_implementation = "eager"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model, tokenizer = setup_chat_format(model, tokenizer)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        'up_proj', 'down_proj', 'gate_proj',
        'k_proj', 'q_proj', 'v_proj', 'o_proj'
    ],
)
model = get_peft_model(model, peft_config)
```
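After wrapping the model with LoRA, a quick sanity check using PEFT's built-in helper (a minimal sketch on the `model` defined above):
```python
# Show how many parameters are actually trainable after applying LoRA
# (prints a line like "trainable params: ... || all params: ... || trainable%: ...").
model.print_trainable_parameters()
```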
Defining the training arguments. I set `gradient_checkpointing_kwargs={"use_reentrant": False}` because I read in another issue that it is required.
```python
training_arguments = TrainingArguments(
    output_dir='result',
    per_device_train_batch_size=3*3,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=True,
    bf16=False,
    group_by_length=True,
    report_to="wandb",
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```
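For reference, the effective batch size per optimizer step, assuming this really runs as a single process (which is what the `Accelerator` check above reports):
```python
per_device_train_batch_size = 3 * 3   # = 9, as set above
gradient_accumulation_steps = 2
num_processes = 1                     # what Accelerator() reported above

# Samples consumed per optimizer step = per-device batch * accumulation * processes.
print(per_device_train_batch_size * gradient_accumulation_steps * num_processes)  # 18
```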
Checking the model's device map. It looks OK: the layers are spread across all three GPUs (0, 1, 2).
```python
for k, v in model.hf_device_map.items():
    print(k, v)
# model.embed_tokens 0
# model.layers.0 0
# model.layers.1 0
# model.layers.2 1
# model.layers.3 1
# model.layers.4 1
# model.layers.5 1
# model.layers.6 1
# model.layers.7 1
# model.layers.8 1
# model.layers.9 1
# model.layers.10 1
# model.layers.11 1
# model.layers.12 1
# model.layers.13 1
# model.layers.14 1
# model.layers.15 1
# model.layers.16 2
# model.layers.17 2
# model.layers.18 2
# model.layers.19 2
# model.layers.20 2
# model.layers.21 2
# model.layers.22 2
# model.layers.23 2
# model.layers.24 2
# model.layers.25 2
# model.layers.26 2
# model.layers.27 2
# model.layers.28 2
# model.layers.29 2
# model.layers.30 2
# model.layers.31 2
# model.norm 2
# model.rotary_emb 2
# lm_head 2
```
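To summarize the placement, a small sketch that just counts how many modules landed on each device:
```python
from collections import Counter

# Count how many modules ended up on each device according to the map above.
print(Counter(model.hf_device_map.values()))
# Counter({2: 19, 1: 14, 0: 3})
```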
Start training.
```python
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    max_seq_length=512,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)
trainer.train()
trainer.train()
```
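For reference, a minimal sketch of a per-device memory snapshot I can run alongside training; allocated VRAM alone does not indicate whether a GPU is actually computing, which is the distinction the utilization readings below are about:
```python
import torch

# Allocated VRAM alone does not show whether a GPU is actually computing.
for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i}: {alloc:.1f} GiB allocated, {reserved:.1f} GiB reserved")
```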
GPU usage (monitoring screenshots): `gpu:0` shows active compute while `gpu:1` and `gpu:2` stay idle with their VRAM allocated.
### Expected behavior
At this point, I'm not sure whether all GPUs are being used as expected or not. I would expect every GPU to show compute activity during training, not only `gpu:0`.