### System Info
```
python:3.12.4
transformers:4.45.2
trl:0.11.4
huggingface_hub:0.25.2
accelerate:1.0.1
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder
- [X] My own task or dataset (give details below)
### Reproduction
**About**
I am trying to fine-tune Llama on multiple GPUs using the `trl` library. During training, I noticed that `gpu:0` is actively computing while the other GPUs sit idle, even though their VRAM is consumed. This seems unexpected to me; I assumed all GPUs would be busy during training. I searched for an existing issue but found none that fits my case.
Here is the relevant code.
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format
from accelerate import Accelerator, PartialState  # used in the checks below
```
Checking whether torch sees all the devices, and it does.
```python
torch.cuda.device_count()
# 3
```
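For completeness, here is a slightly more detailed check (a minimal sketch; device names and memory sizes obviously depend on the hardware):
```python
import torch

# List every CUDA device torch can see, with its name and total memory.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} -> {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```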
Checking with `accelerate` (this is the part that looks suspicious: only one process is reported).
```python
accelerator = Accelerator()
print(f"Using {accelerator.num_processes} processes.")
print(f"Process index: {accelerator.process_index}")
# Using 1 processes.
# Process index: 0
```
```python
device_string = PartialState().process_index
device_string
# 0
```
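As far as I understand, `num_processes == 1` just reflects how the script was launched: plain `python` starts a single process, while data-parallel training is normally started with `accelerate launch` or `torchrun`, which spawn one process per GPU. A minimal sketch of how one can check the launch mode via the usual environment variables:
```python
import os

# Launchers such as `accelerate launch` and `torchrun` set these variables;
# with a plain `python` launch they are typically absent.
for var in ("WORLD_SIZE", "RANK", "LOCAL_RANK"):
    print(var, os.environ.get(var, "<not set>"))
```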
Defining the model and setting `device_map="auto"`.
```python
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
torch_dtype = torch.float16
attn_implementation = "eager"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model, tokenizer = setup_chat_format(model, tokenizer)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        'up_proj', 'down_proj', 'gate_proj',
        'k_proj', 'q_proj', 'v_proj', 'o_proj'
    ],
)
model = get_peft_model(model, peft_config)
```
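After wrapping the model with LoRA, a quick sanity check using PEFT's built-in helper (a minimal sketch on the `model` defined above):
```python
# Show how many parameters are actually trainable after applying LoRA
# (prints a line like "trainable params: ... || all params: ... || trainable%: ...").
model.print_trainable_parameters()
```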
Defining the training arguments. I set `gradient_checkpointing_kwargs={"use_reentrant": False}` because I read in another issue that it is required.
```python
training_arguments = TrainingArguments(
    output_dir='result',
    per_device_train_batch_size=3*3,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=True,
    bf16=False,
    group_by_length=True,
    report_to="wandb",
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```
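For reference, the effective batch size per optimizer step, assuming this really runs as a single process (which is what the `Accelerator` check above reports):
```python
per_device_train_batch_size = 3 * 3   # = 9, as set above
gradient_accumulation_steps = 2
num_processes = 1                     # what Accelerator() reported above

# Samples consumed per optimizer step = per-device batch * accumulation * processes.
print(per_device_train_batch_size * gradient_accumulation_steps * num_processes)  # 18
```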
Checking the model's device map. It looks OK: the layers are spread across all three GPUs (0, 1, 2).
```python
for k, v in model.hf_device_map.items():
    print(k, v)
# model.embed_tokens 0
# model.layers.0 0
# model.layers.1 0
# model.layers.2 1
# model.layers.3 1
# model.layers.4 1
# model.layers.5 1
# model.layers.6 1
# model.layers.7 1
# model.layers.8 1
# model.layers.9 1
# model.layers.10 1
# model.layers.11 1
# model.layers.12 1
# model.layers.13 1
# model.layers.14 1
# model.layers.15 1
# model.layers.16 2
# model.layers.17 2
# model.layers.18 2
# model.layers.19 2
# model.layers.20 2
# model.layers.21 2
# model.layers.22 2
# model.layers.23 2
# model.layers.24 2
# model.layers.25 2
# model.layers.26 2
# model.layers.27 2
# model.layers.28 2
# model.layers.29 2
# model.layers.30 2
# model.layers.31 2
# model.norm 2
# model.rotary_emb 2
# lm_head 2
```
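To summarize the placement, a small sketch that just counts how many modules landed on each device:
```python
from collections import Counter

# Count how many modules ended up on each device according to the map above.
print(Counter(model.hf_device_map.values()))
# Counter({2: 19, 1: 14, 0: 3})
```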
Start training.
```python
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    max_seq_length=512,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)
trainer.train()
trainer.train()
```
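For reference, a minimal sketch of a per-device memory snapshot I can run alongside training; allocated VRAM alone does not indicate whether a GPU is actually computing, which is the distinction the utilization readings below are about:
```python
import torch

# Allocated VRAM alone does not show whether a GPU is actually computing.
for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i}: {alloc:.1f} GiB allocated, {reserved:.1f} GiB reserved")
```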
GPU usage (monitoring screenshots): `gpu:0` shows active compute while `gpu:1` and `gpu:2` stay idle with their VRAM allocated.
### Expected behavior
At this point, I'm not sure whether all GPUs are being used as expected or not. I would expect every GPU to show compute activity during training, not only `gpu:0`.