Possible bug when training LLaMA with LoRA on multiple GPUs

Hi,

I want to fine-tune LLaMA with LoRA on multiple GPUs on my private dataset. I wrote the code following popular repositories on GitHub. It runs successfully on 1 GPU, but when I tried to run it on multiple GPUs, I hit the following problem (I used TORCH_DISTRIBUTED_DEBUG=DETAIL to debug):

Parameter at index 127 with name base_model.model.model.layers.31.self_attn.v_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

To be honest, I'm not familiar with LoRA, so I made another attempt: fine-tuning LLaMA with full parameters on multiple GPUs, and everything was OK. So I think this problem may be caused by LoRA. Here is my code for setting up LLaMA with LoRA:

LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT= 0.05
LORA_TARGET_MODULES = [
    "q_proj",
    "v_proj",
]

model = LlamaForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=data_args.cache_path,
    torch_dtype=torch.float16,
    load_in_8bit=True,
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_threshold=6.0
    ),
)
model = prepare_model_for_int8_training(model)

config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=LORA_TARGET_MODULES,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

So I really need help!


I hit the same problem, can anyone help?

TL;DR
Setting ddp_find_unused_parameters to False in TrainingArguments will fix the error.
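For a concrete illustration, here is a minimal sketch (everything except ddp_find_unused_parameters is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-llama-output",  # placeholder
    per_device_train_batch_size=4,     # placeholder
    num_train_epochs=1,                # placeholder
    ddp_find_unused_parameters=False,  # the fix: stop DDP from searching for unused parameters
)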

I just got the same problem, and I managed to find the solution here:


Thanks, it works well.

See reply below.

Hi @nguyenvlm, @Colorful, I am trying to run the code with multiple GPUs, and I need to run on specific devices like GPUs 1 and 2 (not gpu:0, as it is full most of the time). Would you please help me with that, or share your code setup that works with multiple GPUs? Many thanks.
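For the device-selection part specifically, a minimal sketch (it has to run before anything initializes CUDA, as in the full script below) is to restrict the visible devices so that physical GPUs 1 and 2 show up inside the process as cuda:0 and cuda:1:

import os

# Hide GPU 0; only physical GPUs 1 and 2 are visible to this process,
# and they are renumbered as cuda:0 and cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

import torch  # imported after setting the variable so CUDA only sees those devices

print(torch.cuda.device_count())  # expected to print 2

The same variable can also be exported in the shell before launching the training script.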

import os
import torch
import pandas as pd
from datasets import load_dataset
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"
os.environ['CUDA_VISIBLE_DEVICES'] = "1,2"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

model_name="//sentence-transformers/Llama-2-7b-hf"

# The instruction dataset to use
# dataset_name = "mlabonne/guanaco-llama2-1k"
# Load the local parquet file as a Hugging Face Dataset (SFTTrainer expects a Dataset, not a DataFrame)
dataset = load_dataset("parquet", data_files="/notebooks/output_data/data.parquet")["train"]

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"


################################################################################
# bitsandbytes parameters
################################################################################

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    use_cache=False,
    device_map={"": 0},
)
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
################################################################################
# QLoRA parameters
###########################################################################
# LoRA config based on QLoRA paper
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
)


# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)


from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama-7-int4-dolly",
    num_train_epochs=1,
    per_device_train_batch_size=4,  # increase to 6 if flash attention is enabled
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=True  # disable tqdm since with packing the reported values are incorrect
)


from trl import SFTTrainer

max_seq_length = 1056 # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    dataset_text_field="text",
    packing=True,
    # formatting_func=format_instruction,
    args=args,
)

output_dir = "~/Llama-2-7b-hf_results/v2/"

trainer.train() # there will not be a progress bar since tqdm is disabled

@Colorful, I hope you are well. Sorry to bother you, but would you please share your code for multiple GPUs? Does it work for you with multiple GPUs, and are the results good? Many thanks for your help.

Repost from this thread: Multi-gpu training example?

I use QLoRA for fine-tuning with multiple GPUs, and now it utilizes all of them.

!pip install bitsandbytes==0.41.1
!pip install transformers==4.31.0
!pip install peft==0.4.0 
!pip install accelerate==0.21.0 einops==0.6.1 evaluate==0.4.0 scikit-learn==1.2.2 sentencepiece==0.1.99

*change qlora.py
device_map='auto' => device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))}
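In context, the change amounts to something like the sketch below (a rough illustration rather than qlora.py's exact loading code; the model name and 4-bit setting follow the command further down):

import os
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Each process started by accelerate launch gets its own LOCAL_RANK (0, 1, ...),
# so every rank loads a full copy of the model onto a different GPU.
local_rank = int(os.environ.get("LOCAL_RANK") or 0)
device_map = {"": "cuda:" + str(local_rank)}

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map=device_map,
)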

!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=True --ddp_find_unused_parameters=False

Tested in Runpod environment with Python 3.10 and Torch 2.0.0+cu117

When gradient_checkpointing is True, training is a little slower, but the VRAM usage is spread across all GPUs.

For example, with one GPU it needs 20 GB of VRAM.
With two GPUs, it needs 20/2 = 10 GB per GPU;
with three GPUs, it needs 20/3 ≈ 6.67 GB per GPU.

Got 15 seconds/iter.

Compared to

!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=False

When gradient_checkpointing is False, training is faster, but it consumes more GPU VRAM.

For example, with one GPU it needs 20 GB of VRAM.
With two GPUs, it needs 20 × 2 = 40 GB in total;
with three GPUs, it needs 20 × 3 = 60 GB in total.

Got 10 seconds/iter, but VRAM usage is multiplied by the number of GPUs.

Compared to the vanilla (original) launch:

!python3.10 qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True

Got 55 seconds/iter, so it is very slow compared to the previous methods.

@SUNM Hi, I use the accelerate library to train with multiple GPUs. Here is my source code:

# coding=utf-8
# Implements parameter-efficient or full parameters supervised fine-tuning for LLaMa model.
# This code is inspired by
# https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py and https://www.mlexpert.io/machine-learning/tutorials/alpaca-fine-tuning


import transformers
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM,
    DataCollatorForSeq2Seq,
    Trainer,
    Seq2SeqTrainer,
    HfArgumentParser,
    Seq2SeqTrainingArguments,
    BitsAndBytesConfig,
)

from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
    set_peft_model_state_dict,
)

import torch
import os
import evaluate
import functools
from datasets import load_dataset
import bitsandbytes as bnb
import logging
import json
import copy
from typing import Dict, Optional, Sequence
from dataclasses import dataclass, field


# Lora settings
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT= 0.05
LORA_TARGET_MODULES = [
    "q_proj",
    "v_proj",
]


@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="elinas/llama-7b-hf-transformers-4.29")


@dataclass
class DataArguments:
    data_path: str = field(default=None, metadata={"help": "Path to the training data."})
    train_file: str = field(default=None, metadata={"help": "Path to the evaluation data."})
    eval_file: str = field(default=None, metadata={"help": "Path to the evaluation data."})
    cache_path: str = field(default=None, metadata={"help": "Path to the cache directory."})
    num_proc: int = field(default=4, metadata={"help": "Number of processes to use for data preprocessing."})


@dataclass
class TrainingArguments(transformers.TrainingArguments):
    # cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(
        default=512,
        metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
    )
    is_lora: bool = field(default=True, metadata={"help": "Whether to use LORA."})


def tokenize(text, tokenizer, max_seq_len=512, add_eos_token=True):
    result = tokenizer(
        text,
        truncation=True,
        max_length=max_seq_len,
        padding=False,
        return_tensors=None,
    )
    if (
        result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < max_seq_len
        and add_eos_token
    ):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)

    if add_eos_token and len(result["input_ids"]) >= max_seq_len:
        result["input_ids"][max_seq_len - 1] = tokenizer.eos_token_id
        result["attention_mask"][max_seq_len - 1] = 1

    result["labels"] = result["input_ids"].copy()
    return result


def main():
    parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    if training_args.is_lora:
        model = AutoModelForCausalLM.from_pretrained(
            model_args.model_name_or_path,
            cache_dir=data_args.cache_path,
            torch_dtype=torch.float16,
            load_in_8bit=True,
            quantization_config=BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0
            ),
        )
        model = prepare_model_for_int8_training(model)

        config = LoraConfig(
            r=LORA_R,
            lora_alpha=LORA_ALPHA,
            target_modules=LORA_TARGET_MODULES,
            lora_dropout=LORA_DROPOUT,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_args.model_name_or_path,
            torch_dtype=torch.float16,
            cache_dir=data_args.cache_path,
        )
    model.config.use_cache = False

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=data_args.cache_path,
        model_max_length=training_args.model_max_length,
        padding_side="left",
        use_fast=True,
    )
    tokenizer.pad_token = tokenizer.unk_token
    
    # Load dataset

    def generate_and_tokenize_prompt(sample):
        input_text = sample["input"]
        target_text = sample["output"] + tokenizer.eos_token
        full_text = input_text + target_text
        tokenized_full_text = tokenize(full_text, tokenizer, max_seq_len=training_args.model_max_length)
        tokenized_input_text = tokenize(input_text, tokenizer, max_seq_len=training_args.model_max_length)
        input_len = len(tokenized_input_text["input_ids"])  # This relies on a LlamaTokenizer bug: it does not add the eos token here
        tokenized_full_text["labels"] = [-100] * input_len + tokenized_full_text["labels"][input_len:]
        return tokenized_full_text

    data_files = {}
    if data_args.train_file is not None:
        data_files["train"] = data_args.train_file
    if data_args.eval_file is not None:
        data_files["eval"] = data_args.eval_file
    
    dataset = load_dataset(data_args.data_path, data_files=data_files)
    train_dataset = dataset["train"]
    eval_dataset = dataset["eval"]
    train_dataset = train_dataset.map(generate_and_tokenize_prompt, num_proc=data_args.num_proc)
    eval_dataset = eval_dataset.map(generate_and_tokenize_prompt, num_proc=data_args.num_proc)
    data_collator = DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True)

    # Evaluation metrics
    def compute_metrics(eval_preds, tokenizer):
        metric = evaluate.load('exact_match')
        preds, labels = eval_preds
        # In case the model returns more than the prediction logits
        if isinstance(preds, tuple):
            preds = preds[0]

        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)

        # Replace -100s in the labels as we can't decode them
        labels[labels == -100] = tokenizer.pad_token_id
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)

        # Some simple post-processing
        decoded_preds = [pred.strip() for pred in decoded_preds]
        decoded_labels = [label.strip() for label in decoded_labels]

        result = metric.compute(predictions=decoded_preds, references=decoded_labels)
        return {'exact_match': result['exact_match']} 
    
    compute_metrics_fn = functools.partial(compute_metrics, tokenizer=tokenizer)

    # Training
    trainer = Trainer(
        model=model, 
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,  
        args=training_args,
        data_collator=data_collator,
        compute_metrics=compute_metrics_fn,
    )
    trainer.train()
    trainer.save_state()
    trainer.save_model(output_dir=training_args.output_dir)
    tokenizer.save_pretrained(save_directory=training_args.output_dir)


if __name__ == "__main__":
    main()

And I use the bash script below to launch the Python code:

accelerate launch llama2_sft.py \
    --model_name_or_path <path to llama> \
    --data_path <path to dataset> \
    --train_file <train file name> \
    --eval_file <eval file name> \
    --is_lora True \
    --model_max_length 512 \
    --cache_path <path to cache> \
    --do_train \
    --do_eval False \
    --fp16 True \
    --output_dir <output path> \
    --num_train_epochs 2 \
    --per_device_train_batch_size <set according to your GPUs scale> \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --eval_steps 10 \
    --save_steps <Set according to your data size> \
    --learning_rate 2e-4 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --ddp_find_unused_parameters False

@Colorful, many thanks for your help, I really appreciate your reply. Did you get good results from the code? What did you use for inference to evaluate the model? And why did you use DataCollatorForSeq2Seq? It is a causal model, so does it make sense to use a seq2seq collator?
Regarding the data: my data is in a CSV file. Can I pass it to the code in CSV format when we use --train_file? Your data has two fields, sample["input"] (the prompt) and sample["output"] (the completion)?
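For reference, the datasets library can read CSV files directly; a minimal sketch (the file names and the input/output column names here are hypothetical, matching the fields used in the script above):

from datasets import load_dataset

# Hypothetical CSV files with "input" and "output" columns, i.e. what
# generate_and_tokenize_prompt expects in the script above.
data_files = {"train": "train.csv", "eval": "eval.csv"}
dataset = load_dataset("csv", data_files=data_files)

print(dataset["train"][0]["input"])
print(dataset["train"][0]["output"])

With the script above, passing csv as --data_path and the CSV file names as --train_file/--eval_file should end up in an equivalent load_dataset call.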

@Colorful, I am waiting for your reply. Many thanks.