ORPO Trainer giving error when fine-tuning Llama3-8b in Multi-GPU environment

I am using the following code to fine-tune Llama3-8B with the ORPO trainer on a Kaggle notebook with 2 T4 GPUs. The trainer throws an error, and I also noticed that only 1 GPU is being utilized. My understanding, from multiple articles and discussions, is that accelerate distributes training across multiple GPUs with the device_map="auto" setting without any additional code. Note that this is running in a notebook and not as a script (and not using notebook_launcher). Please advise on any additional settings required for running on multiple GPUs.

Code:

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import ORPOConfig, ORPOTrainer, setup_chat_format
from peft import LoraConfig, prepare_model_for_kbit_training

HF_TOKEN = os.environ.get("HF_TOKEN")  # one way to supply the token; not defined in the original snippet
torch_dtype = torch.float16            # T4 GPUs do not support bfloat16
attn_implementation = "eager"          # flash attention is not available on T4

base_model = "meta-llama/Meta-Llama-3-8B"
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, token=HF_TOKEN)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    token=HF_TOKEN,
    attn_implementation=attn_implementation
)
model, tokenizer = setup_chat_format(model, tokenizer)
model = prepare_model_for_kbit_training(model)

%%time
orpo_args = ORPOConfig(
    learning_rate=8e-6,
    lr_scheduler_type="linear",
    max_length=1024,
    max_prompt_length=512,
    beta=0.1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    report_to="none",
    output_dir="./results/",
    remove_unused_columns=False,
    fp16=True,
    bf16=False,
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
)

trainer.train()

Error

ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:1

Below prints show only 1 GPU used and not running in distributed mode:

print(orpo_args.n_gpu, orpo_args.parallel_mode)
1 ParallelMode.NOT_PARALLEL
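
Both T4 GPUs are visible to PyTorch in the session, for example:

import torch

print(torch.cuda.device_count())  # prints 2 on the Kaggle T4 x2 instance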

I figured out how to use multiple GPUs by changing a few settings (such as device_map) and by using notebook_launcher to leverage accelerate in the Kaggle notebook. However, I got an OOM error when fine-tuning the 4-bit quantized Llama3-8B on 2 T4 GPUs. I’d think that for 4-bit quantized fine-tuning of 8B params, the 16 GB of a single GPU should be sufficient, so 2 GPUs with distributed training should not give an OOM error. I noticed GPU usage of 11.5 GB (screenshot given) on each of the 2 GPUs right after the model checkpoints were loaded, which seems strange, and the trainer failed soon after.

Notebook function code:

def main():
    
    import os
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from trl import ORPOConfig, ORPOTrainer, setup_chat_format
    from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
    from accelerate import Accelerator

    accelerator = Accelerator()
    # Put the entire model on this process's GPU (one full copy per process, DDP-style)
    device_map = {"": accelerator.process_index}

    # QLoRA config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )

    # LoRA config
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
    )
    
    base_model = "meta-llama/Meta-Llama-3-8B"
    new_model = "Llama-3-8B_FT_ORPO"
    
    tokenizer = AutoTokenizer.from_pretrained(base_model, token=HF_TOKEN)

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
#         device_map="auto",
        device_map=device_map,
        token=HF_TOKEN,
        attn_implementation="eager"
    )
    
    model, tokenizer = setup_chat_format(model, tokenizer)
    model = prepare_model_for_kbit_training(model)
    
    dataset_name = "mlabonne/orpo-dpo-mix-40k"
    dataset = load_dataset(dataset_name, split="all")
    dataset = dataset.shuffle(seed=42).select(range(30)) # Only use 30 samples for test

    def format_chat_template(row):
        row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
        row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
        return row

    dataset = dataset.map(
        format_chat_template,
        num_proc= os.cpu_count(),
    )
    dataset = dataset.train_test_split(test_size=0.1)
    
#     torch.cuda.empty_cache()
    
    orpo_args = ORPOConfig(
        learning_rate=8e-6,
        lr_scheduler_type="linear",
        max_length=1024,
        max_prompt_length=512,
        beta=0.1,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,
        optim="paged_adamw_8bit",
        num_train_epochs=1,
        evaluation_strategy="steps",
        eval_steps=0.2,
        logging_steps=1,
        warmup_steps=10,
        report_to="none",
        output_dir="./results/",
        remove_unused_columns=False,
        fp16=True,
        bf16=False,
        ddp_find_unused_parameters=False,
        gradient_checkpointing=True,
#         gradient_checkpointing_kwargs = {"use_reentrant": False}, #must be false for DDP
    )

    trainer = ORPOTrainer(
        model=model,
        args=orpo_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        peft_config=peft_config,
        tokenizer=tokenizer,
    )
    print(f'n_gpu: {orpo_args.n_gpu}; Mode: {orpo_args.parallel_mode}')
    print(f'Num Processes: {accelerator.num_processes}; Device: {accelerator.device}; Process Index: {accelerator.process_index}')
    print(f'Accel Type: {accelerator.distributed_type}')

    trainer.train()
    trainer.save_model(new_model)
    
from accelerate import notebook_launcher

notebook_launcher(main, num_processes=2)

After the OOM error, I tried to see if FSDP could be used by adding the following two arguments to the ORPOConfig, but it resulted in an AttributeError:

        fsdp="full_shard",
        fsdp_config={'min_num_params': 2000, 'offload_params': False, 'sharding_strategy': 1},

Error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/launch.py", line 626, in __call__
    self.launcher(*args)
  File "/tmp/ipykernel_34/1091546508.py", line 122, in main
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2001, in _inner_training_loop
    self._fsdp_qlora_plugin_updates()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 4425, in _fsdp_qlora_plugin_updates
    fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(self.model)
  File "/opt/conda/lib/python3.10/site-packages/peft/utils/other.py", line 396, in fsdp_auto_wrap_policy
    transformer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, layer_class)
AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'

Questions:

  1. How much VRAM should be enough to fine-tune a 4-bit quantized 8B model in a single-GPU / multi-GPU environment?
  2. How can I force accelerate to use FSDP when running in a notebook environment, where a config file is not used?
  3. Is the error highlighted above due to an incorrect argument in the code, or is something else missing?

@muellerzr - Appreciate your thoughts!

I’m also pretty new to fine-tuning, so please forgive me if these answers are inaccurate.

  1. It might not be just the model that’s taking up so much memory. The dataset can also heavily influence memory usage, especially if the contents of individual rows are too long for the model to process. By that, I don’t mean the number of rows in the dataset, but rather the contents of each row. I did a basic length check on the dataset you’re using, and it seems that some entries are much too long. Even if this doesn’t solve the issue, I would recommend filtering based on length (or tokenized length); a minimal sketch of such a filter is included at the end of this reply. The OOM issue I ran into during QLoRA fine-tuning (with accelerate DDP) was fixed with this approach, although I’m not sure if the same applies in this case.

(screenshots of the dataset length check omitted)

  2. There’s a HuggingFace page you might want to look into that also has some sample code. The code below was obtained directly from this page.
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=False, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=False, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
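
Regarding point 1 above, here is a minimal sketch of the kind of token-length filter I had in mind. It assumes the tokenizer and dataset from your notebook code, applied after the format_chat_template map (so "chosen" and "rejected" are already plain strings), and the 1024 budget simply mirrors the max_length in your ORPOConfig:

import os

# Sketch only: drop rows whose chosen/rejected texts exceed a token budget.
# Assumes `tokenizer` and `dataset` come from the notebook code above, after
# setup_chat_format and the format_chat_template map have been applied.
MAX_TOKENS = 1024  # mirrors max_length in the ORPOConfig

def within_budget(row):
    chosen_len = len(tokenizer(row["chosen"])["input_ids"])
    rejected_len = len(tokenizer(row["rejected"])["input_ids"])
    return chosen_len <= MAX_TOKENS and rejected_len <= MAX_TOKENS

dataset = dataset.filter(within_budget, num_proc=os.cpu_count())
print(f"Rows remaining after length filter: {len(dataset)}")

Running the filter before train_test_split keeps both splits within the budget.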

Appreciate your inputs @Chahnwoo !

From the articles and blogs I read, a 4-bit QLoRA 7B model should take about 7 GB of VRAM, but I saw that the 4-bit Llama3-8B model took 11.5+ GB of VRAM just for loading (even before training and data loading started; I was loading only 10 samples for a dry run). Another peculiar thing I noticed: while the model was loading, after the final 4th checkpoint shard it showed 5.5 GB used, and the next moment it spiked to almost double, 11+ GB, as if a second copy were being made; I’m not sure why.

Here are a few things I tried and observed:

  • Tried FSDP with and without the Plugin and initially ran into the error mentioned here. When I tried installing the libraries discussed in that thread, I again hit an OOM error.
  • Next I tried DeepSpeed, following the instructions on this page (a rough sketch of such a setup is shown after this list).
  • First, when using DeepSpeed, it did not accept the device_map argument for loading the model. When I took that out, the model took about 8 GB on each GPU (somewhat in line with what I mentioned earlier).
  • Second, I again hit an OOM error, so this time I changed per_device_train_batch_size from 2 to 1 and it finally started to train. With about 1000 samples, the trainer took 2.5 hours to complete. I’m guessing that tuning a QLoRA model with DeepSpeed may not see a huge improvement in training time, given that CPU offloading was not really being used (the time may be almost comparable to single-GPU training).
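
For reference, a ZeRO stage-2 DeepSpeed configuration of the kind that page describes can be passed to the training arguments as a dict (ORPOConfig subclasses TrainingArguments, whose deepspeed parameter accepts either a dict or a path to a JSON file). The sketch below is an assumption of what such a setup can look like, not the exact config I ran:

from trl import ORPOConfig

# Assumed sketch of a ZeRO stage-2 config; "auto" values are filled in by the Trainer.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},  # optional CPU offload
    },
    "fp16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

orpo_args = ORPOConfig(
    output_dir="./results/",
    per_device_train_batch_size=1,   # batch size 1 was the only way it fit in memory
    gradient_accumulation_steps=4,
    fp16=True,
    deepspeed=ds_config,             # dict or path to a DeepSpeed JSON file
)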

Hello again,

I haven’t used the ORPO trainer myself, so I’m not sure what the problem seems to be when it comes to memory.

In terms of training time, could you try setting a higher value for eval_steps (something like 100 or even 500)? From my understanding, the value you assign to that parameter determines the number of steps between evaluations, so a larger value means fewer evaluations are performed, which could reduce the overall training time.
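
For illustration, a minimal sketch of that change, with assumed values rather than anything I have tested on your setup:

from trl import ORPOConfig

# Sketch only: evaluate every 500 optimizer steps instead of a handful of times per run.
orpo_args = ORPOConfig(
    output_dir="./results/",
    evaluation_strategy="steps",
    eval_steps=500,                 # integer = number of steps between evaluations
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=1,
    report_to="none",
)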

Has anyone here had success using ORPO on a T4 x2 setup?

@Chahnwoo - eval_steps=0.2 indicates it evaluates 5 times over the training run; a float smaller than 1 is interpreted as a fraction of the total training steps. I guess the time being taken could be because of the limited GPU memory available.

@celsowm - I could get it running on T4x2 (15GB VRAM each), although I’m still evaluating the performance. Setting per_device_train_batch_size to 1 is the only way I could get it working.

@mallik30 - The documentation for the transformers training arguments suggests that the eval_steps argument accepts int values to indicate the number of steps between evaluations. I apologize if it is different for ORPO.

Since your dataset is mostly composed of data that isn’t too lengthy, you might want to look into packing, which might make your process more efficient.
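
For reference, packing is exposed in TRL’s SFT tooling; a minimal sketch is below. Note that this uses SFTConfig/SFTTrainer, and I’m not sure whether ORPOTrainer offers an equivalent option:

from trl import SFTConfig

# Sketch only: packing concatenates short examples into fixed-length sequences,
# so each forward pass uses the full context window instead of padding.
sft_args = SFTConfig(
    output_dir="./results-sft/",
    packing=True,
)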

I’ve created an English version of this dataset: