Using 3 GPUs for training with the transformers Trainer()

Hello,

I have been trying for several days to use my 3 GPUs to train a model (tiiuae/falcon-7b) on a dataset (clips/mqa), but I keep running into a CUDA out of memory error.

When I use model.to("cuda:0"), GPU 0 reaches 100% utilization and memory usage.
The same happens when I target GPUs 1 and 2.

Then I found that device IDs can be passed directly to nn.DataParallel(model, device_ids=[0, 1, 2]). When I specify only one GPU, training runs on it, but as soon as I specify 2 or 3, training happens on the first one only.
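
For reference, this is roughly the form I tried with plain PyTorch (a minimal sketch; device_ids is the keyword argument of torch.nn.DataParallel, and the model has to sit on the first listed device):

import torch
import torch.nn as nn

# Sketch of the DataParallel attempt: the model is first moved to the first
# device in device_ids, and DataParallel then replicates it onto the other GPUs per batch.
model = model.to("cuda:0")
model = nn.DataParallel(model, device_ids=[0, 1, 2])

I also printed what the TrainingArguments report: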

torch.cuda.empty_cache()
training_args = TrainingArguments("test-trainer",
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
)
print("parallel_mode: ", training_args.parallel_mode)
print("n_gpus: ", training_args.n_gpu)

parallel_mode: NOT_DISTRIBUTED
n_gpu: 3

Here is my full code:

from transformers import TrainingArguments, AutoModelForSequenceClassification, Trainer, AutoTokenizer, DataCollatorWithPadding
from datasets import load_dataset
import os
import torch
from GPUtil import showUtilization as gpu_usage
import torch.nn as nn


def tokenize_function(dataset):
    # Pair each question name with its first answer text, using the globally defined tokenizer
    answers = []
    for answer in dataset["answers"]:
        answers.append(answer[0]['text'])
    return tokenizer(dataset["name"], answers, truncation=True)

try:
    torch.cuda.empty_cache()
    training_args = TrainingArguments("test-trainer",
        per_device_train_batch_size=10,
        per_device_eval_batch_size=10,
    )
    print("parallel_mode: ", training_args.parallel_mode)
    print("n_gpus: ", training_args.n_gpu)

    checkpoint = "tiiuae/falcon-7b"
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2, trust_remote_code=True)


    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    raw_datasets = load_dataset("clips/mqa", language="fr")
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    gpu_usage()

    # model = model.to(torch.device("cuda"))  # Move model to the first GPU
    # model = nn.DataParallel(model)  # Wrap model with nn.DataParallel
    trainer = Trainer(
        # model.module,
        model,
        training_args,
        train_dataset=tokenized_datasets['train'],
        # eval_dataset=tokenized_datasets['validation'],
        data_collator=data_collator,
        tokenizer=tokenizer,
        # n_gpu=3,
    )

    trainer.train()
    
    gpu_usage()
    trainer.save_model("test")

except Exception as e:
    print("dans erreur")
    gpu_usage()
    print("\033[91mErreur trainer.train(): {}\033[0m".format(e))

Has anyone already run into this issue?

Thank you for your help,
Paul

What are your GPU specs? If each of your GPUs has around 12 GB of memory like mine do, I don't believe you will be able to fit the model on a single GPU. My advice is to split the model's layers across several GPUs so that it fits; then you will be able to train it. Note that this approach severely lowers GPU utilization, since the bottleneck becomes memory read/write operations.

Here is a snippet I used to load the instruction fine-tuned version of Falcon 7B on 4 GPUs with the help of the accelerate library:

import torch
from transformers import AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True, device_map='auto', torch_dtype=torch.bfloat16)

max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["DecoderLayer", "Attention", "MLP", "LayerNorm", "Linear"],
    dtype='float16',
    low_zero=False,
)

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["DecoderLayer", "Attention", "MLP", "LayerNorm", "Linear"],
    dtype='float16'
)

model = dispatch_model(model, device_map=device_map)

You can pip install accelerate to use the required functions.

Also, when you use the Trainer API, it tends to create a copy of the model on each GPU, which will likely result in a CUDA OOM error since your model is very large.
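
As a rough back-of-the-envelope check (my own estimate, not a measurement), the weights of a 7B-parameter model already exceed a 12 GB card on their own, before activations, gradients, or optimizer state:

# Rough memory estimate for one full replica of a 7B-parameter model's weights.
# Real usage is higher once activations, gradients, and optimizer state are added.
n_params = 7e9
print(f"fp32 weights: {n_params * 4 / 1024**3:.1f} GiB")   # ~26 GiB
print(f"bf16 weights: {n_params * 2 / 1024**3:.1f} GiB")   # ~13 GiB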

After dispatch_model, do I need to use the accelerator.prepare() API for the learning rate, data collator, etc.?
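
This is roughly the pattern I have in mind (a sketch only; names like model, tokenized_datasets, and data_collator are taken from the code above):

from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader

accelerator = Accelerator()

# prepare() typically wraps the model, optimizer, dataloaders, and lr scheduler;
# plain helpers such as the data collator or a bare learning-rate value are not passed to it.
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=8, collate_fn=data_collator)
optimizer = AdamW(model.parameters(), lr=2e-5)
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)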