11B model goes OOM with DeepSpeed ZeRO-3 on 8x 32 GB V100s

Can anyone help me with single-node distributed training? My custom model merges some layers of two 7B Llama models the way MoE models do, which yields an 11B model. I load the model in torch.float16, except for the lm_head weight, which is kept in torch.float32. I assumed training would need at least 11B * 2 bytes (fp16) * 2 (weights plus gradients) = 44 GB of GPU RAM in total. I used estimate_zero3_model_states_mem_needs_all_live to estimate the memory a ZeRO-3 setup would need and created an accelerate config file accordingly.
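
Roughly, I called the estimator like this (a minimal sketch; `model` is the merged model already built on CPU):

    from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

    # Prints the estimated per-GPU / CPU memory needed for the ZeRO-3 model states,
    # with and without optimizer/parameter offload.
    estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)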


The model loads successfully, and it seems to be sharded evenly across the 8 GPUs, since each GPU's memory usage sits at around 6000 MiB. However, as soon as the first forward call starts, the run fails with an OOM error. By the way, 8 GPUs already feels a bit extravagant for a model of this size, but I had no luck with 2 or 4 GPUs either.
This is my accelerate config (the answers I gave to `accelerate config`):

    In which compute environment are you running? This machine
    Which type of machine are you using? multi-GPU
    How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
    Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: NO
    Do you wish to optimize your script with torch dynamo? [yes/NO]: NO
    Do you want to use DeepSpeed? [yes/NO]: yes
    Do you want to specify a json file to a DeepSpeed config? [yes/NO]: NO
    What should be your DeepSpeed's ZeRO optimization stage? 3
    Where to offload optimizer states? none
    Where to offload parameters? none
    How many gradient accumulation steps you're passing in your script? [1]: 1
    Do you want to use gradient clipping? [yes/NO]: NO
    Do you want to save 16-bit model weights when using ZeRO Stage-3? [yes/NO]: yes
    Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: NO
    How many GPU(s) should be used for distributed training? [1]: 8
    Do you wish to use FP16 or BF16 (mixed precision)? fp16
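
As far as I understand accelerate, the same settings could also be passed in code through DeepSpeedPlugin instead of the config file. I use the config file; the snippet below is only to spell out the effective ZeRO-3 settings:

    from accelerate import Accelerator
    from accelerate.utils import DeepSpeedPlugin

    # Mirrors the answers above: ZeRO stage 3, no offload, save 16-bit weights,
    # no deepspeed.zero.Init, 1 gradient accumulation step, fp16 mixed precision.
    ds_plugin = DeepSpeedPlugin(
        zero_stage=3,
        gradient_accumulation_steps=1,
        offload_optimizer_device="none",
        offload_param_device="none",
        zero3_init_flag=False,
        zero3_save_16bit_model=True,
    )
    accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=ds_plugin)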

I implemented my custom model by inheriting from and overriding classes in the transformers library. Inside the MoeLlamaForCausalLM class, the two source models are loaded with LlamaForCausalLM.from_pretrained() and their weights are copied into my model layer by layer.
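
Roughly, the copy step looks like this (a simplified sketch; `merged_model` stands for the freshly constructed merged model, and attribute names like `experts` are placeholders, the real class is more involved):

    import torch
    from transformers import LlamaForCausalLM

    # Simplified sketch of the weight-copy step; paths and attribute names are placeholders.
    expert_a = LlamaForCausalLM.from_pretrained("path/to/llama-7b-A", torch_dtype=torch.float16)
    expert_b = LlamaForCausalLM.from_pretrained("path/to/llama-7b-B", torch_dtype=torch.float16)

    with torch.no_grad():
        for merged_layer, layer_a, layer_b in zip(
            merged_model.model.layers, expert_a.model.layers, expert_b.model.layers
        ):
            # shared sublayers (attention, norms) are taken from model A
            merged_layer.self_attn.load_state_dict(layer_a.self_attn.state_dict())
            # each source model contributes one expert MLP
            merged_layer.experts[0].load_state_dict(layer_a.mlp.state_dict())
            merged_layer.experts[1].load_state_dict(layer_b.mlp.state_dict())

    # free the source models once their weights have been copied
    del expert_a, expert_b

This is my training script: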

    import datasets
    import torch
    import transformers
    from accelerate import Accelerator
    from accelerate.utils import DummyOptim, DummyScheduler
    from datasets import load_dataset
    from torch.utils.data import DataLoader
    from tqdm.auto import tqdm
    from transformers import AutoTokenizer, get_scheduler

    # MoeLlamaForCausalLM and `config` come from my custom modeling code (omitted here)
    model = MoeLlamaForCausalLM(config, trainable=True)

    for name, param in model.named_parameters():
        if param.ndim == 1:
            # cast the small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)
        if "lora" in name:
            param.requires_grad = True
        else:
            param.requires_grad = False

    tokenizer = AutoTokenizer.from_pretrained(...)
    tokenizer.pad_token = tokenizer.eos_token

    dataset = load_dataset(...)

    training_config = {
        "learning_rate": 5e-5,
        "gradient_accumulation_steps": 1,
        "batch_size": 2,
        "num_epochs": 1,
        "save_step": 100,
        "save_dir": "./accelerate_output",
        "lr_scheduler_type": "linear",
        "num_warmup_steps": 0
    }

    accelerator = Accelerator(mixed_precision='fp16')

    if accelerator.is_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    # Creates DummyOptim if `optimizer` was specified in the DeepSpeed config, else creates an AdamW optimizer
    optimizer_cls = (
        torch.optim.AdamW
        if accelerator.state.deepspeed_plugin is None
        or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
        else DummyOptim
    )
    optimizer = optimizer_cls(model.parameters(), lr=training_config["learning_rate"])

    train_dataloader = DataLoader(dataset, shuffle=True, batch_size=training_config["batch_size"])

    num_training_steps = training_config["num_epochs"] * len(train_dataloader)
    # Creates DummyScheduler if `scheduler` was specified in the DeepSpeed config, else creates the scheduler given by `lr_scheduler_type`
    if (
        accelerator.state.deepspeed_plugin is None
        or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
    ):
        lr_scheduler = get_scheduler(
            training_config["lr_scheduler_type"],
            optimizer=optimizer,
            num_warmup_steps=training_config["num_warmup_steps"],
            num_training_steps=num_training_steps,
        )
    else:
        lr_scheduler = DummyScheduler(
            optimizer, total_num_steps=num_training_steps, warmup_num_steps=training_config["num_warmup_steps"]
        )

    train_dataloader, model, optimizer, lr_scheduler = accelerator.prepare(
        train_dataloader, model, optimizer, lr_scheduler
    )

    progress_bar = tqdm(range(training_config["num_epochs"] * len(train_dataloader)), disable=not accelerator.is_main_process)
    step = 0
    model.train()
    for epoch in range(training_config["num_epochs"]):
        for batch in train_dataloader:
            optimizer.zero_grad()
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()

            progress_bar.update(1)
            step += 1

            if step % training_config["save_step"] == 0:
                # With ZeRO-3 the weights are sharded, so `get_state_dict` is a collective
                # call: every rank has to reach this block, not only the main process.
                accelerator.wait_for_everyone()
                unwrapped_model = accelerator.unwrap_model(model)
                unwrapped_model.save_pretrained(
                    training_config["save_dir"],
                    save_function=accelerator.save,
                    state_dict=accelerator.get_state_dict(model),
                    is_main_process=accelerator.is_main_process,
                )

Thanks in advance for any advice!