LLaMA2 7B uses > 128 GB of GPU RAM and fails with OOM or Loss Scale Minimum

Goal

Continue pretraining the meta-llama/Llama-2-7b-hf transformer on custom text data.

Software Approach

  • datasets 2.13 to load data
  • Trainer from transformers 4.32.0.dev0 for training
  • deepspeed 0.10.0 for multi-GPU training

Hardware Details

  • 1 Machine
  • either 4x NVIDIA V100 (32 GB)
  • or 8x NVIDIA RTX 2080 Ti (11 GB)

Problem

  • In ZeRO Stage 2 the code exits with an OOM on each 32 GB GPU

  • In ZeRO Stage 3 the code exits with an OOM on each 32 GB GPU

  • In ZeRO Stage 2 + CPU offload the code exits with an OOM if CPU memory is below 178 GB

  • In ZeRO Stage 3 + CPU offload the code exits with an OOM if CPU memory is below 178 GB

  • With Stage 2/3 + CPU offload the code eventually exits because the loss scale is reduced to 1 (CPU RAM set to 256 GB)

  • In any stage the code exits with an OOM if train_batch_size > 1

Questions

Why is every GPU fully used when running Stage 2/3 without CPU offload, and why does this end in an OOM?

Why do I need > 172 GB of memory allocated to my SLURM process to use Stage 2/3 + CPU offloading?

Why does the loss scale rapidly decrease from 65536 to 1 and eventually abort the run, and how can I prevent this?

Why am I forced to use a micro train batch size of 1 in every configuration to avoid an OOM?
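
For context, a rough back-of-the-envelope estimate (my own arithmetic, following the commonly cited ~16 bytes per parameter of model states for fp16 training with Adam) already lands in this range before activations, fragmentation and communication buffers are counted:

# Model states only, for a 7B-parameter model trained in fp16 with Adam.
params = 7e9
fp16_weights   = 2 * params   # 14 GB
fp16_gradients = 2 * params   # 14 GB
fp32_optimizer = 12 * params  # fp32 master weights + Adam momentum + variance = 84 GB

total_bytes = fp16_weights + fp16_gradients + fp32_optimizer
print(f"model states only: {total_bytes / 2**30:.0f} GiB")  # ~104 GiB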

Setup

The training code exactly follows the CLM training example from the Transformers GitHub repository.

SLURM script used to start the job (some information is redacted):

#!/bin/bash
#SBATCH --job-name=llama2-7b-3gpu        # name
#SBATCH --nodes=1                        # nodes
#SBATCH --ntasks-per-node=1              # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=8
#SBATCH --partition=clara
#SBATCH --mem=256G
#SBATCH --gres=gpu:v100:3                # number of gpus
#SBATCH --output=logs/%x-%j.out          # output file name
#SBATCH --mail-type=ALL

module load Python
module load PyTorch
source .env/bin/activate

srun pip install git+https://github.com/huggingface/transformers
srun pip install typing-extensions
srun pip install datasets
srun pip install torch
srun pip install accelerate
srun pip install pytest
srun pip install scikit-learn
srun pip install evaluate
srun pip install sentencepiece
srun pip install deepspeed


# Torch Distributed Run Variables
export GPUS_PER_NODE=3
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=9907

export NCCL_DEBUG=INFO
export NCCL_IGNORE_DISABLED_P2P=1
# Model Arguments
MODEL_NAME=meta-llama/Llama-2-7b-hf
CACHE_DIR=./cache
USE_FAST_TOKENIZER=false
MODEL_REVISION=main
USE_AUTH_TOKEN=true
HUGGING_TOKEN=<REDACTED>
TORCH_DTYPE=auto
LOW_CPU_MEM_USAGE=false

# DataTraining Arguments
TRAIN_FILE=./input/health_information_systems_epub.md
MAX_TRAIN_SAMPLES=None
OVERWRITE_CACHE=false
VALIDATION_SPLIT_PERCENTAGE=5
PREPROCESSING_NUM_WORKERS=1
KEEP_LINEBREAKS=true

# Training Arguments
OUTPUT_DIR=./trained/7B
OVERWRITE_OUTPUT_DIR=true
DO_TRAIN=true
DO_EVAL=false
PER_DEVICE_TRAIN_BATCH_SIZE=1
PER_DEVICE_EVAL_BATCH_SIZE=1
BLOCK_SIZE=4096
EVALUATION_STRATEGY=steps
EVAL_STEPS=100
LEARNING_RATE=3e-4
WEIGHT_DECAY=0.1
ADAM_BETA1=0.9
ADAM_BETA2=0.95
ADAM_EPSILON=1e-5
MAX_GRAD_NORM=1.0
NUM_TRAIN_EPOCHS=3
LR_SCHEDULER_TYPE=cosine
WARMUP_STEPS=0
LOG_LEVEL=passive
SAVE_STRATEGY=steps
SAVE_STEPS=500
SAVE_TOTAL_LIMIT=1
NO_CUDA=false
SEED=42
FP16=false
BF16=false
HALF_PRECISION_BACKEND=auto
DDP_BACKEND=nccl
DEEPSPEED=./ds_configs/stage2_offload.json
OPTIM=adamw_torch

echo "srun --jobid $SLURM_JOBID bash -c \"NCCL_DEBUG=INFO deepspeed "
echo "--num_gpus=$GPUS_PER_NODE "
echo "03_train_llama2.py "
echo "--model_name $MODEL_NAME "
echo "--cache_dir $CACHE_DIR "
echo "--use_fast_tokenizer $USE_FAST_TOKENIZER "
echo "--model_revision $MODEL_REVISION "
echo "--use_auth_token $USE_AUTH_TOKEN "
echo "--hugging_token $HUGGING_TOKEN "
echo "--torch_dtype $TORCH_DTYPE "
echo "--low_cpu_mem_usage $LOW_CPU_MEM_USAGE "
echo "--train_file $TRAIN_FILE "
echo "--max_train_samples $MAX_TRAIN_SAMPLES "
echo "--overwrite_cache $OVERWRITE_CACHE "
echo "--validation_split_percentage $VALIDATION_SPLIT_PERCENTAGE "
echo "--preprocessing_num_workers $PREPROCESSING_NUM_WORKERS "
echo "--keep_linebreaks $KEEP_LINEBREAKS "
echo "--output_dir $OUTPUT_DIR "
echo "--overwrite_output_dir $OVERWRITE_OUTPUT_DIR "
echo "--do_train $DO_TRAIN "
echo "--do_eval $DO_EVAL "
echo "--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE "
echo "--per_device_eval_batch_size $PER_DEVICE_EVAL_BATCH_SIZE "
echo "--block_size $BLOCK_SIZE "
echo "--evaluation_strategy $EVALUATION_STRATEGY "
echo "--eval_steps $EVAL_STEPS "
echo "--learning_rate $LEARNING_RATE "
echo "--weight_decay $WEIGHT_DECAY "
echo "--adam_beta1 $ADAM_BETA1 "
echo "--adam_beta2 $ADAM_BETA2 "
echo "--adam_epsilon $ADAM_EPSILON "
echo "--max_grad_norm $MAX_GRAD_NORM "
echo "--num_train_epochs $NUM_TRAIN_EPOCHS "
echo "--lr_scheduler_type $LR_SCHEDULER_TYPE "
echo "--warmup_steps $WARMUP_STEPS "
echo "--log_level $LOG_LEVEL "
echo "--save_strategy $SAVE_STRATEGY "
echo "--save_steps $SAVE_STEPS "
echo "--save_total_limit $SAVE_TOTAL_LIMIT "
echo "--no_cuda $NO_CUDA "
echo "--seed $SEED "
echo "--fp16 $FP16 "
echo "--bf16 $BF16 "
echo "--half_precision_backend $HALF_PRECISION_BACKEND "
echo "--local_rank $SLURM_PROCID "
echo "--ddp_backend $DDP_BACKEND "
echo "--deepspeed $DEEPSPEED "
echo "--optim $OPTIM\""

srun --jobid $SLURM_JOBID bash -c "NCCL_DEBUG=INFO deepspeed \
--num_gpus=$GPUS_PER_NODE \
03_train_llama2.py \
--model_name $MODEL_NAME \
--cache_dir $CACHE_DIR \
--use_fast_tokenizer $USE_FAST_TOKENIZER \
--model_revision $MODEL_REVISION \
--use_auth_token $USE_AUTH_TOKEN \
--hugging_token $HUGGING_TOKEN \
--torch_dtype $TORCH_DTYPE \
--low_cpu_mem_usage $LOW_CPU_MEM_USAGE \
--train_file $TRAIN_FILE \
--max_train_samples $MAX_TRAIN_SAMPLES \
--overwrite_cache $OVERWRITE_CACHE \
--validation_split_percentage $VALIDATION_SPLIT_PERCENTAGE \
--preprocessing_num_workers $PREPROCESSING_NUM_WORKERS \
--keep_linebreaks $KEEP_LINEBREAKS \
--output_dir $OUTPUT_DIR \
--overwrite_output_dir $OVERWRITE_OUTPUT_DIR \
--do_train $DO_TRAIN \
--do_eval $DO_EVAL \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--per_device_eval_batch_size $PER_DEVICE_EVAL_BATCH_SIZE \
--block_size $BLOCK_SIZE \
--evaluation_strategy $EVALUATION_STRATEGY \
--eval_steps $EVAL_STEPS \
--learning_rate $LEARNING_RATE \
--weight_decay $WEIGHT_DECAY \
--adam_beta1 $ADAM_BETA1 \
--adam_beta2 $ADAM_BETA2 \
--adam_epsilon $ADAM_EPSILON \
--max_grad_norm $MAX_GRAD_NORM \
--num_train_epochs $NUM_TRAIN_EPOCHS \
--lr_scheduler_type $LR_SCHEDULER_TYPE \
--warmup_steps $WARMUP_STEPS \
--log_level $LOG_LEVEL \
--save_strategy $SAVE_STRATEGY \
--save_steps $SAVE_STEPS \
--save_total_limit $SAVE_TOTAL_LIMIT \
--no_cuda $NO_CUDA \
--seed $SEED \
--fp16 $FP16 \
--bf16 $BF16 \
--half_precision_backend $HALF_PRECISION_BACKEND \
--local_rank $SLURM_PROCID \
--ddp_backend $DDP_BACKEND \
--optim $OPTIM \
--deepspeed $DEEPSPEED "

DeepSpeed Configuration:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": { 
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "allgather_bucket_size": 2e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "gradient_clipping": 1.0,
    "steps_per_print": 500,
    "wall_clock_breakdown": false,
    "train_micro_batch_size_per_gpu": "auto"
}

Logs

DeepSpeed ds_report output (some output redacted):

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
[WARNING]  async_io: please install the libaio-devel package with yum
[WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO]....... [NO]
cpu_adagrad ............ [NO]....... [OKAY]
cpu_adam ............... [NO]....... [OKAY]
fused_adam ............. [NO]....... [OKAY]
fused_lamb ............. [NO]....... [OKAY]
quantizer .............. [NO]....... [OKAY]
random_ltd ............. [NO]....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO]....... [NO]
spatial_inference ...... [NO]....... [OKAY]
transformer ............ [NO]....... [OKAY]
stochastic_transformer . [NO]....... [OKAY]
transformer_inference .. [NO]....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch']
torch version .................... 1.12.1
deepspeed install path ........... ['/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.7

Error Logs for Loss Scale:

[2023-07-22 15:59:32,657] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-07-22 15:59:37,102] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-07-22 15:59:41,552] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[2023-07-22 15:59:45,994] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
[2023-07-22 15:59:50,438] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
[2023-07-22 16:00:44,332] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
[2023-07-22 16:01:00,722] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024

  0%|          | 0/102 [00:00<?, ?it/s]
  1%|          | 1/102 [00:05<09:58,  5.93s/it]
  2%|▏         | 2/102 [00:10<08:25,  5.06s/it]
...
47%|████▋     | 48/102 [07:31<04:15,  4.74s/it]Traceback (most recent call last):
  File "/home/<REDACTED>/LLaMA_Training/03_train_llama2.py", line 444, in <module>
    main()
  File "/home/<REDACTED>LLaMA_Training/03_train_llama2.py", line 410, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 1532, in train
    return inner_training_loop(
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 1805, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 2659, in training_step
    self.accelerator.backward(loss)
  File "/home/<REDACTED>LLaMA_Training/.env/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 176, in backward
    self.engine.step()
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2087, in step
    self._take_model_step(lr_kwargs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1994, in _take_model_step
    self.optimizer.step()
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1662, in step
    self._update_scale(self.overflow)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1908, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

Error Logs for Stage 2/3 OOM without CPU Offload:

   0%|          | 0/51 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/<REDACTED>/LLaMA_Training/03_train_llama2.py", line 444, in <module>
    main()
  File "/home/<REDACTED>/LLaMA_Training/03_train_llama2.py", line 410, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 1532, in train
    return inner_training_loop(
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 1805, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 2648, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 2673, in compute_loss
    outputs = model(**inputs)
  File "/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1769, in forward
    loss = self.module(*inputs, **kwargs)
  File "/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 810, in forward
    outputs = self.model(
  File "/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 698, in forward
    layer_outputs = decoder_layer(
  File "/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 322, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 186, in apply_rotary_pos_emb
    q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 31.75 GiB total capacity; 30.62 GiB already allocated; 15.69 MiB free; 30.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any ideas, hints, approaches, or Colab notebooks you can provide are much appreciated!
I am working on this in the context of my master's thesis, and whether this project works or fails more or less decides whether I get my degree.

Reducing the block_size from 1024 to 256 solves this in part.
The loss scale still gets reduced, but with a small enough dataset the training completes successfully, because the loss scale is not reduced all the way to the minimum before training finishes.
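
For what it is worth, a rough sketch (my own estimate, assuming no gradient checkpointing and the eager attention implementation that materialises the full attention-weight matrices) of why a smaller block_size helps: the stored attention weights alone grow quadratically with the sequence length.

# Approximate fp16 attention-weight activations kept for the backward pass, assuming one
# (batch, heads, seq, seq) tensor per layer (Llama-2-7B: 32 layers, 32 heads).
def attn_weight_bytes(block_size, n_layers=32, n_heads=32, bytes_per_el=2, batch=1):
    return batch * n_layers * n_heads * block_size ** 2 * bytes_per_el

for bs in (256, 1024, 4096):
    print(bs, f"{attn_weight_bytes(bs) / 2**30:.2f} GiB")
# 256 -> 0.12 GiB, 1024 -> 2.00 GiB, 4096 -> 32.00 GiB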

Did you manage to solve the loss scale issue?
Reducing the learning rate and increasing the effective batch size (via gradient_accumulation_steps > 1) can sometimes help.
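
A minimal sketch of that suggestion expressed as Trainer arguments (the concrete values are illustrative, not tuned):

from transformers import TrainingArguments

# Illustrative values only: keep the micro batch at 1, grow the effective batch size via
# gradient accumulation, and lower the learning rate compared with the 3e-4 used above.
training_args = TrainingArguments(
    output_dir="./trained/7B",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # effective batch = 1 * 16 * number_of_gpus
    learning_rate=1e-5,
    fp16=True,
    deepspeed="./ds_configs/stage2_offload.json",
)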

I suggest you look at Parameter-Efficient Fine-Tuning (PEFT).

It requires much less memory than fine-tuning the whole model.
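
A minimal LoRA sketch using the peft library (the hyperparameters are illustrative, and target_modules assumes the attention projection names of the Llama architecture):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # LoRA adapters on the attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 7B parameters stays trainable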

This issue is also posted in the DeepSpeed GitHub repository ([BUG] Loss scale already at minimum - Training LlaMA2 7B via HF+deepspeed consistently fails · Issue #4017 · microsoft/DeepSpeed · GitHub).

As suggested there, and as also noted on the Hugging Face website, using float16 with models that were pretrained in bfloat16 can result in loss overflow errors.

It is best advised to continue with bf16 (especially with Llama 2 models).
There is also a pull request for the DeepSpeed library that might solve this issue, but it has not been reviewed or merged as of 17 August 2023.
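
A small sketch of that switch (my own example, assuming GPUs with bfloat16 support, i.e. Ampere or newer; the V100 and 2080 Ti mentioned above do not support bf16):

import json

# Convert the existing DeepSpeed config from fp16 loss scaling to bf16 (no loss scaling needed).
with open("./ds_configs/stage2_offload.json") as f:
    cfg = json.load(f)

cfg.pop("fp16", None)              # drop the fp16 section and its loss-scale settings
cfg["bf16"] = {"enabled": "auto"}  # follows the Trainer's --bf16 flag

with open("./ds_configs/stage2_offload_bf16.json", "w") as f:
    json.dump(cfg, f, indent=4)

The training run would then be started with --bf16 true and --deepspeed ./ds_configs/stage2_offload_bf16.json instead of --fp16.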