Hello
I’m fine-tuning EleutherAI/gpt-j-6B on a conversational dataset (TriviaQA with 80,000 rows) using DeepSpeed on a 24 GB GPU with 150 GB of system RAM. The DeepSpeed config file and the HF TrainingArguments are shown below.
During training I often get the message "OVERFLOW! Rank 0 Skipping step.", after which the loss scale is reduced. After a short while, training aborts with "Current loss scale already at minimum - cannot decrease scale anymore. Exiting run."
When I look at the loss, it also fluctuates heavily.
Is there an error in my DeepSpeed config or the TrainingArguments, or how should I change the hyperparameters? Hyperparameter tuning is difficult here because of the model size and the long runtime per experiment.
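For context, this is how I understand the dynamic loss scaler that produces these messages (a simplified sketch based on the values in my fp16 config below, not DeepSpeed's actual implementation; hysteresis is ignored):

class LossScaleSketch:
    """Simplified idea of DeepSpeed's dynamic loss scaling (hysteresis ignored)."""

    def __init__(self):
        self.scale = 2.0 ** 16       # initial_scale_power = 16
        self.min_scale = 1.0         # min_loss_scale = 1
        self.window = 1000           # loss_scale_window = 1000
        self.good_steps = 0

    def update(self, overflowed: bool) -> None:
        if overflowed:
            # "OVERFLOW! Rank 0 Skipping step." -- the optimizer step is skipped
            self.good_steps = 0
            if self.scale <= self.min_scale:
                raise RuntimeError(
                    "Current loss scale already at minimum - cannot decrease scale anymore."
                )
            self.scale = max(self.scale / 2.0, self.min_scale)
        else:
            self.good_steps += 1
            if self.good_steps >= self.window:
                # a full window without overflow -> try a larger scale again
                self.scale *= 2.0
                self.good_steps = 0

So in my run the gradients apparently keep overflowing, the scale is halved again and again until it reaches min_loss_scale, and then the run aborts.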
Second, I’m also getting the warning "FP16 params for CPUAdam may not work on AMD CPUs". I’m running on an AMD CPU with FP16 enabled. Is this a problem?
from transformers import TrainingArguments

training_args = TrainingArguments(
    disable_tqdm=True,
    output_dir='/checkpoints',
    save_total_limit=5,
    logging_dir='/logs',
    num_train_epochs=3,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    logging_steps=10,
    overwrite_output_dir=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    eval_accumulation_steps=4,
    gradient_checkpointing=True,
    max_grad_norm=0.5,
    lr_scheduler_type="constant_with_warmup",
    learning_rate=7e-5,
    warmup_ratio=0.05,
    weight_decay=0.1,
    fp16_full_eval=True,
    fp16=True,
    fp16_opt_level='O1',
    deepspeed="configs/ds_config.json",
    report_to=['tensorboard'],
)
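For completeness, this is roughly how the arguments are passed to the Trainer; the dataset variables are placeholders for the tokenized TriviaQA splits, which are prepared elsewhere:

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # tokenized TriviaQA train split (prepared elsewhere)
    eval_dataset=eval_dataset,     # tokenized validation split (prepared elsewhere)
    tokenizer=tokenizer,
)
trainer.train()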
configs/ds_config.json:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "opt_level": "O1"
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e7,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e7,
        "stage3_max_reuse_distance": 1e7,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "activation_checkpointing": {
        "partitioned_activations": true,
        "number_checkpoints": 100,
        "contiguous_memory_optimization": true,
        "cpu_checkpointing": true,
        "profile": true,
        "synchronize_checkpoint_boundary": true
    },
    "tensorboard": {
        "enabled": true,
        "output_path": "/cluster/scratch/wrafael/cache/models/logs",
        "job_name": "finetune_gpt_j_6b"
    },
    "steps_per_print": 100,
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "memory_breakdown": false
}
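As a side note, I sanity-check that the config parses and that the batch-size arithmetic lines up before launching; with the HF integration, the "auto" entries get filled in from the TrainingArguments at runtime:

import json

with open("configs/ds_config.json") as f:
    ds_config = json.load(f)

# On a single GPU the effective global batch size is
# per_device_train_batch_size * gradient_accumulation_steps * world_size.
world_size = 1
expected_train_batch_size = 4 * 4 * world_size
print("fp16 section:", ds_config["fp16"])
print("expected train_batch_size:", expected_train_batch_size)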