Overflow when using DeepSpeed for GPT-J (training aborts)

Hello

I’m fine-tuning EleutherAI/gpt-j-6B on a conversational dataset (TriviaQA, 80,000 rows) using DeepSpeed on a 24 GB GPU with 150 GB of RAM. The DeepSpeed config file and the HF TrainingArguments are shown below.

During training I often get the message OVERFLOW! Rank 0 Skipping step. The loss scale is then reduced. After a short while, training aborts with Current loss scale already at minimum - cannot decrease scale anymore. Exiting run. When I look at the loss, it also oscillates a lot.

Is there an error in my DeepSpeed config or the TrainingArguments, or how should I change the hyperparameters? Hyperparameter tuning is quite difficult because of the large model size and the long runtime.

Second, I’m also getting the warning FP16 params for CPUAdam may not work on AMD CPUs. I’m using an AMD CPU with FP16. Is this a problem?

    training_args = TrainingArguments(
        disable_tqdm=True,
        output_dir='/checkpoints',
        save_total_limit=5,
        logging_dir='/logs',
        num_train_epochs=3,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        logging_steps=10,
        overwrite_output_dir=True,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=4,
        eval_accumulation_steps=4,
        gradient_checkpointing=True,
        max_grad_norm=0.5,
        lr_scheduler_type="constant_with_warmup",
        learning_rate=7e-5,
        warmup_ratio=0.05,
        weight_decay=0.1,
        fp16_full_eval=True,
        fp16=True,
        fp16_opt_level='O1',
        deepspeed="configs/ds_config.json",
        report_to=['tensorboard']
    )

configs/ds_config.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "opt_level": "O1"
    },

    "zero_optimization": {
        "stage": 2,

        "offload_optimizer": {
            "device": "cpu"
        },

        "allgather_partitions": true,
        "allgather_bucket_size":  2e8 ,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,

        "contiguous_gradients": true,

        "sub_group_size": 1e7,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e7,
        "stage3_max_reuse_distance": 1e7,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

  "activation_checkpointing": {
      "partitioned_activations":true,
      "number_checkpoints": 100,
      "contiguous_memory_optimization": true,
      "cpu_checkpointing": true,
      "profile": true,
      "synchronize_checkpoint_boundary": true
    },

    "tensorboard": {
      "enabled": true,
      "output_path": "/cluster/scratch/wrafael/cache/models/logs",
      "job_name": "finetune_gpt_j_6b"
    },

    "steps_per_print": 100,
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "memory_breakdown": false
}

Hi Eichhof,
I’ve run into the same problem, but I haven’t been able to figure it out either.
I just found a suggestion from @myleott and am about to try it.
If you have fixed this problem in another way, I’d be grateful if you could share it.


Hi. I am running into a similar issue. Did anyone find the cause? @Boman

I believe you have to fine-tune it in bf16 instead of fp16. bf16 has roughly the same dynamic range as fp32, so activations that overflow in fp16 (and trigger the loss-scale collapse) stay representable.

In TrainingArguments:

    training_args = TrainingArguments(
        ...
        bf16=True,
        ...
    )

In DeepSpeed config:

{
    ...
    "bf16": { "enabled": "auto" }
    ...
}
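
For reference, here is a minimal sketch of how the two pieces could fit together, assuming the DeepSpeed config is passed to the Trainer as a dict (a path to a JSON file works as well). Paths and hyperparameters are placeholders, not values tested on this setup; the key changes are bf16=True in TrainingArguments and a bf16 section replacing the fp16 one in the DeepSpeed config. Note that bf16 needs hardware support (e.g. NVIDIA Ampere or newer).

    from transformers import TrainingArguments

    # Minimal sketch: bf16 everywhere instead of fp16.
    # The DeepSpeed config is given as a dict; the "auto" values are filled in
    # by the HF Trainer from the TrainingArguments below.
    ds_config = {
        "bf16": {"enabled": "auto"},  # replaces the "fp16" section
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu"},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
                "weight_decay": "auto",
            },
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }

    training_args = TrainingArguments(
        output_dir="/checkpoints",      # placeholder path
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        gradient_checkpointing=True,
        learning_rate=7e-5,
        bf16=True,                      # instead of fp16=True / fp16_full_eval=True
        deepspeed=ds_config,
    )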

See also this comment about root cause, implications and 4 next steps: FloatingPointError: Minimum loss scale reached (0.0001). · Issue #1529 · facebookresearch/fairseq · GitHub
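
If bf16 is not available on the hardware (pre-Ampere GPUs), one commonly suggested mitigation is to make the fp16 loss scaler less aggressive and to lower the learning rate. Here is a sketch of a relaxed fp16 section, using the same keys as in ds_config.json above; the values are illustrative starting points, not tested on this setup.

    # Relaxed dynamic loss scaling for the "fp16" section of the DeepSpeed config.
    # Keys are the standard DeepSpeed fp16 options; values are illustrative.
    fp16_section = {
        "enabled": "auto",
        "loss_scale": 0,             # 0 = dynamic scaling; a fixed value (e.g. 128) is the alternative
        "initial_scale_power": 12,   # start below 2**16 to avoid an early overflow cascade
        "loss_scale_window": 2000,   # wait longer before raising the scale again
        "hysteresis": 4,             # tolerate more consecutive overflows before lowering the scale
        "min_loss_scale": 1,
    }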
