DeepSpeed ZeRO-2 CPU offloading: process killed with return code = -9

Hi,

I am using DeepSpeed ZeRO-2 with CPU offloading to fine-tune an LLM.
I keep getting an error like the one below, without any detailed error description.

[2023-10-26 17:54:44,801] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2454240
[2023-10-26 17:54:48,155] [ERROR] [launch.py:321:sigkill_handler] ['/data_new/sjy98/polyglot-ko/data-parallel/deepspeed-venv/bin/python3', '-u', 'deepspeed-trainer.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_2.json'] exits with return code = -9

I found out that an error message like this is usually caused by memory issues.
The thing is, when I reduce my training data size from 54,000 to 30,000 examples, it works fine.
However, I keep getting the error whenever I increase the training data size again.

It is a little hard to believe that the size of the training data causes a memory issue, but is that possible when using DeepSpeed ZeRO-2 CPU offload?
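
For reference, this is a minimal sketch (assuming psutil is installed) of how I could log host memory during training to confirm whether RAM actually climbs with the larger dataset; the helper name is hypothetical and not part of my trainer script:

import psutil

def log_host_memory(step):
    # Hypothetical helper: call it every N training steps to watch host RAM.
    proc = psutil.Process()
    rss_gib = proc.memory_info().rss / 1024**3
    avail_gib = psutil.virtual_memory().available / 1024**3
    print(f"[step {step}] process RSS: {rss_gib:.1f} GiB, system RAM available: {avail_gib:.1f} GiB")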

Below is my deepspeed_config.

{
    "fp16": {
        "enabled": false
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },
    "communication_data_type": "fp32",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Also, I am using 4 A100 80GB GPUs for parallel training.
Any suggestions or thoughts would be very helpful. Thanks!

It looks like an OOM problem: you are running out of CPU memory. This issue may help: Finetune T5 11B and the process is killed. exits with return code = -9 [BUG] · Issue #2946 · microsoft/DeepSpeed · GitHub
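
For a rough sense of why CPU offloading can exhaust host RAM, here is a back-of-envelope sketch. It assumes an Adam-style optimizer whose states are kept in fp32 on the CPU; the 5.8B parameter count is only a guess based on the polyglot-ko path and should be replaced with the actual model size:

# Back-of-envelope host-RAM estimate for ZeRO-2 + CPU offload.
# Assumption: fp32 Adam states (momentum + variance) plus an fp32 copy of the
# parameters held by the CPU optimizer, roughly 12 bytes per parameter in total
# across all ranks on the node.
num_params = 5.8e9      # assumed model size; substitute the real one
bytes_per_param = 12    # assumed: 4 (momentum) + 4 (variance) + 4 (fp32 param copy)
print(f"~{num_params * bytes_per_param / 1024**3:.0f} GiB of host RAM for offloaded optimizer state")
# Anything else held in RAM per rank (e.g. a fully tokenized training set,
# pinned buffers used for offloading) comes on top of this, so a larger dataset
# can be enough to trigger the OOM killer, which shows up as return code -9 (SIGKILL).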
