Hi,
I am using DeepSpeed ZeRO-2 with CPU offloading to fine-tune an LLM.
I keep getting an error like the one below, with no detailed error description:
[2023-10-26 17:54:44,801] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2454240
[2023-10-26 17:54:48,155] [ERROR] [launch.py:321:sigkill_handler] ['/data_new/sjy98/polyglot-ko/data-parallel/deepspeed-venv/bin/python3', '-u', 'deepspeed-trainer.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_2.json'] exits with return code = -9
From what I have found, a return code of -9 means the process was killed with SIGKILL, which usually points to a memory problem (e.g. the Linux OOM killer), but there is no traceback to confirm it.
The strange part is that when I reduce my training data from 54,000 to 30,000 examples, training works fine.
However, the error comes back as soon as I increase the data size again.
It is a little hard to believe that the number of training examples alone causes a memory issue, but is that possible when using DeepSpeed ZeRO-2 with CPU offload?
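For context, the data size reduction is nothing more than capping the tokenized dataset before training, roughly like the sketch below (the file path, model name, and max_length are placeholders rather than my exact script):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Rough sketch of my data preparation; the path and model name are placeholders.
# Note that the entire tokenized dataset lives in host (CPU) memory.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

raw = load_dataset("json", data_files="train.jsonl", split="train")
tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw.column_names)

# Capping at 30000 examples trains fine; the full 54000 gets killed with -9.
train_dataset = tokenized.select(range(30000))
```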
Below is my DeepSpeed config (deepspeed_config_2.json):
{
    "fp16": {
        "enabled": false
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },
    "communication_data_type": "fp32",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
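For completeness, the config file is passed to the Hugging Face Trainer roughly like this (a simplified sketch, not my exact deepspeed-trainer.py; the hyperparameters and model name are placeholders, and the "auto" entries in the JSON are resolved from these TrainingArguments):

```python
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Simplified sketch of how deepspeed_config_2.json is wired into the Trainer.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/polyglot-ko-1.3b")  # placeholder model

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,        # placeholder value
    gradient_accumulation_steps=4,        # placeholder value
    learning_rate=2e-5,                   # placeholder value
    num_train_epochs=1,                   # placeholder value
    deepspeed="deepspeed_config_2.json",  # the config shown above
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized/truncated dataset from the earlier sketch
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The job itself is started with the deepspeed launcher across the 4 GPUs, something like: deepspeed --num_gpus=4 deepspeed-trainer.py --deepspeed deepspeed_config_2.json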
Also, I am using 4 A100 80GB GPUs for data-parallel training.
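If it helps for debugging, host memory can be tracked around dataset loading and training steps with a small helper like this (psutil is assumed to be installed; the helper name is just illustrative):

```python
import psutil

def report_host_memory(tag: str) -> None:
    """Print this process's RSS and overall host memory usage."""
    rss_gb = psutil.Process().memory_info().rss / 1e9
    vm = psutil.virtual_memory()
    print(f"[{tag}] process RSS: {rss_gb:.1f} GB | "
          f"host used: {vm.used / 1e9:.1f} / {vm.total / 1e9:.1f} GB")

# e.g. call after dataset loading and every few hundred training steps
report_host_memory("after dataset load")
```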
Any suggestions or thoughts would be very helpful. Thanks!