Hello
I’m fine-tuning EleutherAI/gpt-j-6B on a conversational dataset (TriviaQA with 80,000 rows) using DeepSpeed on a 24 GB GPU with 150 GB of system RAM. The DeepSpeed config file and the HF TrainingArguments are shown below.
During training I often get the message "OVERFLOW! Rank 0 Skipping step.", after which the loss scale is reduced. After a short while, training aborts with "Current loss scale already at minimum - cannot decrease scale anymore. Exiting run."
When I look at the loss, it also fluctuates heavily.
Is there an error in my DeepSpeed config or the TrainingArguments, or how should I change the hyperparameters? Hyperparameter tuning is difficult here because of the model size and the long runtime per experiment.
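For context, this is how I understand the dynamic loss scaler that produces these messages (a simplified sketch based on the values in my fp16 config below, not DeepSpeed's actual implementation; hysteresis is ignored):

class LossScaleSketch:
    """Simplified idea of DeepSpeed's dynamic loss scaling (hysteresis ignored)."""

    def __init__(self):
        self.scale = 2.0 ** 16       # initial_scale_power = 16
        self.min_scale = 1.0         # min_loss_scale = 1
        self.window = 1000           # loss_scale_window = 1000
        self.good_steps = 0

    def update(self, overflowed: bool) -> None:
        if overflowed:
            # "OVERFLOW! Rank 0 Skipping step." -- the optimizer step is skipped
            self.good_steps = 0
            if self.scale <= self.min_scale:
                raise RuntimeError(
                    "Current loss scale already at minimum - cannot decrease scale anymore."
                )
            self.scale = max(self.scale / 2.0, self.min_scale)
        else:
            self.good_steps += 1
            if self.good_steps >= self.window:
                # a full window without overflow -> try a larger scale again
                self.scale *= 2.0
                self.good_steps = 0

So in my run the gradients apparently keep overflowing, the scale is halved again and again until it reaches min_loss_scale, and then the run aborts.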
Second, I’m also getting the warning "FP16 params for CPUAdam may not work on AMD CPUs". I’m running on an AMD CPU with FP16 enabled. Is this a problem?
from transformers import TrainingArguments

training_args = TrainingArguments(
    disable_tqdm=True,
    output_dir='/checkpoints',
    save_total_limit=5,
    logging_dir='/logs',
    num_train_epochs=3,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    logging_steps=10,
    overwrite_output_dir=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    eval_accumulation_steps=4,
    gradient_checkpointing=True,
    max_grad_norm=0.5,
    lr_scheduler_type="constant_with_warmup",
    learning_rate=7e-5,
    warmup_ratio=0.05,
    weight_decay=0.1,
    fp16_full_eval=True,
    fp16=True,
    fp16_opt_level='O1',
    deepspeed="configs/ds_config.json",
    report_to=['tensorboard'],
)
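For completeness, this is roughly how the arguments are passed to the Trainer; the dataset variables are placeholders for the tokenized TriviaQA splits, which are prepared elsewhere:

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # tokenized TriviaQA train split (prepared elsewhere)
    eval_dataset=eval_dataset,     # tokenized validation split (prepared elsewhere)
    tokenizer=tokenizer,
)
trainer.train()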
configs/ds_config.json:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "opt_level": "O1"
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e7,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e7,
        "stage3_max_reuse_distance": 1e7,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "activation_checkpointing": {
        "partitioned_activations": true,
        "number_checkpoints": 100,
        "contiguous_memory_optimization": true,
        "cpu_checkpointing": true,
        "profile": true,
        "synchronize_checkpoint_boundary": true
    },
    "tensorboard": {
        "enabled": true,
        "output_path": "/cluster/scratch/wrafael/cache/models/logs",
        "job_name": "finetune_gpt_j_6b"
    },
    "steps_per_print": 100,
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "memory_breakdown": false
}
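As a side note, I sanity-check that the config parses and that the batch-size arithmetic lines up before launching; with the HF integration, the "auto" entries get filled in from the TrainingArguments at runtime:

import json

with open("configs/ds_config.json") as f:
    ds_config = json.load(f)

# On a single GPU the effective global batch size is
# per_device_train_batch_size * gradient_accumulation_steps * world_size.
world_size = 1
expected_train_batch_size = 4 * 4 * world_size
print("fp16 section:", ds_config["fp16"])
print("expected train_batch_size:", expected_train_batch_size)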