Hello,
I’m using DeepSpeed to train big models on two GPUs (models that cannot fit on a single GPU).
I’m using the Hugging Face Trainer with DeepSpeed params like so:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    train_dataset=CustomDataset(...),
    eval_dataset=CustomDataset(...),
    tokenizer=tokenizer,
    compute_metrics=custom_metric,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    args=TrainingArguments(
        output_dir=f"output_{backbone.replace('/', '_')}",
        do_train=True,
        do_eval=True,
        do_predict=True,
        evaluation_strategy="steps",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        num_train_epochs=5,
        fp16=True,
        # DeepSpeed config passed inline as a dict instead of a JSON file
        deepspeed={
            "fp16": {
                "enabled": True,
                "min_loss_scale": 1,
                "opt_level": "O3",
            },
            "wandb": {
                "enabled": True,
                "team": "my_team",
                "project": "my_project",
            },
            "optimizer": {
                "type": "AdamW",
                "params": {
                    "lr": "auto",
                    "betas": "auto",
                    "eps": "auto",
                    "weight_decay": "auto",
                },
            },
            "scheduler": {
                "type": "WarmupLR",
                "params": {
                    "warmup_min_lr": "auto",
                    "warmup_max_lr": "auto",
                    "warmup_num_steps": "auto",
                },
            },
            # ZeRO stage 3, with optimizer states and parameters offloaded to CPU
            "zero_optimization": {
                "stage": 3,
                "offload_optimizer": {"device": "cpu"},
                "offload_param": {"device": "cpu"},
                "allgather_partitions": True,
                "allgather_bucket_size": 5e8,
                "contiguous_gradients": True,
            },
            "train_batch_size": 4,
            "train_micro_batch_size_per_gpu": 2,
        },
        report_to="wandb",
    ),
)
The first training went fine and I have checkpoint folders containing:
global_step29500
|--zero_pp_rank_0_mp_rank_00_model_states.pt
|--zero_pp_rank_0_mp_rank_00_optim_states.pt
special_tokens_map.json
trainer_state.json
zero_to_fp32.py
config.json
merges.txt
tokenizer.json
training_args.bin
latest
rng_state_0.pth
tokenizer_config.json
vocab.json
However, when I try to resume_from_checkpoint (in trainer.train()) it fails with:

raise ValueError(f"Can't find a valid checkpoint")
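For reference, the resume call looks roughly like this (the exact checkpoint directory name here is just an illustration of what I pass, based on the folder above):

trainer.train(resume_from_checkpoint=f"output_{backbone.replace('/', '_')}/checkpoint-29500")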
So I tried to use the zero_to_fp32.py script to create a latest.ckpt, but without success.
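In case it matters, this is the kind of thing I’ve been trying with it, using the equivalent helpers in deepspeed.utils.zero_to_fp32 (checkpoint_dir is a placeholder for the folder shown above, and I may well be misusing them):

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "path/to/checkpoint-29500"  # placeholder for the checkpoint folder above
# Consolidate the ZeRO-3 shards (the global_step29500 folder, located via the "latest" tag file)
# into a single fp32 state dict on CPU, then save it as a regular PyTorch checkpoint.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
torch.save(state_dict, f"{checkpoint_dir}/pytorch_model.bin")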
The DeepSpeed docs say something about “automatically loading weights”, but I’m skeptical.
Thanks in advance,
Have a great day.