Using the Hugging Face Trainer, I am training models in a Colab notebook with no problems. I now need to use DeepSpeed since I'm running out of memory. DeepSpeed installed without any problems via pip install deepspeed (Torch 1.13 was already installed). When I run !ds_report in the notebook, everything looks good.
However, when I add deepspeed=ds_config_dict to my TrainingArguments, it crashes with the following:
— START crash details —
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py in __init__(self, config_file_or_dict)
     65     dep_version_check("accelerate")
     66     dep_version_check("deepspeed")
---> 67     super().__init__(config_file_or_dict)
     68
TypeError: object.__init__() takes exactly one argument (the instance to initialize)
— END crash details —
I've tried lots of combinations, e.g. different config dicts and the config stored in a JSON file on disk. I also searched for solutions online but haven't come across this problem anywhere. Any help appreciated. Thanks.
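For reference, this is roughly the "config in a JSON file" variant I tried (the filename is just illustrative):

```python
import json

# Dump the same config dict to disk and pass the file path to
# TrainingArguments instead of the dict itself.
ds_config_dict = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    }
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config_dict, f, indent=2)

# then: args = TrainingArguments(..., deepspeed="ds_config.json")
```

Both variants (dict and file path) crash with the same TypeError.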
The following is the cell from my notebook.
ds_config_dict = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "overlap_comm": True,
        "contiguous_gradients": True
    }
}
BS = 10
GRAD_ACC = 2
LR = 5e-5
WD = 0.01
WARMUP = 0.1
N_EPOCHS = 5
model_name = model_checkpoint.split("/")[-1]
!echo $ds_config_dict
args = TrainingArguments(
    f"{model_name}-finetuned-{source_lang}-to-{target_lang}",
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BS,
    per_device_eval_batch_size=BS,
    num_train_epochs=N_EPOCHS,
    weight_decay=WD,
    report_to="wandb",
    gradient_accumulation_steps=GRAD_ACC,
    warmup_ratio=WARMUP,
    fp16=True,
    deepspeed=ds_config_dict
)