Unexpected `.items()` call on an int when using the HF Trainer with accelerate on multiple GPUs only, how to fix?

A weird `AttributeError` from `.items()` being called on an int, coming from inside the HF Trainer's checkpoint-saving path.

I got the following error:

```
Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/massive-autoformalization-maf/maf-src/af_train/unpaired_pytorch_hf_training.py", line 362, in <module>
    main()
  File "/lfs/skampere1/0/brando9/massive-autoformalization-maf/maf-src/af_train/unpaired_pytorch_hf_training.py", line 352, in main
    train()
  File "/lfs/skampere1/0/brando9/massive-autoformalization-maf/maf-src/af_train/unpaired_pytorch_hf_training.py", line 346, in train
    trainer.train()
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/transformers/trainer.py", line 2339, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/transformers/trainer.py", line 2408, in _save_checkpoint
    save_fsdp_optimizer(
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/accelerate/utils/fsdp_utils.py", line 138, in save_fsdp_optimizer
    optim_state = FSDP.optim_state_dict(model, optimizer)
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1753, in optim_state_dict
    return FullyShardedDataParallel._optim_state_dict_impl(
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1154, in _optim_state_dict_impl
    return _optim_state_dict(
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1455, in _optim_state_dict
    _gather_orig_param_state(
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1690, in _gather_orig_param_state
    gathered_state = _all_gather_optim_state(fsdp_state, optim_state)
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1637, in _all_gather_optim_state
    for name, non_tensor_value in object_state.non_tensors.items():
AttributeError: 'int' object has no attribute 'items'
```

when training with the HF Trainer through accelerate. It only happens with multiple GPUs, and the traceback shows it fires during checkpoint saving, inside FSDP's `optim_state_dict` gathering. Has anyone experienced this or knows how to fix it?
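To illustrate the failure mode without needing GPUs: the failing line in torch's `fsdp/_optim_utils.py` iterates `object_state.non_tensors.items()`, i.e. it assumes `non_tensors` is a dict of name-to-value, but in my run it is apparently a plain int (possibly a non-tensor scalar in the optimizer state, such as an integer step counter; that part is my guess). A minimal sketch, where `gather_non_tensors` is a made-up stand-in for that loop, not torch's actual function:

```python
def gather_non_tensors(non_tensors):
    """Stand-in for the loop in FSDP's _all_gather_optim_state:
    assumes `non_tensors` is a dict and iterates its items."""
    gathered = {}
    for name, value in non_tensors.items():  # the line that blows up
        gathered[name] = value
    return gathered

# Works when non_tensors is the expected dict:
print(gather_non_tensors({"step": 3}))  # {'step': 3}

# Reproduces the same AttributeError when it is a bare int:
try:
    gather_non_tensors(3)
except AttributeError as e:
    print(e)  # 'int' object has no attribute 'items'
```

So whatever is putting a raw int where FSDP expects a dict of non-tensor state is the thing I need to track down (or a torch/accelerate version mismatch that already handles this case differently).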

We can’t do much without seeing code here :slight_smile: