Weird: an int getting `.items()` called on it during the HF Trainer's FSDP checkpoint save.
I got the following error:
Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/massive-autoformalization-maf/maf-src/af_train/unpaired_pytorch_hf_training.py", line 362, in <module>
    main()
  File "/lfs/skampere1/0/brando9/massive-autoformalization-maf/maf-src/af_train/unpaired_pytorch_hf_training.py", line 352, in main
    train()
  File "/lfs/skampere1/0/brando9/massive-autoformalization-maf/maf-src/af_train/unpaired_pytorch_hf_training.py", line 346, in train
    trainer.train()
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/transformers/trainer.py", line 2339, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/transformers/trainer.py", line 2408, in _save_checkpoint
    save_fsdp_optimizer(
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/accelerate/utils/fsdp_utils.py", line 138, in save_fsdp_optimizer
    optim_state = FSDP.optim_state_dict(model, optimizer)
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1753, in optim_state_dict
    return FullyShardedDataParallel._optim_state_dict_impl(
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1154, in _optim_state_dict_impl
    return _optim_state_dict(
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1455, in _optim_state_dict
    _gather_orig_param_state(
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1690, in _gather_orig_param_state
    gathered_state = _all_gather_optim_state(fsdp_state, optim_state)
  File "/lfs/skampere1/0/brando9/miniconda/envs/maf/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1637, in _all_gather_optim_state
    for name, non_tensor_value in object_state.non_tensors.items():
AttributeError: 'int' object has no attribute 'items'
This happens when training with the HF Trainer on top of accelerate (FSDP), and only with multiple GPUs. Has anyone experienced this or know how to fix it?
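To isolate what the traceback is saying, here is a minimal illustration (not the actual FSDP code) of the failure mode. The gather loop in `_all_gather_optim_state` iterates `object_state.non_tensors.items()`, i.e. it assumes `non_tensors` is a dict mapping state names to values; in my run it is instead a bare int (plausibly an optimizer step counter, though that is my guess), so calling `.items()` on it raises exactly the AttributeError above:

```python
# Stand-in for object_state.non_tensors: FSDP expects a dict here,
# but in the failing run it holds a bare int instead.
non_tensors = 3

msg = None
try:
    # Mirrors the failing line in torch/distributed/fsdp/_optim_utils.py:
    # iterating .items() on an int raises AttributeError.
    for name, non_tensor_value in non_tensors.items():
        print(name, non_tensor_value)
except AttributeError as e:
    msg = str(e)
    print(msg)  # 'int' object has no attribute 'items'
```

So the question is really why the optimizer state that FSDP gathers contains a raw int where a dict of non-tensor state is expected, and whether that points to a version mismatch between torch, accelerate, and transformers.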