I’m trying to fine-tune facebook/esm2_t33_650M_UR50D (and eventually a larger one) on SageMaker, using the HuggingFace estimator to make model parallelism easier. I ran into the “unused parameter” problem, so I added the appropriate argument to my hyperparameters:
hyperparameters = {
"ddp_find_unused_parameters": True,
...
}
...
huggingface_estimator = HuggingFace(
entry_point='train_huggingface.py',
source_dir='./',
instance_type='ml.p3.16xlarge',
instance_count=1,
role=role,
transformers_version='4.26.0',
pytorch_version='1.13.1',
py_version='py39',
distribution=distribution_parameters,
hyperparameters=hyperparameters
)
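As I understand it (an assumption about the plumbing on my part, not something I’ve verified in the SDK source), the estimator turns each entry in hyperparameters into a command-line flag for the entry point, so the script only has to parse --ddp_find_unused_parameters into TrainingArguments for the Trainer to pick it up. Something like:

# Sketch of the script-side parsing I'm assuming (kept from run_mlm.py);
# the estimator should invoke the script roughly as
#   python train_huggingface.py --ddp_find_unused_parameters True ...
from transformers import HfArgumentParser, TrainingArguments

parser = HfArgumentParser(TrainingArguments)
# return_remaining_strings=True so flags meant for other argument groups don't error out
training_args, _ = parser.parse_args_into_dataclasses(return_remaining_strings=True)
print(training_args.ddp_find_unused_parameters)  # expecting True when set via hyperparameters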
The train_huggingface.py script is just a stripped-down version of the example run_mlm.py script included in the transformers examples. Also note I am not using gradient checkpointing; at least, I have not set any options to use it. Since I’m just starting to get familiar with this model, I’m training on a subset of sequences, using standard MLM and the DataCollatorForLanguageModeling.
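For concreteness, the relevant part of the script is essentially the stock MLM setup; a rough sketch (train_dataset is the tokenized subset of sequences, training_args comes from the parsing above, and mlm_probability=0.15 is just the default):

from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Standard MLM masking; no gradient checkpointing or other extras enabled.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=training_args,           # TrainingArguments parsed from the hyperparameters
    train_dataset=train_dataset,  # the subset of sequences, already tokenized
    data_collator=data_collator,
)
trainer.train()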
I ran it again, and got the same error:
[1,mpirank:0,algo-1]<stderr>:RuntimeError: Expected to have finished reduction in the prior iteration before
[1,mpirank:0,algo-1]<stderr>:starting a new one. This error indicates that your module has parameters that
[1,mpirank:0,algo-1]<stderr>:were not used in producing loss. You can enable unused parameter detection by
[1,mpirank:0,algo-1]<stderr>:passing the keyword argument `find_unused_parameters=True` to
[1,mpirank:0,algo-1]<stderr>:`torch.nn.parallel.DistributedDataParallel`, and by
[1,mpirank:0,algo-1]<stderr>:making sure all `forward` function outputs participate in calculating loss.
[1,mpirank:0,algo-1]<stderr>:If you already have done the above, then the distributed data parallel module
[1,mpirank:0,algo-1]<stderr>:wasn't able to locate the output tensors in the return value of your module's
[1,mpirank:0,algo-1]<stderr>:`forward` function. Please include the loss function and the structure of the
[1,mpirank:0,algo-1]<stderr>:return value of `forward` of your module when reporting this issue (e.g. list,
[1,mpirank:0,algo-1]<stderr>:dict, iterable).
[1,mpirank:0,algo-1]<stderr>:Parameter indices which did not receive grad for rank 0: 1 132 133
[1,mpirank:0,algo-1]<stderr>: In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to
[1,mpirank:0,algo-1]<stderr>:either INFO or DETAIL to print out information about which particular parameters
[1,mpirank:0,algo-1]<stderr>:did not receive gradient on this rank [1,mpirank:0,algo-1]<stderr>:as part of this error
--------------------------------------------------------------------------
When I’ve tried this on EC2 (p3.8xlarge) with the same versions of transformers and pytorch (4.26, 1.13.1), I haven’t run into this problem, and in the past, when I’ve used a similar argument in pytorch-lightning, the issue resolved right away. Here, it doesn’t seem to be doing anything.
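One thing I haven’t tried yet is the TORCH_DISTRIBUTED_DEBUG output the error suggests; I’m assuming the way to get that env var onto the training container is the estimator’s environment argument, something like:

# Same estimator as above, plus the debug env var the error message suggests.
# Assumption: `environment` is the right place to set container env vars for the job.
huggingface_estimator = HuggingFace(
    entry_point='train_huggingface.py',
    source_dir='./',
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    distribution=distribution_parameters,
    hyperparameters=hyperparameters,
    environment={'TORCH_DISTRIBUTED_DEBUG': 'DETAIL'},
)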
Any idea what to do? Are there other options I need to be passing for this to work?