ddp_find_unused_parameters hyperparameter to Hugging Face SageMaker Estimator not doing anything?

I’m trying to fine-tune facebook/esm2_t33_650M_UR50D (and eventually a larger model) on SageMaker, using the Hugging Face estimator to make model parallelism easier. I ran into the “unused parameter” problem, so I added the appropriate argument to my hyperparameters.

hyperparameters = {
    "ddp_find_unused_parameters": True,
    ...
}
...
huggingface_estimator = HuggingFace(
    entry_point='train_huggingface.py',
    source_dir='./',
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    distribution=distribution_parameters,
    hyperparameters=hyperparameters
)
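One way to narrow this down is to check, inside the entry-point script, whether the flag arrives at all. The sketch below is not part of the original script. It assumes SageMaker passes each hyperparameter to the entry point as a `--name value` command-line pair, and the helper names `hyperparameters_to_argv`, `str2bool`, and `parse_flag` are hypothetical. It only shows whether the value reaches the script, not whether the Trainer forwards it to DistributedDataParallel.

```python
import argparse


def str2bool(value):
    """Interpret common string spellings of a boolean CLI value."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"not a boolean: {value!r}")


def hyperparameters_to_argv(hyperparameters):
    """Assumed mapping (hypothetical helper): each entry becomes `--name value`."""
    argv = []
    for name, value in hyperparameters.items():
        argv.extend([f"--{name}", str(value)])
    return argv


def parse_flag(argv):
    """Return the received ddp_find_unused_parameters value, or None if absent."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--ddp_find_unused_parameters", type=str2bool, default=None)
    # parse_known_args ignores every other hyperparameter on the command line.
    args, _unknown = parser.parse_known_args(argv)
    return args.ddp_find_unused_parameters
```

Calling `parse_flag(sys.argv[1:])` near the top of train_huggingface.py and logging the result would show what the script received. If it logs True and the error persists, the value is being dropped somewhere after argument parsing.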

The train_huggingface.py script is just a stripped-down version of the run_mlm.py example script included in the transformers examples. Note that I am not using gradient checkpointing; at least, I have not set any options to enable it. Since I’m just getting familiar with this model, I’m training on a subset of sequences, using standard MLM and DataCollatorForLanguageModeling.

I ran it again, and got the same error:

[1,mpirank:0,algo-1]<stderr>:RuntimeError: Expected to have finished reduction in the prior iteration before 
[1,mpirank:0,algo-1]<stderr>:starting a new one. This error indicates that your module has parameters that 
[1,mpirank:0,algo-1]<stderr>:were not used in producing loss. You can enable unused parameter detection by 
[1,mpirank:0,algo-1]<stderr>:passing the keyword argument `find_unused_parameters=True` to 
[1,mpirank:0,algo-1]<stderr>:`torch.nn.parallel.DistributedDataParallel`, and by 
[1,mpirank:0,algo-1]<stderr>:making sure all `forward` function outputs participate in calculating loss. 
[1,mpirank:0,algo-1]<stderr>:If you already have done the above, then the distributed data parallel module 
[1,mpirank:0,algo-1]<stderr>:wasn't able to locate the output tensors in the return value of your module's 
[1,mpirank:0,algo-1]<stderr>:`forward` function. Please include the loss function and the structure of the 
[1,mpirank:0,algo-1]<stderr>:return value of `forward` of your module when reporting this issue (e.g. list, 
[1,mpirank:0,algo-1]<stderr>:dict, iterable).
[1,mpirank:0,algo-1]<stderr>:Parameter indices which did not receive grad for rank 0: 1 132 133
[1,mpirank:0,algo-1]<stderr>: In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to 
[1,mpirank:0,algo-1]<stderr>:either INFO or DETAIL to print out information about which particular parameters
[1,mpirank:0,algo-1]<stderr>:did not receive gradient on this rank 
[1,mpirank:0,algo-1]<stderr>:as part of this error
--------------------------------------------------------------------------
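The indices in the error (1, 132, 133) can be turned into parameter names to see which parts of the model are not contributing to the loss. The helper below is a rough sketch, not something from the original script. It assumes the indices follow the order of the trainable entries yielded by `model.named_parameters()`, which may not hold for every model (for example, with shared weights), and `names_for_indices` is a hypothetical name.

```python
def names_for_indices(named_parameters, indices):
    """Map reported parameter indices to parameter names.

    Assumption: indices count only trainable parameters, in the order
    they are yielded by `named_parameters` (an iterable of (name, param)).
    """
    trainable = [
        name
        for name, param in named_parameters
        if getattr(param, "requires_grad", True)
    ]
    # Indices outside the list are skipped rather than raising.
    return {i: trainable[i] for i in indices if 0 <= i < len(trainable)}
```

In the training script this would be called as `names_for_indices(model.named_parameters(), [1, 132, 133])` before the model is wrapped for distributed training. Setting TORCH_DISTRIBUTED_DEBUG to DETAIL, as the error message suggests, is the more reliable way to get the same information.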

When I tried this on EC2 (p3.8xlarge) with the same versions of transformers and pytorch (4.26, 1.13.1), I didn’t run into this problem, and in the past when I’ve used a similar argument in pytorch-lightning, the issue was resolved right away. Here, it doesn’t seem to be doing anything.

Any idea what to do? Are there other options I need to be passing for this to work?

I’m having the same issue (except with XLNet and the DataCollatorForPermutationLanguageModeling). Still haven’t resolved it. Any progress on your end?

No, no progress. It turns out the ddp_find_unused_parameters argument isn’t passed to the model constructor for SageMaker models. I filed an issue for that.

Great! Hopefully they can turn it around quickly.

I’m stuck here too. Has anyone been able to resolve this?