Find_unused_parameters parameter to Huggingface SM Estimator not doing anything?

I’m trying to fine-tune facebook/esm2_t33_650M_UR50D from the Hugging Face Hub (and eventually a larger one) on SageMaker, using the Hugging Face estimator to make model parallelism easier. I ran into the “unused parameter” problem, so I added the appropriate argument to my hyperparameters.

hyperparameters = {
    "ddp_find_unused_parameters": True,
    ...
}
...
huggingface_estimator = HuggingFace(
    entry_point='train_huggingface.py',
    source_dir='./',
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    distribution=distribution_parameters,
    hyperparameters=hyperparameters
)
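
For reference, distribution_parameters is just the SageMaker data-parallel setting, roughly the standard form from the docs (sketched from memory, so the exact dict may differ from what's in my notebook):

# Sketch of the distribution config (standard SageMaker data-parallel form;
# reconstructed from memory, not copied verbatim from my code).
distribution_parameters = {
    "smdistributed": {
        "dataparallel": {"enabled": True}
    }
}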

The train_huggingface.py script is just a stripped-down version of the example run_mlm.py script included in the transformers examples. Note also that I am not using gradient checkpointing; at least, I have not set any options to enable it. Since I’m just starting to get familiar with this model, I’m training on a subset of sequences, using standard MLM and the DataCollatorForLanguageModeling.
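
To give a sense of the relevant part of the script, the setup looks roughly like the sketch below (simplified: the real script parses its arguments with HfArgumentParser the way run_mlm.py does, and the toy dataset here is just a stand-in for the real sequence subset):

# Simplified sketch of the training setup, not the actual train_huggingface.py.
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Toy stand-in for the real subset of protein sequences.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MADQLTEEQIAEFKEAFSLF"]
dataset = Dataset.from_dict({"text": sequences}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

# Standard MLM collation, no gradient checkpointing anywhere.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="/opt/ml/model",
    per_device_train_batch_size=4,
    # This is the flag the SageMaker hyperparameter is supposed to end up in:
    ddp_find_unused_parameters=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()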

I ran it again, and got the same error:

[1,mpirank:0,algo-1]<stderr>:RuntimeError: Expected to have finished reduction in the prior iteration before 
[1,mpirank:0,algo-1]<stderr>:starting a new one. This error indicates that your module has parameters that 
[1,mpirank:0,algo-1]<stderr>:were not used in producing loss. You can enable unused parameter detection by 
[1,mpirank:0,algo-1]<stderr>:passing the keyword argument `find_unused_parameters=True` to 
[1,mpirank:0,algo-1]<stderr>:`torch.nn.parallel.DistributedDataParallel`, and by 
[1,mpirank:0,algo-1]<stderr>:making sure all `forward` function outputs participate in calculating loss. 
[1,mpirank:0,algo-1]<stderr>:If you already have done the above, then the distributed data parallel module 
[1,mpirank:0,algo-1]<stderr>:wasn't able to locate the output tensors in the return value of your module's 
[1,mpirank:0,algo-1]<stderr>:`forward` function. Please include the loss function and the structure of the 
[1,mpirank:0,algo-1]<stderr>:return value of `forward` of your module when reporting this issue (e.g. list, 
[1,mpirank:0,algo-1]<stderr>:dict, iterable).
[1,mpirank:0,algo-1]<stderr>:Parameter indices which did not receive grad for rank 0: 1 132 133
[1,mpirank:0,algo-1]<stderr>: In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to 
[1,mpirank:0,algo-1]<stderr>:either INFO or DETAIL to print out information about which particular parameters
[1,mpirank:0,algo-1]<stderr>:did not receive gradient on this rank [1,mpirank:0,algo-1]<stderr>:as part of this error
--------------------------------------------------------------------------

When I’ve tried this on EC2 (p3.8xlarge) with the same versions of transformers and PyTorch (4.26, 1.13.1), I haven’t run into this problem, and in the past, when I’ve used a similar argument in pytorch-lightning, the issue resolved right away. Here, it doesn’t seem to be doing anything.
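
For comparison, in pytorch-lightning the flag goes straight into the strategy, which forwards it to DDP, which is presumably why it took effect immediately there. With a recent Lightning version it’s something like:

# pytorch-lightning sketch (recent versions with DDPStrategy): the kwarg is
# forwarded directly to torch.nn.parallel.DistributedDataParallel.
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(find_unused_parameters=True),
)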

Any idea what to do? Are there other options I need to be passing for this to work?

I’m having the same issue (except with XLNet and the DataCollatorForPermutationLanguageModeling). Still haven’t resolved it. Any progress on your end?

No, no progress. It turns out the find_unused_parameters argument isn’t passed to the model constructor for SageMaker models. I filed an issue for that.
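
In the meantime, the only workaround I can think of is hard-coding the flag inside train_huggingface.py after argument parsing, so it doesn’t depend on the hyperparameter hand-off at all. Untested, and it may not help if the SageMaker wrapper ignores the flag entirely, but something like:

# Untested idea: force the flag on the parsed TrainingArguments inside
# train_huggingface.py (parser here is the HfArgumentParser from run_mlm.py),
# instead of relying on the SageMaker hyperparameter being forwarded.
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
training_args.ddp_find_unused_parameters = True  # hard-code regardless of CLI args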

Great! Hopefully they can turn it around quickly.