'DistributedDataParallel' object has no attribute 'no_sync'

Hi,
I am trying to fine-tune LayoutLM with the following:

from sagemaker.huggingface import HuggingFace

# Enable SageMaker's distributed data parallel library
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

estimator = HuggingFace(
    entry_point='train.py',
    py_version='py36',
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    role=role,
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    checkpoint_s3_uri=checkpoint_dir,
    checkpoint_local_path='/opt/ml/checkpoints',
    hyperparameters={
        'epochs': 3,
        'batch-size': 16,
        'learning-rate': 5e-5,
        'use-cuda': True,
        'model-name': 'microsoft/layoutlm-base-uncased',
    },
    debugger_hook_config=False,
    volume_size=40,
    distribution=distribution,
    source_dir=source_dir,
)

estimator.fit({'input_data_dir': data_uri}, wait=True)

Relevant code in train.py file:

from transformers import (
    EarlyStoppingCallback,
    LayoutLMForTokenClassification,
    Trainer,
    TrainingArguments,
)

model = LayoutLMForTokenClassification.from_pretrained(
    'microsoft/layoutlm-base-uncased', num_labels=len(labels)
)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    report_to='wandb',
    run_name='test_run',
    logging_steps=500,
    fp16=True,
    load_best_model_at_end=True,
    evaluation_strategy='steps',
    gradient_accumulation_steps=1,
    save_steps=500,
    save_total_limit=5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback],
)

trainer.train()

Unfortunately, I keep getting the following error. I tried tracking down the problem but can't figure it out.

[1,7]<stdout>:Traceback (most recent call last):
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,7]<stdout>:    "__main__", mod_spec)
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,7]<stdout>:    exec(code, run_globals)
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,7]<stdout>:    main()
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,7]<stdout>:    run_command_line(args)
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,7]<stdout>:    run_path(sys.argv[0], run_name='__main__')
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,7]<stdout>:    pkg_name=pkg_name, script_name=fname)
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,7]<stdout>:    mod_name, mod_spec, pkg_name, script_name)
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,7]<stdout>:    exec(code, run_globals)
[1,7]<stdout>:  File "train.py", line 619, in <module>
[1,7]<stdout>:    trainer.train()
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1050, in train
[1,7]<stdout>:    with model.no_sync():
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 800, in __getattr__
[1,7]<stdout>:    type(self).__name__, name))
[1,7]<stdout>:torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'no_sync'

Any help would be appreciated!

Oh, and running the same code without DDP on a single-GPU instance works just fine, but it obviously takes much longer to complete.

Hey @efinkel88,

thanks for creating the topic. Could you upload your complete train.py? This would help to reproduce the error.

ugh it just started working with no changes to my code and I have no idea why. :man_shrugging:

Could it be possible that you had gradient_accumulation_steps>1? Or are you installing transformers from git master branch?
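For context on why I ask about gradient_accumulation_steps: the traceback fails at `with model.no_sync():`, and as far as I can tell the Trainer only enters that block on steps that accumulate gradients without an optimizer update, and only in a distributed run. Below is a rough paraphrase of that check, not the library's exact code; `enters_no_sync` is a hypothetical helper name, and the condition is my assumption about how the branch is chosen.

```python
def enters_no_sync(step, gradient_accumulation_steps, local_rank):
    """Hypothetical helper: True when a step would run inside model.no_sync().

    Assumed condition: the step only accumulates gradients (no optimizer
    update follows it) and the run is distributed (local_rank != -1).
    """
    is_accumulation_only_step = (step + 1) % gradient_accumulation_steps != 0
    is_distributed = local_rank != -1
    return is_accumulation_only_step and is_distributed


# With gradient_accumulation_steps=1, every step is an optimizer step.
print([enters_no_sync(s, 1, 0) for s in range(4)])
# -> [False, False, False, False]

# With gradient_accumulation_steps=4, steps 0-2 accumulate and step 3 updates.
print([enters_no_sync(s, 4, 0) for s in range(4)])
# -> [True, True, True, False]
```

Under that assumption, a run with gradient_accumulation_steps=1 should never reach `no_sync`, which is why the value actually passed to the job matters.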

I have the same issue when I use multi-host training (2 multi-GPU instances) and set gradient_accumulation_steps to 10.

I don’t install transformers separately; I just use the version that ships with SageMaker.

I wonder whether gradient_accumulation_steps is incompatible with multi-host training altogether, or whether there are other parameters I need to tweak?
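To illustrate what seems to be going wrong: the wrapper the job ends up with is called DistributedDataParallel but has no `no_sync` method. In a hand-written training loop, one could guard against that as sketched below. This is only a sketch under assumptions: `no_sync_if_available` and the two toy wrapper classes are hypothetical names of mine, and it does not help with the Trainer, which calls `model.no_sync()` internally. I would also expect skipping `no_sync` to cost extra gradient synchronization on every backward pass.

```python
from contextlib import contextmanager


@contextmanager
def _do_nothing():
    # Stand-in for contextlib.nullcontext, which Python 3.6 lacks.
    yield


def no_sync_if_available(model):
    """Hypothetical helper: use model.no_sync() when it exists, else do nothing."""
    no_sync = getattr(model, "no_sync", None)
    if callable(no_sync):
        return no_sync()
    return _do_nothing()


class WrapperWithoutNoSync:
    """Toy stand-in for a DDP-like wrapper that has no no_sync method."""


class WrapperWithNoSync:
    """Toy stand-in for a DDP-like wrapper that can pause gradient sync."""

    def __init__(self):
        self.sync_paused = False

    @contextmanager
    def no_sync(self):
        self.sync_paused = True
        try:
            yield
        finally:
            self.sync_paused = False


# No AttributeError here, even though the wrapper lacks no_sync.
with no_sync_if_available(WrapperWithoutNoSync()):
    pass
```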

Hey @Ishitori,

which transformers_version are you using?

Hi @philschmid,

I was using the default version published in AWS SageMaker. I have switched to version 4.6.1, and the problem is gone.

Thanks for the hint!
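In case it helps anyone else confirm which version the container really runs, here is a small check that could go near the top of train.py. It is a sketch: `parse_version` and `is_at_least` are hypothetical helpers, and 4.6.1 is simply the version that worked for me, not a documented minimum.

```python
import re


def parse_version(version):
    """Return the leading numeric components of a version string as ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed, minimum):
    """True when the installed version is the minimum version or newer."""
    return parse_version(installed) >= parse_version(minimum)


# In train.py this could be called with transformers.__version__, e.g.:
#   import transformers
#   print(transformers.__version__, is_at_least(transformers.__version__, "4.6.1"))
print(is_at_least("4.4.2", "4.6.1"))  # -> False
print(is_at_least("4.6.1", "4.6.1"))  # -> True
```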
