'DistributedDataParallel' object has no attribute 'no_sync'

efinkel88 · April 13, 2021, 4:05pm

Hi,
I am trying to fine-tune LayoutLM with the following:

distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

estimator = HuggingFace(
    entry_point = 'train.py',
    py_version = 'py36',
    transformers_version='4.4.2',
    pytorch_version='1.6.0', 
    role = role,
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    checkpoint_s3_uri=checkpoint_dir,
    checkpoint_local_path='/opt/ml/checkpoints',
    hyperparameters = {'epochs': 3, 
                       'batch-size': 16, 
                       'learning-rate': 5e-5, 
                       'use-cuda': True, 
                       'model-name':'microsoft/layoutlm-base-uncased'
                      },
    debugger_hook_config=False,
    volume_size = 40,
    distribution = distribution,
    source_dir = source_dir)

estimator.fit({'input_data_dir': data_uri}, wait = True)

Relevant code in train.py file:

model = LayoutLMForTokenClassification.from_pretrained('microsoft/layoutlm-base-uncased',num_labels = len(labels))

training_args = TrainingArguments(
    output_dir='./results',          
    num_train_epochs=4,              
    per_device_train_batch_size=16,  
    per_device_eval_batch_size=32,   
    warmup_ratio=0.1,               
    weight_decay=0.01,               
    report_to='wandb',
    run_name = 'test_run',
    logging_steps = 500,
    fp16 = True,
    load_best_model_at_end = True,
    evaluation_strategy = 'steps',
    gradient_accumulation_steps = 1,
    save_steps = 500,
    save_total_limit = 5,
)

trainer = Trainer(
    model=model,                         
    args=training_args,                  
    train_dataset=train_dataset,        
    eval_dataset=val_dataset,          
    data_collator = data_collator,
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback]
)

trainer.train()

Unfortunately, I keep getting the following error. I tried tracking down the problem but can't seem to figure it out.

[1,7]<stdout>:Traceback (most recent call last):
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,7]<stdout>:    "__main__", mod_spec)
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,7]<stdout>:    exec(code, run_globals)
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,7]<stdout>:    main()
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,7]<stdout>:    run_command_line(args)
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,7]<stdout>:    run_path(sys.argv[0], run_name='__main__')
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,7]<stdout>:    pkg_name=pkg_name, script_name=fname)
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,7]<stdout>:    mod_name, mod_spec, pkg_name, script_name)
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,7]<stdout>:    exec(code, run_globals)
[1,7]<stdout>:  File "train.py", line 619, in <module>
[1,7]<stdout>:    trainer.train()
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1050, in train
[1,7]<stdout>:    with model.no_sync():
[1,7]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 800, in __getattr__
[1,7]<stdout>:    type(self).__name__, name))
[1,7]<stdout>:torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'no_sync'

Any help would be appreciated!

efinkel88 · April 13, 2021, 4:08pm

Oh, and running the same code without DDP on a single-GPU instance works just fine, but it obviously takes much longer to complete.

philschmid · April 13, 2021, 4:21pm

Hey @efinkel88,

thanks for creating the topic. Could you upload your complete train.py? This would help to reproduce the error.

efinkel88 · April 13, 2021, 5:53pm

Ugh, it just started working with no changes to my code, and I have no idea why.

philschmid · April 13, 2021, 5:55pm

Is it possible that you had gradient_accumulation_steps > 1? Or are you installing transformers from the git master branch?
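For context on why gradient_accumulation_steps matters here: the traceback above fails at `with model.no_sync():` in the Trainer, and that kind of call only makes sense on intermediate accumulation steps. With gradient_accumulation_steps = 1 every step is a full update, so `(step + 1) % 1` is always 0 and a check like the one below never reaches `no_sync`. The following is a minimal sketch of that pattern, not the actual transformers code; `sync_context` is a hypothetical helper name, and the fallback to a no-op context is an assumption about how one could guard against a wrapper that lacks `no_sync`.

```python
from contextlib import nullcontext


def sync_context(model, step, accumulation_steps):
    """Pick the context manager for one training step (hypothetical helper).

    On intermediate accumulation steps, use model.no_sync() when the
    distributed wrapper provides it; otherwise fall back to a no-op context
    so the step still runs (with a normal gradient sync).
    """
    is_intermediate_step = (step + 1) % accumulation_steps != 0
    no_sync = getattr(model, "no_sync", None)
    if is_intermediate_step and callable(no_sync):
        return no_sync()
    return nullcontext()
```

In a training loop this would wrap the forward/backward pass, e.g. `with sync_context(model, step, accumulation_steps): ...`. A wrapper without `no_sync` then costs an extra gradient sync per intermediate step instead of raising an AttributeError.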

Ishitori · June 2, 2021, 9:47pm

I have the same issue when I use multi-host training (2 multi-GPU instances) and set gradient_accumulation_steps to 10.

I don’t install transformers separately; I just use the one that comes with SageMaker.

I wonder whether gradient_accumulation_steps is incompatible with multi-host training altogether, or whether there are other parameters I need to tweak.

philschmid · June 4, 2021, 11:41am

Hey @Ishitori,

which transformers_version are you using?

Ishitori · June 7, 2021, 9:59pm

Hi @philschmid,

I was using the default version published in AWS SageMaker. I have switched to version 4.6.1, and the problem is gone.

Thanks for the hint!
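For anyone landing here later, a rough sketch of what pinning the version might look like in the estimator arguments. Only transformers_version='4.6.1' comes from this thread; the pytorch_version and py_version pairing is an assumption and should be checked against the combinations the SageMaker Hugging Face containers actually support. The values are collected in a plain dict (the name `estimator_kwargs` is made up for illustration) so the snippet runs without the SageMaker SDK.

```python
# Version pins for the HuggingFace estimator. Only transformers_version is
# taken from this thread; the other two values are assumptions to verify.
estimator_kwargs = {
    "transformers_version": "4.6.1",  # version reported to fix the error
    "pytorch_version": "1.7.1",       # assumed pairing, check supported combos
    "py_version": "py36",             # assumed, same as the original post
    "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
}

# These would be passed alongside the other arguments from the first post,
# e.g. HuggingFace(entry_point='train.py', role=role, ..., **estimator_kwargs)
```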

short-mamba · February 8, 2022, 5:31pm

Hey @efinkel88, I was wondering if you could share the train.py file. I am also using LayoutLM for document classification, and I want to train it on multiple GPUs using the Hugging Face Trainer API. I have all the features extracted and saved to disk, but I am not quite sure how to pass the train dataset to the Trainer API.
Thanks
