Hi,
I am trying to fine-tune LayoutLM on SageMaker with the HuggingFace estimator and SageMaker distributed data parallel, using the following:
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

estimator = HuggingFace(
    entry_point='train.py',
    py_version='py36',
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    role=role,
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    checkpoint_s3_uri=checkpoint_dir,
    checkpoint_local_path='/opt/ml/checkpoints',
    hyperparameters={
        'epochs': 3,
        'batch-size': 16,
        'learning-rate': 5e-5,
        'use-cuda': True,
        'model-name': 'microsoft/layoutlm-base-uncased',
    },
    debugger_hook_config=False,
    volume_size=40,
    distribution=distribution,
    source_dir=source_dir,
)

estimator.fit({'input_data_dir': data_uri}, wait=True)
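For context, train.py picks up those hyperparameters and the input_data_dir channel through the usual SageMaker script-mode conventions. A simplified sketch of that part of the script (not the exact code, the real parsing block is longer):

import argparse
import os

# Simplified sketch: SageMaker passes the estimator's hyperparameters as
# command-line flags (--epochs 3 --batch-size 16 ...) and exposes the
# 'input_data_dir' channel path through the SM_CHANNEL_INPUT_DATA_DIR env var.
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=3)
parser.add_argument('--batch-size', type=int, default=16)  # available as args.batch_size
parser.add_argument('--learning-rate', type=float, default=5e-5)
parser.add_argument('--use-cuda', type=lambda s: str(s).lower() in ('true', '1'), default=True)
parser.add_argument('--model-name', type=str, default='microsoft/layoutlm-base-uncased')
parser.add_argument('--input-data-dir', type=str, default=os.environ.get('SM_CHANNEL_INPUT_DATA_DIR'))
args, _ = parser.parse_known_args()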
The rest of the relevant code in train.py:
model = LayoutLMForTokenClassification.from_pretrained(
    'microsoft/layoutlm-base-uncased',
    num_labels=len(labels),
)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    report_to='wandb',
    run_name='test_run',
    logging_steps=500,
    fp16=True,
    load_best_model_at_end=True,
    evaluation_strategy='steps',
    gradient_accumulation_steps=1,
    save_steps=500,
    save_total_limit=5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback],
)

trainer.train()
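labels, train_dataset, val_dataset, data_collator and compute_metrics are all defined earlier in train.py. compute_metrics follows the usual seqeval recipe for token classification, roughly along these lines (simplified, not my exact code):

import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score

def compute_metrics(p):
    # p.predictions: (batch, seq_len, num_labels); p.label_ids: (batch, seq_len)
    predictions = np.argmax(p.predictions, axis=2)
    # Drop positions labelled -100 (padding / special tokens) before scoring.
    true_preds = [
        [labels[pr] for pr, la in zip(pred_row, label_row) if la != -100]
        for pred_row, label_row in zip(predictions, p.label_ids)
    ]
    true_labels = [
        [labels[la] for pr, la in zip(pred_row, label_row) if la != -100]
        for pred_row, label_row in zip(predictions, p.label_ids)
    ]
    return {
        'precision': precision_score(true_labels, true_preds),
        'recall': recall_score(true_labels, true_preds),
        'f1': f1_score(true_labels, true_preds),
    }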
Unfortunately I keep getting the following error. I've tried tracking down the problem but can't seem to figure it out.
[1,7]<stdout>:Traceback (most recent call last):
[1,7]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,7]<stdout>: "__main__", mod_spec)
[1,7]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,7]<stdout>: exec(code, run_globals)
[1,7]<stdout>: File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,7]<stdout>: main()
[1,7]<stdout>: File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,7]<stdout>: run_command_line(args)
[1,7]<stdout>: File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,7]<stdout>: run_path(sys.argv[0], run_name='__main__')
[1,7]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,7]<stdout>: pkg_name=pkg_name, script_name=fname)
[1,7]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,7]<stdout>: mod_name, mod_spec, pkg_name, script_name)
[1,7]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,7]<stdout>: exec(code, run_globals)
[1,7]<stdout>: File "train.py", line 619, in <module>
[1,7]<stdout>: trainer.train()
[1,7]<stdout>: File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1050, in train
[1,7]<stdout>: with model.no_sync():
[1,7]<stdout>: File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 800, in __getattr__
[1,7]<stdout>: type(self).__name__, name))
[1,7]<stdout>:torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'no_sync'
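In case it helps: in transformers 4.4.2, line 1050 of trainer.py is the no_sync() context manager used to skip gradient synchronization on accumulation steps. If I'm reading the source correctly, it sits inside a block roughly like this (approximate excerpt):

# transformers 4.4.2, src/transformers/trainer.py, around line 1050 (approximate excerpt)
if ((step + 1) % self.args.gradient_accumulation_steps != 0) and self.args.local_rank != -1:
    # Avoid unnecessary DDP synchronization since there will be no backward pass on this example.
    with model.no_sync():
        tr_loss += self.training_step(model, inputs)
else:
    tr_loss += self.training_step(model, inputs)

So the Trainer expects the wrapped model to expose no_sync(), and the DistributedDataParallel wrapper in this environment apparently doesn't have it.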
Any help would be appreciated!