Seq2Seq-Example does not work on Azure

Hi community,

we use transformers to generate summaries (seq2seq) for finance articles, using the model facebook/bart-large-cnn.
The generated summaries are pretty good.

In the next step we want to fine-tune this model. Based on the examples on GitHub, we want to run the fine-tuning in the Azure cloud with AzureML.
This is the part where we have problems.

We use the following snippet to run the fine-tuning:

dataset_input = Dataset.File.from_files(path=(datastore, 'datasets/wmt_en_ro'))

config = ScriptRunConfig(source_directory='transformers/examples/seq2seq', script='seq2seq_trainer.py',
                         compute_target='gpu-cluster', arguments=['--learning_rate', 3e-5,
                                                                  '--gpus', 1,
                                                                  '--num_train_epochs', 4,
                                                                  '--data_dir', dataset_input.as_mount(),
                                                                  '--output_dir', 'outputs',
                                                                  '--model_name_or_path', 'facebook/bart-large-cnn'])

# set up pytorch environment
env = Environment.from_pip_requirements(name='transformers-env', file_path='transformers/examples/seq2seq/requirements.txt')

# install local (forked) transformers package
whl_url = Environment.add_private_pip_wheel(workspace=ws, file_path=retrieve_whl_filepath(), exist_ok=True)
env.python.conda_dependencies.add_pip_package(whl_url)

env.python.conda_dependencies.add_pip_package('azureml-sdk')
env.python.conda_dependencies.add_pip_package('torch')
env.python.conda_dependencies.add_pip_package('torchvision')
env.docker.enabled = True

env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04'
config.run_config.environment = env

run = experiment.submit(config)

In Azure we see the following logs:

[2021-01-05T09:47:48.676635] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['seq2seq_trainer.py', '--learning_rate', '3E-05', '--gpus', '1', '--num_train_epochs', '4', '--data_dir', '/tmp/tmpzuh7cbpc', '--output_dir', 'outputs', '--model_name_or_path', 'facebook/bart-large-cnn'])
Script type = None
Starting the daemon thread to refresh tokens in background for process with pid = 121
Entering Run History Context Manager.
[2021-01-05T09:47:51.702678] Current directory: /mnt/batch/tasks/shared/LS_root/jobs/sumurai-ml/azureml/transformers-example-finetune_1609838617_863b0dcf/mounts/workspaceblobstore/azureml/transformers-example-finetune_1609838617_863b0dcf
[2021-01-05T09:47:51.702779] Preparing to call script [seq2seq_trainer.py] with arguments:['--learning_rate', '3E-05', '--gpus', '1', '--num_train_epochs', '4', '--data_dir', '/tmp/tmpzuh7cbpc', '--output_dir', 'outputs', '--model_name_or_path', 'facebook/bart-large-cnn']
[2021-01-05T09:47:51.702858] After variable expansion, calling script [seq2seq_trainer.py] with arguments:['--learning_rate', '3E-05', '--gpus', '1', '--num_train_epochs', '4', '--data_dir', '/tmp/tmpzuh7cbpc', '--output_dir', 'outputs', '--model_name_or_path', 'facebook/bart-large-cnn']

[2021-01-05T09:47:53.903691] Reloading <module '__main__' from 'seq2seq_trainer.py'> failed: module __main__ not in sys.modules.
Starting the daemon thread to refresh tokens in background for process with pid = 121

[2021-01-05T09:47:54.034286] The experiment completed successfully. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
1 items cleaning up...
Cleanup took 0.05184674263000488 seconds
[2021-01-05T09:47:54.508286] Finished context manager injector.

The output directory is not created because of the following error:

[2021-01-05T09:47:53.903691] Reloading <module '__main__' from 'seq2seq_trainer.py'> failed: module __main__ not in sys.modules.

That is our situation. We are clueless and hoping for some help; we don't know why this error occurs.
First of all: is this the right way to run the seq2seq fine-tuning in the cloud, or is there a better way?

Does anybody have an idea why this error occurs?


Thanks in advance.

Regards, Florian

Hi @Fl0w

I’m not familiar with Azure, but from what I can see, I think the script argument expects the path of a training script. seq2seq_trainer.py contains the trainer class; it is not a training script.
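To illustrate why nothing happens: Python executes a script top to bottom, so a file that only defines a class does no work when run directly, while a training script calls into that class under a main guard. This is a toy sketch with stand-in file and class names, not the actual transformers files:

```python
import pathlib
import subprocess
import sys
import tempfile
import textwrap

tmp = pathlib.Path(tempfile.mkdtemp())

# Stand-in for seq2seq_trainer.py: only a class definition, no entry point.
(tmp / "trainer_module.py").write_text(textwrap.dedent("""
    class Seq2SeqTrainer:
        def train(self):
            print("training...")
"""))

# Stand-in for finetune_trainer.py: a script that actually does the work.
(tmp / "training_script.py").write_text(textwrap.dedent("""
    from trainer_module import Seq2SeqTrainer

    if __name__ == "__main__":
        Seq2SeqTrainer().train()
"""))

# Running the class-only module produces no output at all.
out1 = subprocess.run([sys.executable, str(tmp / "trainer_module.py")],
                      capture_output=True, text=True).stdout
# Running the script with the main guard actually trains.
out2 = subprocess.run([sys.executable, str(tmp / "training_script.py")],
                      capture_output=True, text=True, cwd=tmp).stdout
print(repr(out1))  # ''
print(repr(out2))  # 'training...\n'
```

That is exactly what the log shows: the run "completed successfully" because the submitted file ran to completion without doing anything.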

finetune_trainer.py is the training script, so I think passing that might fix this.
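If it helps, this is what the submission from the question might look like with only the script argument changed (a sketch based on your snippet; everything else stays the same):

```python
# Same ScriptRunConfig as above, pointing at the training script
# (finetune_trainer.py) instead of the trainer-class module.
config = ScriptRunConfig(
    source_directory='transformers/examples/seq2seq',
    script='finetune_trainer.py',  # was: seq2seq_trainer.py
    compute_target='gpu-cluster',
    arguments=['--learning_rate', 3e-5,
               '--gpus', 1,
               '--num_train_epochs', 4,
               '--data_dir', dataset_input.as_mount(),
               '--output_dir', 'outputs',
               '--model_name_or_path', 'facebook/bart-large-cnn'])
```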


Perfect, thank you so much, @valhalla ! I feel stupid right now :frowning: