Seq2Seq-Example does not work on Azure

Hi community,

we use transformers to generate summaries (seq2seq) for finance articles, using the model facebook/bart-large-cnn.
The generated summaries are pretty good.

In the next step we want to fine-tune this model. Based on the examples on GitHub, we want to run the fine-tuning in the Azure cloud with AzureML.
This is the part where we have problems.

We use the following snippet to run the fine-tuning:

dataset_input = Dataset.File.from_files(path=(datastore, 'datasets/wmt_en_ro'))

config = ScriptRunConfig(source_directory='transformers/examples/seq2seq', script='seq2seq_trainer.py',
                         compute_target='gpu-cluster', arguments=['--learning_rate', 3e-5,
                                                                  '--gpus', 1,
                                                                  '--num_train_epochs', 4,
                                                                  '--data_dir', dataset_input.as_mount(),
                                                                  '--output_dir', 'outputs',
                                                                  '--model_name_or_path', 'facebook/bart-large-cnn'])

# set up pytorch environment
env = Environment.from_pip_requirements(name='transformers-env', file_path='transformers/examples/seq2seq/requirements.txt')

# install local (forked) transformers package
whl_url = Environment.add_private_pip_wheel(workspace=ws, file_path=retrieve_whl_filepath(), exist_ok=True)
env.python.conda_dependencies.add_pip_package(whl_url)

env.python.conda_dependencies.add_pip_package('azureml-sdk')
env.python.conda_dependencies.add_pip_package('torch')
env.python.conda_dependencies.add_pip_package('torchvision')
env.docker.enabled = True

env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04'
config.run_config.environment = env

run = experiment.submit(config)

In Azure we see the following logs:

[2021-01-05T09:47:48.676635] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['seq2seq_trainer.py', '--learning_rate', '3E-05', '--gpus', '1', '--num_train_epochs', '4', '--data_dir', '/tmp/tmpzuh7cbpc', '--output_dir', 'outputs', '--model_name_or_path', 'facebook/bart-large-cnn'])
Script type = None
Starting the daemon thread to refresh tokens in background for process with pid = 121
Entering Run History Context Manager.
[2021-01-05T09:47:51.702678] Current directory: /mnt/batch/tasks/shared/LS_root/jobs/sumurai-ml/azureml/transformers-example-finetune_1609838617_863b0dcf/mounts/workspaceblobstore/azureml/transformers-example-finetune_1609838617_863b0dcf
[2021-01-05T09:47:51.702779] Preparing to call script [seq2seq_trainer.py] with arguments:['--learning_rate', '3E-05', '--gpus', '1', '--num_train_epochs', '4', '--data_dir', '/tmp/tmpzuh7cbpc', '--output_dir', 'outputs', '--model_name_or_path', 'facebook/bart-large-cnn']
[2021-01-05T09:47:51.702858] After variable expansion, calling script [seq2seq_trainer.py] with arguments:['--learning_rate', '3E-05', '--gpus', '1', '--num_train_epochs', '4', '--data_dir', '/tmp/tmpzuh7cbpc', '--output_dir', 'outputs', '--model_name_or_path', 'facebook/bart-large-cnn']

[2021-01-05T09:47:53.903691] Reloading <module '__main__' from 'seq2seq_trainer.py'> failed: module __main__ not in sys.modules.
Starting the daemon thread to refresh tokens in background for process with pid = 121

[2021-01-05T09:47:54.034286] The experiment completed successfully. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
1 items cleaning up...
Cleanup took 0.05184674263000488 seconds
[2021-01-05T09:47:54.508286] Finished context manager injector.

The output directory is not created because of the following error:

[2021-01-05T09:47:53.903691] Reloading <module '__main__' from 'seq2seq_trainer.py'> failed: module __main__ not in sys.modules.

That is our situation. We are clueless and hoping for some help; we don't know why this error occurs.
First of all: is this the right way to run the seq2seq fine-tuning in the cloud, or is there a better way?

Does anybody have an idea why this error occurs?


Thanks in advance.

Regards, Florian

Hi @Fl0w

I’m not familiar with Azure, but from what I can see, I think the script argument expects the path of a training script. seq2seq_trainer.py contains the trainer class; it is not a training script.
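To illustrate why nothing happens: Python executes a script top to bottom, so a file that only defines a class does no work when run directly, while a training script calls into that class under a main guard. This is a toy sketch with stand-in file and class names, not the actual transformers files:

```python
import pathlib
import subprocess
import sys
import tempfile
import textwrap

tmp = pathlib.Path(tempfile.mkdtemp())

# Stand-in for seq2seq_trainer.py: only a class definition, no entry point.
(tmp / "trainer_module.py").write_text(textwrap.dedent("""
    class Seq2SeqTrainer:
        def train(self):
            print("training...")
"""))

# Stand-in for finetune_trainer.py: a script that actually does the work.
(tmp / "training_script.py").write_text(textwrap.dedent("""
    from trainer_module import Seq2SeqTrainer

    if __name__ == "__main__":
        Seq2SeqTrainer().train()
"""))

# Running the class-only module produces no output at all.
out1 = subprocess.run([sys.executable, str(tmp / "trainer_module.py")],
                      capture_output=True, text=True).stdout
# Running the script with the main guard actually trains.
out2 = subprocess.run([sys.executable, str(tmp / "training_script.py")],
                      capture_output=True, text=True, cwd=tmp).stdout
print(repr(out1))  # ''
print(repr(out2))  # 'training...\n'
```

That is exactly what the log shows: the run "completed successfully" because the submitted file ran to completion without doing anything.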

finetune_trainer.py is the training script, so I think passing that might fix this.
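If it helps, this is what the submission from the question might look like with only the script argument changed (a sketch based on your snippet; everything else stays the same):

```python
# Same ScriptRunConfig as above, pointing at the training script
# (finetune_trainer.py) instead of the trainer-class module.
config = ScriptRunConfig(
    source_directory='transformers/examples/seq2seq',
    script='finetune_trainer.py',  # was: seq2seq_trainer.py
    compute_target='gpu-cluster',
    arguments=['--learning_rate', 3e-5,
               '--gpus', 1,
               '--num_train_epochs', 4,
               '--data_dir', dataset_input.as_mount(),
               '--output_dir', 'outputs',
               '--model_name_or_path', 'facebook/bart-large-cnn'])
```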


Perfect, thank you so much, @valhalla ! I feel stupid right now :frowning: