SageMaker model parallelism: running the model results in a maximum recursion error

sagemaker/04_distributed_training_model_parallelism

I have customized run_glue.py to accept my custom data. I get the following error:

File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 309, in thread_execute_tracing
[1,mpirank:0,algo-1]: self._exec_trace_on_device(req, device)
[1,mpirank:0,algo-1]: File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 268, in _exec_trace_on_device
[1,mpirank:0,algo-1]: outputs = step_fn(*args, **kwargs)
[1,mpirank:0,algo-1]: File "/opt/conda/lib/python3.9/site-packages/transformers/trainer_pt_utils.py", line 1061, in smp_forward_backward
[1,mpirank:0,algo-1]: outputs = model(**inputs)
[1,mpirank:0,algo-1]: File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
[1,mpirank:0,algo-1]: return forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]: File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 75, in trace_forward
[1,mpirank:0,algo-1]: raise e
…this repeats n times, with the final error as follows:
[1,mpirank:0,algo-1]: RecursionError: maximum recursion depth exceeded while calling a Python object
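
For what it's worth, a quick way to tell whether the trace is merely deep or genuinely unbounded is to raise Python's recursion limit at the top of the training script and rerun; a minimal diagnostic sketch (the new limit is an arbitrary choice, and this is not part of run_glue.py):

# diagnostic sketch: if the same RecursionError returns at any limit,
# the tracing recursion is genuinely unbounded rather than just deep
import sys

print(f"current recursion limit: {sys.getrecursionlimit()}")  # usually 1000
sys.setrecursionlimit(10_000)  # arbitrary higher limit for the experiment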

@philschmid, I am tagging you since you created this notebook.

How did you customize it, and how did you start your job, including the hyperparameters?

@philschmid, The only change I have made is loading my custom data and preparing it. The rest of run_glue.py is as it is.

    elif data_args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        raw_datasets = load_dataset(
            data_args.dataset_name,
            data_args.dataset_config_name,
            cache_dir=model_args.cache_dir,
            use_auth_token=True if model_args.use_auth_token else None,
        )
    else:
        # dataset used
        dataset_name = ............
        # Load and prepare dataset
        data_files = {"train": "user_prompt_train.csv", "test": "user_prompt_test.csv"}
        dataset = load_dataset(
            dataset_name,
            data_files=data_files,
            use_auth_token=..........
        )

        train_dataset = dataset["train"].rename_column("text", "sentence1")
        test_dataset = dataset["test"].rename_column("text", "sentence1")

        sentence1_key = "sentence1"
        sentence2_key = None

        # build an index -> label mapping from the unique labels in the train split
        df = train_dataset.to_pandas()
        labels = df["label"].unique().tolist()
        label2class = {i: label for i, label in enumerate(labels)}
        logger.info(f"labels: {json.dumps(label2class)}")
        ClassLabels = ClassLabel(num_classes=len(labels), names=labels)

        # tokenizer helper function: tokenize the sentence column and
        # convert the string labels to integer ids
        def tokenize(batch):
            tokens = tokenizer(batch[sentence1_key], padding="max_length", truncation=True)
            tokens["labels"] = ClassLabels.str2int(batch["label"])
            return tokens

        # tokenize dataset
        train_dataset = train_dataset.map(tokenize, batched=True)
        test_dataset = test_dataset.map(tokenize, batched=True)

        # set format for pytorch
        train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
        test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
        # also, all instances of raw_datasets["train"] have been replaced with
        # train_dataset, and raw_datasets["validation"] with test_dataset.

I have removed the call to the preprocess_data() function as the data is now fully prepared.
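
As an aside, an equivalent way to get integer labels would be to cast the string label column to a ClassLabel feature instead of calling str2int inside tokenize; a minimal sketch of that variant (not what my script currently does):

# alternative sketch: cast "label" from string to ClassLabel so the
# string -> id encoding happens once, at the dataset level
from datasets import ClassLabel

labels = sorted(set(train_dataset["label"]))
class_label = ClassLabel(names=labels)
train_dataset = train_dataset.cast_column("label", class_label)
test_dataset = test_dataset.cast_column("label", class_label)
# after the cast, batch["label"] already holds integer ids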

The rest of the code is exactly the same as the run_glue.py that is linked from the notebook:

Creating an Estimator and start a training job

In this example we are going to use the run_glue.py from the transformers example scripts. We modified it and included SageMakerTrainer instead of the Trainer to enable model-parallelism. You can find the code here.

I have tried with the SageMakerTrainer as well as the regular Trainer. Same result.

In the hyperparameters there are two changes:

hyperparameters={
    'model_name_or_path':model_name,
    #'task_name': 'mnli',
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 16,
    'do_train': True,
    'do_eval': True,
    #'do_predict': True,
    'num_train_epochs': 2,
    'output_dir':'/opt/ml/model',
    'max_steps': 500,
}
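
For context, I create the estimator along the lines of the notebook; the instance type, framework versions, and the smp_options/mpi_options values below are placeholders rather than my exact settings:

# sketch of the estimator wiring, following the notebook's pattern
# (placeholder instance type, versions, and model-parallel settings)
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

smp_options = {"enabled": True, "parameters": {"partitions": 2, "ddp": True}}
mpi_options = {"enabled": True, "processes_per_host": 8}

huggingface_estimator = HuggingFace(
    entry_point="run_glue.py",
    source_dir="./scripts",
    instance_type="ml.p3.16xlarge",  # placeholder
    instance_count=1,
    role=role,
    transformers_version="4.26",  # placeholder framework versions
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters=hyperparameters,
    distribution={"smdistributed": {"modelparallel": smp_options}, "mpi": mpi_options},
)

huggingface_estimator.fit()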

Are you not providing your dataset when creating your job? How should SageMaker have access to user_prompt_train.csv?

Yes, I am.
The dataset name and token have been kept private because I cannot reveal them in a public forum. It loads the data properly. The CSV files have two columns: text, label.

dataset = load_dataset(
    dataset_name,
    data_files=data_files,
    use_auth_token=…
)

@philschmid This is exactly what run_glue.py does for other cases, such as task_name = "mnli":

if data_args.task_name is not None:
    # Downloading and loading a dataset from the hub.
    raw_datasets = load_dataset(
        "glue",
        data_args.task_name,
        cache_dir=model_args.cache_dir,
        use_auth_token=True if model_args.use_auth_token else None,
    )

I wanted to do a quick experiment, so I just hard-coded whatever was needed instead of passing it via the hyperparameters.

@philschmid I also forgot to mention that

#from transformers.sagemaker import SageMakerTrainingArguments as TrainingArguments, SageMakerTrainer as Trainer

does not work: it throws an AttributeError (distributed_state) and says that the import has been deprecated,
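so I use the plain imports instead, which is what newer transformers versions expect now that the SageMaker classes have been folded into the regular Trainer (to my understanding):

# what I use in place of the deprecated transformers.sagemaker classes
from transformers import Trainer, TrainingArguments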

@philschmid any pointers for me? Thanks in advance