SageMaker model parallelism: running the model results in a maximum recursion error

sagemaker/04_distributed_training_model_parallelism

I have customized run_glue.py to accept my custom data. I get the following error:

File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 309, in thread_execute_tracing
[1,mpirank:0,algo-1]: self._exec_trace_on_device(req, device)
[1,mpirank:0,algo-1]: File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/worker.py", line 268, in _exec_trace_on_device
[1,mpirank:0,algo-1]: outputs = step_fn(*args, **kwargs)
[1,mpirank:0,algo-1]: File "/opt/conda/lib/python3.9/site-packages/transformers/trainer_pt_utils.py", line 1061, in smp_forward_backward
[1,mpirank:0,algo-1]: outputs = model(**inputs)
[1,mpirank:0,algo-1]: File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
[1,mpirank:0,algo-1]: return forward_call(*input, **kwargs)
[1,mpirank:0,algo-1]: File "/opt/conda/lib/python3.9/site-packages/smdistributed/modelparallel/torch/patches/tracing.py", line 75, in trace_forward
[1,mpirank:0,algo-1]: raise e
…this repeats n times, with the final error as follows:
[1,mpirank:0,algo-1]: RecursionError: maximum recursion depth exceeded while calling a Python object
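
For what it's worth, a quick way to tell whether the trace is merely deep or genuinely unbounded is to raise Python's recursion limit at the top of the training script and rerun; a minimal diagnostic sketch (the new limit is an arbitrary choice, and this is not part of run_glue.py):

# diagnostic sketch: if the same RecursionError returns at any limit,
# the tracing recursion is genuinely unbounded rather than just deep
import sys

print(f"current recursion limit: {sys.getrecursionlimit()}")  # usually 1000
sys.setrecursionlimit(10_000)  # arbitrary higher limit for the experiment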

@philschmid, I am tagging you since you created this notebook.

How did you customize it, and how did you start your job, including the hyperparameters?

@philschmid, The only change I have made is loading my custom data and preparing it. The rest of run_glue.py is as it is.

    elif data_args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        raw_datasets = load_dataset(
            data_args.dataset_name,
            data_args.dataset_config_name,
            cache_dir=model_args.cache_dir,
            use_auth_token=True if model_args.use_auth_token else None,
        )
    else:
        # dataset used
        dataset_name = ............
        # Load and prepare dataset
        data_files = {"train": "user_prompt_train.csv", "test": "user_prompt_test.csv"}
        dataset = load_dataset(
            dataset_name,
            data_files=data_files,
            use_auth_token=..........
        )

        train_dataset = dataset["train"].rename_column("text", "sentence1")
        test_dataset = dataset["test"].rename_column("text", "sentence1")

        sentence1_key = "sentence1"
        sentence2_key = None

        # build an index -> label mapping from the unique labels in the train split
        df = train_dataset.to_pandas()
        labels = df["label"].unique().tolist()
        label2class = {i: label for i, label in enumerate(labels)}
        logger.info(f"labels: {json.dumps(label2class)}")
        ClassLabels = ClassLabel(num_classes=len(labels), names=labels)

        # tokenizer helper function: tokenize the sentence column and
        # convert the string labels to integer ids
        def tokenize(batch):
            tokens = tokenizer(batch[sentence1_key], padding="max_length", truncation=True)
            tokens["labels"] = ClassLabels.str2int(batch["label"])
            return tokens

        # tokenize dataset
        train_dataset = train_dataset.map(tokenize, batched=True)
        test_dataset = test_dataset.map(tokenize, batched=True)

        # set format for pytorch
        train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
        test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
        # also, all instances of raw_datasets["train"] have been replaced with
        # train_dataset, and raw_datasets["validation"] with test_dataset.

I have removed the call to the preprocess_data() function as the data is now fully prepared.
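
As an aside, an equivalent way to get integer labels would be to cast the string label column to a ClassLabel feature instead of calling str2int inside tokenize; a minimal sketch of that variant (not what my script currently does):

# alternative sketch: cast "label" from string to ClassLabel so the
# string -> id encoding happens once, at the dataset level
from datasets import ClassLabel

labels = sorted(set(train_dataset["label"]))
class_label = ClassLabel(names=labels)
train_dataset = train_dataset.cast_column("label", class_label)
test_dataset = test_dataset.cast_column("label", class_label)
# after the cast, batch["label"] already holds integer ids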

The rest of the code is exactly the same as the run_glue.py that is linked from the notebook:

Creating an Estimator and start a training job

In this example we are going to use the run_glue.py from the transformers example scripts. We modified it and included SageMakerTrainer instead of the Trainer to enable model-parallelism. You can find the code here.

I have tried with the SageMakerTrainer as well as the regular Trainer. Same result.

In the hyperparameters there are two changes:

hyperparameters={
    'model_name_or_path':model_name,
    #'task_name': 'mnli',
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 16,
    'do_train': True,
    'do_eval': True,
    #'do_predict': True,
    'num_train_epochs': 2,
    'output_dir':'/opt/ml/model',
    'max_steps': 500,
}
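
For context, I create the estimator along the lines of the notebook; the instance type, framework versions, and the smp_options/mpi_options values below are placeholders rather than my exact settings:

# sketch of the estimator wiring, following the notebook's pattern
# (placeholder instance type, versions, and model-parallel settings)
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

smp_options = {"enabled": True, "parameters": {"partitions": 2, "ddp": True}}
mpi_options = {"enabled": True, "processes_per_host": 8}

huggingface_estimator = HuggingFace(
    entry_point="run_glue.py",
    source_dir="./scripts",
    instance_type="ml.p3.16xlarge",  # placeholder
    instance_count=1,
    role=role,
    transformers_version="4.26",  # placeholder framework versions
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters=hyperparameters,
    distribution={"smdistributed": {"modelparallel": smp_options}, "mpi": mpi_options},
)

huggingface_estimator.fit()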

Are you not providing your dataset when creating your job? How should SageMaker have access to user_prompt_train.csv?

Yes, I am.
The dataset name and token have been kept private because I cannot reveal them in a public forum. It loads the data properly. The CSV files have two columns: text, label.

dataset = load_dataset(
    dataset_name,
    data_files=data_files,
    use_auth_token=…
)

@philschmid This is exactly what run_glue.py does for other cases, such as task_name = "mnli":

if data_args.task_name is not None:
    # Downloading and loading a dataset from the hub.
    raw_datasets = load_dataset(
        "glue",
        data_args.task_name,
        cache_dir=model_args.cache_dir,
        use_auth_token=True if model_args.use_auth_token else None,
    )

I wanted to do a quick experiment, so I just hard-coded whatever was needed instead of passing it via the hyperparameters.

@philschmid I also forgot to mention that

#from transformers.sagemaker import SageMakerTrainingArguments as TrainingArguments, SageMakerTrainer as Trainer

does not work: it throws an AttributeError (distributed_state) and says that the import has been deprecated,
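so I use the plain imports instead, which is what newer transformers versions expect now that the SageMaker classes have been folded into the regular Trainer (to my understanding):

# what I use in place of the deprecated transformers.sagemaker classes
from transformers import Trainer, TrainingArguments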

@philschmid any pointers for me? Thanks in advance