Distributed training with SageMaker

Hi Hugging Face team,

I would like to enable distributed training with transformers on SageMaker (as the GPU platform). However, I am seeing the error below, even after setting max_grad_norm=0:

Failure reason
AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "raise ValueError( ValueError: SageMaker Model Parallelism in mixed precision mode does not support gradient clipping yet. Pass along 'max_grad_norm': 0 in your hyperparameters. -------------------------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. mpirun.real detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was Process name: [[41123,1],0] Exit code: 1" Command "mpirun --host algo-1:4 -np 4 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD

I can see the error comes from this line in the transformers library: transformers/src/transformers/trainer.py (main branch, huggingface/transformers on GitHub).
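For reference, my reading of that check is roughly the following (a paraphrase of the relevant lines in trainer.py, not the verbatim source; the exact condition may differ between versions):

# Paraphrase of the check in the transformers Trainer (my reading, not verbatim):
# with SageMaker Model Parallelism and fp16 enabled, gradient clipping must be
# disabled by setting max_grad_norm to 0.
if is_sagemaker_mp_enabled() and args.fp16 and args.max_grad_norm > 0:
    raise ValueError(
        "SageMaker Model Parallelism in mixed precision mode does not support gradient "
        "clipping yet. Pass along 'max_grad_norm': 0 in your hyperparameters."
    )

Given that, I would expect max_grad_norm=0 to silence the error, which is why I am confused.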

However, this requirement is not mentioned anywhere on the "Run training on Amazon SageMaker" page.

Here is how I pass the hyperparameters:

hyperparameters = {
    "epochs": 2,
    "train_batch_size": 32,
    "eval_batch_size": 32,
    "learning_rate": 3e-5,
    "fp16": True,
    "dataset_channel": get_channel_name(),  # default is train
    "mini_size": 1000,
    "model_ckpt": "distilbert-base-uncased",
    "max_grad_norm": 0,
}
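For context on how these values reach the Trainer: as I understand it, SageMaker passes each entry to train_model.py as a named command-line argument (e.g. --max_grad_norm 0), and the script forwards them into TrainingArguments. A trimmed, illustrative sketch of that part of my train_model.py (the real script has more arguments; the names match the dict above):

# Illustrative sketch of the hyperparameter handling in train_model.py (trimmed).
import argparse

from transformers import TrainingArguments

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=2)
parser.add_argument("--train_batch_size", type=int, default=32)
parser.add_argument("--eval_batch_size", type=int, default=32)
parser.add_argument("--learning_rate", type=float, default=3e-5)
parser.add_argument("--fp16", type=lambda v: str(v).lower() == "true", default=True)
parser.add_argument("--max_grad_norm", type=float, default=0.0)
args, _ = parser.parse_known_args()  # ignore any extra arguments SageMaker injects

training_args = TrainingArguments(
    output_dir="/opt/ml/model",                        # standard SageMaker model dir
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    learning_rate=args.learning_rate,
    fp16=args.fp16,
    max_grad_norm=args.max_grad_norm,                  # 0 should disable gradient clipping
)

So max_grad_norm=0 should end up in TrainingArguments, yet the check still fires.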

mpi_options = {"enabled": True, "processes_per_host": 4}

smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "partitions": 4,
        "ddp": True,
    },
}

distribution = {"smdistributed": {"modelparallel": smp_options}, "mpi": mpi_options}

estimator_config = {
    "instance_type": "ml.p3.8xlarge",
    "instance_count": 1,
    "use_spot_instances": True,
    "max_wait": 36000,
    "max_run": 10000,
    "metric_definitions": [
        {"Name": "train_runtime", "Regex": "'train_runtime': ([0-9]+(.)[0-9]+),?"},
        {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9]+(.)[0-9]+),?"},
        {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9]+(.)[0-9]+),?"},
    ],
}


from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train_model.py",
    source_dir="src",
    base_job_name=job_name,
    checkpoint_s3_uri=f"{model_accessor.s3_path_parent}/checkpoints",
    role=client.execution_role_arn,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    instance_type=estimator_config["instance_type"],
    instance_count=estimator_config["instance_count"],
    use_spot_instances=estimator_config["use_spot_instances"],
    max_wait=estimator_config["max_wait"],
    max_run=estimator_config["max_run"],
    metric_definitions=estimator_config["metric_definitions"],
    hyperparameters=hyperparameters,
    distribution=distribution,
)
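For completeness, the job is launched like this (the S3 path below is a placeholder for my actual dataset location; the channel name matches dataset_channel above, which defaults to train):

# Placeholder S3 URI; the channel name corresponds to dataset_channel (default "train").
huggingface_estimator.fit({"train": "s3://my-bucket/path/to/training-data"})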

Would you please advise how I should debug this? Thanks.