Hi Hugging Face team,
I would like to enable distributed training with transformers on SageMaker (as the GPU platform). However, I am seeing this error (even after setting max_grad_norm=0):
Failure reason:

```
AlgorithmError: ExecuteUserScriptError: ExitCode 1
ErrorMessage "raise ValueError(
ValueError: SageMaker Model Parallelism in mixed precision mode does not support gradient clipping yet. Pass along 'max_grad_norm': 0 in your hyperparameters.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
mpirun.real detected that one or more processes exited with non-zero status, thus causing the job to be terminated.
The first process to do so was
  Process name: [[41123,1],0]
  Exit code: 1"
Command "mpirun --host algo-1:4 -np 4 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD
```
I can see the error comes from this line in the transformers library: transformers/src/transformers/trainer.py at main · huggingface/transformers · GitHub.
However, this limitation is not mentioned anywhere on the "Run training on Amazon SageMaker" page.
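For what it's worth, from reading that line the check seems to fire whenever fp16 is on under SageMaker Model Parallelism and max_grad_norm is positive. Since transformers' TrainingArguments defaults max_grad_norm to 1.0, the estimator hyperparameter has no effect unless the entry script actually forwards it into TrainingArguments. A minimal sketch of that suspected failure mode (the guard body below is my paraphrase, not the exact trainer.py code):

```python
from dataclasses import dataclass


@dataclass
class ArgsStub:
    """Stand-in for transformers.TrainingArguments (same defaults for these two fields)."""
    fp16: bool = False
    max_grad_norm: float = 1.0  # TrainingArguments default


def smp_fp16_clip_guard(args: ArgsStub) -> None:
    """Paraphrase of the guard in trainer.py: SMP + mixed precision + clipping -> error."""
    if args.fp16 and args.max_grad_norm is not None and args.max_grad_norm > 0:
        raise ValueError(
            "SageMaker Model Parallelism in mixed precision mode does not support "
            "gradient clipping yet. Pass along 'max_grad_norm': 0 in your hyperparameters."
        )


# If the entry script never forwards max_grad_norm, the default 1.0 trips the guard:
try:
    smp_fp16_clip_guard(ArgsStub(fp16=True))
    raised = False
except ValueError:
    raised = True

# Forwarding 0 into the TrainingArguments makes the guard pass:
smp_fp16_clip_guard(ArgsStub(fp16=True, max_grad_norm=0.0))
```

So one thing to check is whether train_model.py actually sets max_grad_norm on the TrainingArguments it constructs, rather than leaving the default.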
Here is how I pass the hyperparameters:
```python
hyperparameters = {
    "epochs": 2,
    "train_batch_size": 32,
    "eval_batch_size": 32,
    "learning_rate": 3e-5,
    "fp16": True,
    "dataset_channel": get_channel_name(),  # default is train
    "mini_size": 1000,
    "model_ckpt": "distilbert-base-uncased",
    "max_grad_norm": 0,
}
```
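One detail that may matter here: as far as I understand, SageMaker serializes every hyperparameter value to a string before handing it to the entry script on the command line, so `"max_grad_norm": 0` arrives as text and needs an explicit `type=float` in the script's parser. A sketch of that parsing (the argument names mirror the dict above; the boolean serialization format is an assumption, so the helper accepts both `"True"` and `"true"`):

```python
import argparse


def str2bool(value: str) -> bool:
    # Accept "True"/"true" and any stray JSON quoting; the exact serialization
    # the estimator uses is an assumption here.
    return value.strip().strip('"').lower() == "true"


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--fp16", type=str2bool, default=False)
    # Without type=float this would come back as the string "0",
    # and a string comparison against 0 behaves differently.
    parser.add_argument("--max_grad_norm", type=float, default=0.0)
    return parser.parse_args(argv)


# Simulate what the training container would pass on the command line:
args = parse_args(["--max_grad_norm", "0", "--fp16", "True"])
```

If the script parses `max_grad_norm` without a numeric type, or drops it entirely, the value the Trainer sees won't be the 0 set in the estimator.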
```python
mpi_options = {"enabled": True, "processes_per_host": 4}
smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "partitions": 4,
        "ddp": True,
    },
}
distribution = {"smdistributed": {"modelparallel": smp_options}, "mpi": mpi_options}
```
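A side observation on this distribution config: with `partitions: 4` and 4 processes on a single ml.p3.8xlarge (4 GPUs), the data-parallel degree works out to 1, so `ddp: True` has no replicas to synchronize. A quick arithmetic check (the divisibility rule is my understanding of how SMP splits ranks, not a quote from the docs):

```python
# Assumption: SMP splits the total ranks into model-parallel groups of size
# `partitions`; the quotient is the data-parallel degree that ddp works over.
instance_count = 1
processes_per_host = 4
partitions = 4

total_ranks = instance_count * processes_per_host
assert total_ranks % partitions == 0, "total ranks must be divisible by partitions"
dp_degree = total_ranks // partitions
# dp_degree == 1 here: every rank holds a different model partition,
# so there is nothing for data parallelism to replicate.
```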
```python
estimator_config = {
    "instance_type": "ml.p3.8xlarge",
    "instance_count": 1,
    "use_spot_instances": True,
    "max_wait": 36000,
    "max_run": 10000,
    "metric_definitions": [
        {"Name": "train_runtime", "Regex": "'train_runtime': ([0-9]+(.)[0-9]+),?"},
        {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9]+(.)[0-9]+),?"},
        {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9]+(.)[0-9]+),?"},
    ],
}
```
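Unrelated to the crash, but the metric regexes can be dry-run locally against a sample log line before launching a job. Note that `(.)` matches any character; `\.` would be stricter, though the pattern as written still extracts the value from normal Trainer output (the sample line below is made up for illustration):

```python
import re

# Pattern copied verbatim from the estimator_config above.
pattern = r"'eval_loss': ([0-9]+(.)[0-9]+),?"

# A fabricated example of a Trainer evaluation log line:
sample = "{'eval_loss': 0.4213, 'eval_accuracy': 0.8671, 'eval_runtime': 12.3}"

match = re.search(pattern, sample)
value = float(match.group(1))  # group 1 captures the full number, e.g. 0.4213
```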
```python
huggingface_estimator = HuggingFace(
    entry_point="train_model.py",
    source_dir="src",
    base_job_name=job_name,
    checkpoint_s3_uri=f"{model_accessor.s3_path_parent}/checkpoints",
    role=client.execution_role_arn,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    instance_type=estimator_config["instance_type"],
    instance_count=estimator_config["instance_count"],
    use_spot_instances=estimator_config["use_spot_instances"],
    max_wait=estimator_config["max_wait"],
    max_run=estimator_config["max_run"],
    metric_definitions=estimator_config["metric_definitions"],
    hyperparameters=hyperparameters,
    distribution=distribution,
)
```
Could you please advise how I should debug this? Thanks.