IAM role permissions to train a Hugging Face model on Amazon SageMaker

I am looking to fine-tune GPT-J using Amazon SageMaker from my local environment. I have been following the tutorials and documentation at https://huggingface.co/docs/sagemaker/getting-started and https://huggingface.co/docs/sagemaker/inference#deploy-with-model_data. My training dataset is stored in S3, but I am running into errors caused by IAM role permissions. There is very little documentation covering which permissions are actually required to train a Hugging Face model using SageMaker.

If anyone knows which IAM role permissions are required to train a Hugging Face model, that would be great!

You can find the IAM permissions required by each SageMaker API action in the AWS reference "Amazon SageMaker API Permissions: Actions, Permissions, and Resources Reference".

The action most relevant to your use case is CreateTrainingJob (see "CreateTrainingJob" in the Amazon SageMaker API reference), which requires the following permissions:

  • sagemaker:CreateTrainingJob
  • iam:PassRole
  • kms:CreateGrant (required only if the associated ResourceConfig has a specified VolumeKmsKeyId and the associated role does not have a policy that permits this action)
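
As a rough sketch, the permissions listed above could be collected into an identity policy for the user or role that submits the job from your local machine. The helper name `build_submitter_policy`, the account ID, and the role name are placeholders for illustration. Limiting `iam:PassRole` to a single role with the `iam:PassedToService` condition is a suggested hardening, not something the reference requires, and `kms:CreateGrant` is left out on the assumption that no `VolumeKmsKeyId` is set.

```python
import json


def build_submitter_policy(execution_role_arn: str) -> dict:
    """Identity policy for the principal that calls CreateTrainingJob.

    execution_role_arn is the role SageMaker will assume for the job
    (placeholder value in the usage below).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Permission to start training jobs.
                "Effect": "Allow",
                "Action": ["sagemaker:CreateTrainingJob"],
                "Resource": "*",
            },
            {
                # Permission to hand the execution role to SageMaker,
                # restricted to that one role and that one service.
                "Effect": "Allow",
                "Action": ["iam:PassRole"],
                "Resource": execution_role_arn,
                "Condition": {
                    "StringEquals": {
                        "iam:PassedToService": "sagemaker.amazonaws.com"
                    }
                },
            },
        ],
    }


# Placeholder ARN: replace with your own account ID and role name.
submitter_policy = build_submitter_policy(
    "arn:aws:iam::123456789012:role/YOUR-SAGEMAKER-ROLE"
)
print(json.dumps(submitter_policy, indent=4))
```

The printed JSON can be pasted into the IAM console as an inline policy for the submitting user or role.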

To allow the training job to access data in the S3 bucket, a policy like the following, attached to the execution role, should work:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET",
                "arn:aws:s3:::YOUR-BUCKET/*"
            ]
        }
    ]
}
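
A policy like this only takes effect once it is attached to a role that SageMaker is allowed to assume, so the role also needs a trust policy naming the `sagemaker.amazonaws.com` service. Below is a minimal sketch that generates both documents; the helper names `build_s3_policy` and `build_trust_policy` are hypothetical. The `allow_write` flag is an assumption on my part: the policy above is read-only, and a job that writes its model artifacts to the same bucket would most likely also need `s3:PutObject`.

```python
import json

# Read-only actions, matching the S3 policy shown above.
READ_ACTIONS = [
    "s3:GetBucketLocation",
    "s3:GetObject",
    "s3:GetObjectVersion",
    "s3:ListBucket",
]


def build_s3_policy(bucket: str, allow_write: bool = False) -> dict:
    """Permissions policy giving the execution role access to one bucket."""
    actions = list(READ_ACTIONS)
    if allow_write:
        # Assumption: only needed when the job writes output to this bucket.
        actions.append("s3:PutObject")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # bucket-level actions
                    f"arn:aws:s3:::{bucket}/*",  # object-level actions
                ],
            }
        ],
    }


def build_trust_policy() -> dict:
    """Trust policy letting the SageMaker service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


s3_policy = build_s3_policy("YOUR-BUCKET", allow_write=True)
trust_policy = build_trust_policy()
print(json.dumps(s3_policy, indent=4))
print(json.dumps(trust_policy, indent=4))
```

The trust policy goes on the role itself (its "trust relationship"), while the S3 policy is attached to the role as a regular permissions policy.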

To restrict the S3 bucket further, you can attach a bucket policy that names the exact principal allowed to access it. The principal has to be an IAM identity, such as the execution role that the training job assumes, rather than the training job itself:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::account-id:role/YOUR-SAGEMAKER-EXECUTION-ROLE"
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET",
                "arn:aws:s3:::YOUR-BUCKET/*"
            ]
        }
    ]
}