Multi Instance Training Error

I've been working closely with AWS to solve this issue, and they told me to post here. I've been trying to get multi-instance training working with AWS SageMaker and the Hugging Face estimators. My code works for single-instance non-distributed training and single-instance distributed training, but not for multi-instance distributed training. I am using the huggingface-pytorch-training:1.7-transformers4.6-gpu-py36-cu110-ubuntu18.04 image. The image is in our internal ECR because we run in a VPC.

Here is the code I am using. It calls the same script as this repo (SageMaker-HuggingFace-Workshop/ at main · C24IO/SageMaker-HuggingFace-Workshop · GitHub). I get a FileNotFoundError after training, when the script tries to load the model. I must be forgetting to set the correct path somewhere.

import sagemaker
import time
from sagemaker.huggingface import HuggingFace
import logging
import os
from sagemaker.s3 import S3Uploader

role = 'ROLE'
default_bucket = 'BUCKET_NAME'

local_train_dataset = "amazon_us_reviews_apparel_v1_00_train.json"
local_test_dataset = "amazon_us_reviews_apparel_v1_00_test.json"

# s3 uris for datasets
remote_train_dataset = f"s3://{default_bucket}/"
remote_test_dataset = f"s3://{default_bucket}/"

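# upload datasets
S3Uploader.upload(local_train_dataset, remote_train_dataset)
S3Uploader.upload(local_test_dataset, remote_test_dataset)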

print(f"train dataset uploaded to: {remote_train_dataset}/{local_train_dataset}")
print(f"test dataset uploaded to: {remote_test_dataset}/{local_test_dataset}")

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,                          # number of training epochs
                 'train_batch_size': 32,               # batch size for training
                 'eval_batch_size': 64,                # batch size for evaluation
                 'learning_rate': 3e-5,                # learning rate used during training
                 'model_id':'distilbert-base-uncased', # pre-trained model
                 'fp16': True,                         # Whether to use 16-bit (mixed) precision training
                 'train_file': local_train_dataset,    # training dataset
                 'test_file': local_test_dataset,      # test dataset
                 }

# metric definitions used to extract metrics from the training logs
metric_definitions = [
    {'Name': 'eval_loss',               'Regex': r"'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy',           'Regex': r"'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1',                 'Regex': r"'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision',          'Regex': r"'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"}]

# define Training Job Name 
job_name = f'huggingface-workshop-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

discovery_bucket_kms = 'KMS'
subnets = ['subnet-xxx']
security_group_ids = ['sg-xxx','sg-xxz','sg-xxy','sg-xx10']
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

logging.debug('Creating the Estimator')

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = '',      
    source_dir           = 'scripts',     
    instance_type        = 'ml.p3.16xlarge',
    instance_count       = 2,             
    base_job_name        = job_name,        
    role                 = role,            
    transformers_version = '4.6',            
    pytorch_version      = '1.7',             
    py_version           = 'py36',          
    hyperparameters      = hyperparameters,   
    metric_definitions   = metric_definitions,
    distribution = distribution, 

    output_kms_key = discovery_bucket_kms,
    subnets = subnets,
    security_group_ids = security_group_ids,
    enable_network_isolation = True,
    encrypt_inter_container_traffic = True,
    image_uri = 'INTERNAL_ECR_URI'
)

# define a data input dictionary with our uploaded s3 uris
training_data = {
    'train': remote_train_dataset,
    'test': remote_test_dataset
}

logging.debug('Running Fit')
huggingface_estimator.fit(training_data)

Hello @bkuchars,

Thank you for opening this thread!

To confirm: are you running inside a VPC without internet access?
If so, the script cannot work, because in lines 55 and 87 you load the tokenizer and model with the .from_pretrained method. Transformers tries to download the model from the Hugging Face Hub (Models - Hugging Face), and without internet access this won’t work.

To solve this, upload your model to S3 and add a third key, "model", to training_data that points to its S3 URI. SageMaker will then download the model into the training container on startup, and you can load it from disk.
You would therefore need to change the argument of .from_pretrained() to the local directory where the model is saved. In this case, it should be /opt/ml/input/data/model.
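
A minimal sketch of that setup, assuming the extra channel is named "model" and the model files sit under a placeholder S3 prefix. SageMaker exposes each input channel to the script through an SM_CHANNEL_<NAME> environment variable; the helper name resolve_model_dir is only illustrative and not part of any library.

```python
import os

# Default location where SageMaker mounts an input channel named "model".
DEFAULT_MODEL_CHANNEL_DIR = "/opt/ml/input/data/model"


def resolve_model_dir(environ=None):
    """Return the local directory that holds the pre-trained model files.

    Hypothetical helper: prefers SM_CHANNEL_MODEL (set by SageMaker when a
    channel named "model" is passed to fit) and falls back to the default
    mount path.
    """
    environ = os.environ if environ is None else environ
    return environ.get("SM_CHANNEL_MODEL", DEFAULT_MODEL_CHANNEL_DIR)


# On the notebook side (placeholder bucket and prefix, not real URIs):
# training_data = {
#     "train": remote_train_dataset,
#     "test": remote_test_dataset,
#     "model": "s3://BUCKET_NAME/PREFIX/",
# }

# Inside the training script, the directory is then passed to from_pretrained:
# tokenizer = AutoTokenizer.from_pretrained(resolve_model_dir())
model_dir = resolve_model_dir()
print(model_dir)
```

Reading the path from the environment variable rather than hard-coding it keeps the script working if the channel is renamed or mounted elsewhere.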

Yes, I am running without internet access. I apologize, I should have mentioned that I did change the script to pass a local directory into from_pretrained(). Below is the exact script used. I have a scripts/ folder which contains the training script and the folder distilbert-base-cased-local, which contains all the model files.

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_dataset
import random
import logging
import sys
import argparse
import os
import torch

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument("--model_id", type=str)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--train_file", type=str, default="amazon_us_reviews_apparel_v1_00_train.json")
    parser.add_argument("--test_file", type=str, default="amazon_us_reviews_apparel_v1_00_test.json")
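    # type=bool would treat any non-empty string (including "False") as True
    parser.add_argument("--fp16", type=lambda value: str(value).lower() == "true", default=True)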

    # Data, model, and output directories
    parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

    args, _ = parser.parse_known_args()

    # Set up logging
    logger = logging.getLogger(__name__)

    logging.basicConfig(
        level=logging.INFO,
        handlers=[logging.StreamHandler(sys.stdout)],
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )

    # load datasets
    raw_train_dataset = load_dataset("json", data_files=os.path.join(args.training_dir, args.train_file))["train"]
    raw_test_dataset = load_dataset("json", data_files=os.path.join(args.test_dir, args.test_file))["train"]
    # load tokenizer
    tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased-local')

    # preprocess function, tokenizes text
    def preprocess_function(examples):
        return tokenizer(examples["review"], padding="max_length", truncation=True)

    # preprocess dataset
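    train_dataset = raw_train_dataset.map(preprocess_function, batched=True)
    test_dataset = raw_test_dataset.map(preprocess_function, batched=True)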

    # define labels
    num_labels = len(train_dataset.unique("label"))

    # print size
    logger.info(f" loaded train_dataset length is: {len(train_dataset)}")
    logger.info(f" loaded test_dataset length is: {len(test_dataset)}")

    # compute metrics function for binary classification
    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="micro")
        acc = accuracy_score(labels, preds)
        return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

    # load model from the local directory
    model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-cased-local', num_labels=num_labels)

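    # define training args
    # (arguments reconstructed from the hyperparameters parsed above; treat them as assumed)
    training_args = TrainingArguments(
        output_dir=args.model_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.eval_batch_size,
        warmup_steps=args.warmup_steps,
        fp16=args.fp16,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=float(args.learning_rate),
        load_best_model_at_end=True,
    )

    # create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
    )

    # train model
    trainer.train()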

    # evaluate model
    eval_result = trainer.evaluate(eval_dataset=test_dataset)

    # writes eval result to file which can be accessed later in s3 output
    with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer:
        print(f"***** Eval results *****")
        for key, value in sorted(eval_result.items()):
            writer.write(f"{key} = {value}\n")

    # update the config for prediction
    label2id = {
        "1 star": 0,
        "2 star": 1,
        "3 star": 2,
        "4 star": 3,
        "5 star": 4,
    }
    id2label = {
        0: "1 star",
        1: "2 star",
        2: "3 star",
        3: "4 star",
        4: "5 star",
    }
    trainer.model.config.label2id = label2id
    trainer.model.config.id2label = id2label

    # Saves the model to s3
    trainer.save_model(args.model_dir)

The issue is that the local folder containing your model is not available when the training runs. Only the script will be pushed to the SageMaker environment.
That’s why I suggested pushing the model to S3, providing it via training_data, and then loading it from /opt/ml/input/data/model.

I figured out my issue. The error came from loading the best model after training, which is triggered by the flag load_best_model_at_end=True. The model is not saved on every node during multi-instance training, so a FileNotFoundError occurs on the nodes that do not have it. This flag will cause issues in the Trainer if you're using transformers <= 4.6.

The issue is solved by using a version of the AWS Deep Learning Containers that ships a Transformers version greater than 4.6, or by setting load_best_model_at_end=False.
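
A sketch of the second workaround, assuming the script knows the installed Transformers version and the number of instances. The helper name and its arguments are illustrative and not part of the Trainer API; it only switches the flag off for the combination that failed for me (multi instance with Transformers 4.6 or older).

```python
def use_load_best_model(transformers_version, instance_count):
    """Decide whether load_best_model_at_end is safe to enable.

    Hypothetical helper: returns False for multi-instance jobs running
    Transformers <= 4.6, the combination described above.
    """
    # Compare only major.minor, e.g. "4.6.1" -> (4, 6)
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    if instance_count > 1 and (major, minor) <= (4, 6):
        return False
    return True


# The result would be passed as TrainingArguments(load_best_model_at_end=...)
print(use_load_best_model("4.6.1", instance_count=2))
print(use_load_best_model("4.11.0", instance_count=2))
```

Inside a training script the version could come from transformers.__version__, and the instance count could be derived from the SM_HOSTS environment variable, which I am assuming here holds a JSON list of the hosts in the job.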

Thanks for the help!


You can now upgrade to the latest sagemaker version and then use pytorch_version="1.9", transformers_version="4.11"