Multi-Instance Training Error

I've been working closely with AWS to solve this issue, and they told me to post here. I've been trying to get multi-instance training working with the AWS SageMaker x Hugging Face estimators. My code works fine for single-instance non-distributed training and single-instance distributed training, but it does not work for multi-instance distributed training. I am using the huggingface-pytorch-training:1.7-transformers4.6-gpu-py36-cu110-ubuntu18.04 image. The image is in our internal ECR because we run in a VPC.

Here is the code I am using. It calls the same train.py from this repo (SageMaker-HuggingFace-Workshop/train.py at main · C24IO/SageMaker-HuggingFace-Workshop · GitHub). I get a FileNotFoundError after training, when the script is trying to load the model. I must be forgetting to set the correct path somewhere.

import sagemaker
import time
from sagemaker.huggingface import HuggingFace
import logging
import os
from sagemaker.s3 import S3Uploader

role = 'ROLE'
default_bucket = 'BUCKET_NAME'

# SageMaker session (referenced as `sess` by the estimator below)
sess = sagemaker.Session()

local_train_dataset = "amazon_us_reviews_apparel_v1_00_train.json"
local_test_dataset = "amazon_us_reviews_apparel_v1_00_test.json"

# s3 uris for datasets
remote_train_dataset = f"s3://{default_bucket}/"
remote_test_dataset = f"s3://{default_bucket}/"


# upload datasets
S3Uploader.upload(local_train_dataset,remote_train_dataset)
S3Uploader.upload(local_test_dataset,remote_test_dataset)

print(f"train dataset uploaded to: {remote_train_dataset}/{local_train_dataset}")
print(f"test dataset uploaded to: {remote_test_dataset}/{local_test_dataset}")



# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,                          # number of training epochs
                 'train_batch_size': 32,               # batch size for training
                 'eval_batch_size': 64,                # batch size for evaluation
                 'learning_rate': 3e-5,                # learning rate used during training
                 'model_id':'distilbert-base-uncased', # pre-trained model
                 'fp16': True,                         # Whether to use 16-bit (mixed) precision training
                 'train_file': local_train_dataset,    # training dataset
                 'test_file': local_test_dataset,      # test dataset
                 }

metric_definitions=[
    {'Name': 'eval_loss',               'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy',           'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1',                 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision',          'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"}]


# define Training Job Name 
job_name = f'huggingface-workshop-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

discovery_bucket_kms = 'KMS'
subnets = ['subnet-xxx']
security_group_ids = ['sg-xxx','sg-xxz','sg-xxy','sg-xx10']
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

logging.debug('Creating the Estimator')

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',      
    source_dir           = 'scripts',     
    instance_type        = 'ml.p3.16xlarge',
    instance_count       = 2,             
    base_job_name        = job_name,        
    role                 = role,            
    transformers_version = '4.6',            
    pytorch_version      = '1.7',             
    py_version           = 'py36',          
    hyperparameters      = hyperparameters,   
    metric_definitions   = metric_definitions,
    sagemaker_session   = sess,
    distribution        = distribution,

    # SECURITY CONFIGS
    output_kms_key = discovery_bucket_kms,
    subnets = subnets,
    security_group_ids = security_group_ids,
    enable_network_isolation = True,
    encrypt_inter_container_traffic = True,
    
    image_uri = 'INTERNAL_ECR_URI'
    )

# define a data input dictonary with our uploaded s3 uris
training_data = {
    'train': remote_train_dataset,
    'test': remote_test_dataset
}
logging.debug('Running Fit')
huggingface_estimator.fit(training_data)


Hello @bkuchars,

Thank you for opening this thread!

To confirm: are you running inside a VPC without internet access?
If so, the script train.py cannot work, since on lines 55 and 87 you are loading the tokenizer and model with the .from_pretrained method. Transformers tries to download the model from the Hugging Face Hub (Models - Hugging Face), and without internet access this won't work.

To solve this, you would need to upload your model to S3 and then provide a third key, "model", in training_data pointing to the S3 URI. SageMaker will then also download the model to the runtime at startup, and you can load it from disk.
You would therefore need to change the argument of .from_pretrained() to the local directory where the model is saved; in this case, it should be /opt/ml/input/data/model.
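For example, here is a minimal sketch of the notebook side. The local model folder name and the S3 prefix are placeholders, not part of the code above:

from sagemaker.s3 import S3Uploader

# upload the local model folder to S3 (placeholder names)
remote_model = S3Uploader.upload("distilbert-base-uncased-local",
                                 f"s3://{default_bucket}/models/distilbert-base-uncased")

# add a third channel so SageMaker downloads the model to /opt/ml/input/data/model
training_data = {
    "train": remote_train_dataset,
    "test": remote_test_dataset,
    "model": remote_model,
}
huggingface_estimator.fit(training_data)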

Yes, we are running without internet access. I apologize; I did change train.py to pass a local directory to from_pretrained(). Below is the exact train.py used. I have a scripts/ folder which contains train.py and the folder distilbert-base-cased-local, which contains all the model files.

from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    AutoTokenizer,
    default_data_collator,
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_dataset
import random
import logging
import sys
import argparse
import os
import torch

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument("--model_id", type=str)
    parser.add_argument("--learning_rate", type=str, default=5e-5)
    parser.add_argument("--train_file", type=str, default="amazon_us_reviews_apparel_v1_00_train.json")
    parser.add_argument("--test_file", type=str, default="amazon_us_reviews_apparel_v1_00_test.json")
    parser.add_argument("--fp16", type=bool, default=True)

    # Data, model, and output directories
    parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

    args, _ = parser.parse_known_args()

    # Set up logging
    logger = logging.getLogger(__name__)

    logging.basicConfig(
        level=logging.getLevelName("INFO"),
        handlers=[logging.StreamHandler(sys.stdout)],
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )

    print('here')
    # load datasets
    raw_train_dataset = load_dataset("json", data_files=os.path.join(args.training_dir, args.train_file))["train"]
    raw_test_dataset = load_dataset("json", data_files=os.path.join(args.test_dir, args.test_file))["train"]
    
    # load tokenizer
    tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased-local')

    # preprocess function, tokenizes text
    def preprocess_function(examples):
        return tokenizer(examples["review"], padding="max_length", truncation=True)

    # preprocess dataset
    train_dataset = raw_train_dataset.map(
        preprocess_function,
        batched=True,
    )
    test_dataset = raw_test_dataset.map(
        preprocess_function,
        batched=True,
    )

    # define labels
    num_labels = len(train_dataset.unique("label"))

    # print size
    logger.info(f" loaded train_dataset length is: {len(train_dataset)}")
    logger.info(f" loaded test_dataset length is: {len(test_dataset)}")

    # compute metrics function for binary classification
    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="micro")
        acc = accuracy_score(labels, preds)
        return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

    # load model from the local directory (instead of the model hub)
    model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-cased-local', num_labels=num_labels)

    # define training args
    training_args = TrainingArguments(
        output_dir=args.model_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.eval_batch_size,
        warmup_steps=args.warmup_steps,
        fp16=args.fp16,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_dir=f"{args.output_data_dir}/logs",
        learning_rate=float(args.learning_rate),
        load_best_model_at_end=True,
        metric_for_best_model="f1",
    )

    # create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

    # train model
    trainer.train()

    # evaluate model
    eval_result = trainer.evaluate(eval_dataset=test_dataset)

    # writes eval result to a file which can be accessed later in the S3 output
    with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer:
        print(f"***** Eval results *****")
        for key, value in sorted(eval_result.items()):
            writer.write(f"{key} = {value}\n")

    # update the config for prediction
    label2id = {
        "1 star": 0,
        "2 star": 1,
        "3 star": 2,
        "4 star": 3,
        "5 star": 4,
    }
    id2label = {
        0: "1 star",
        1: "2 star",
        2: "3 star",
        3: "4 star",
        4: "5 star",
    }
    trainer.model.config.label2id = label2id
    trainer.model.config.id2label = id2label

    # Saves the model to s3
    trainer.save_model(args.model_dir)


The issue is that the local folder containing your model is not available when running the training job; only the script is pushed to the SageMaker environment.
That's why I suggested pushing the model to S3 and then providing it via training_data and huggingface_estimator.fit(training_data).
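For reference, a minimal sketch of the train.py side of that approach, assuming a "model" channel was passed to fit() (so SageMaker exposes it at /opt/ml/input/data/model) and the imports and num_labels from the script above:

# the "model" channel is mounted by SageMaker; SM_CHANNEL_MODEL points at it
model_dir = os.environ.get("SM_CHANNEL_MODEL", "/opt/ml/input/data/model")
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir, num_labels=num_labels)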

I figured out my issue. The error I was getting came from loading the best model after training via the flag load_best_model_at_end=True. The model is not saved on every node when doing multi-instance training, so the FileNotFoundError occurs. This flag will cause issues in the Trainer if you're using transformers <= 4.6.
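As a sketch of the workaround on transformers <= 4.6 (keeping the other TrainingArguments from train.py above unchanged):

training_args = TrainingArguments(
    output_dir=args.model_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=False,  # avoids the FileNotFoundError on multi-instance runs
)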

The issue is solved by using a version of the AWS Deep Learning Containers with a Transformers version greater than 4.6, such as 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.9.0-transformers4.11.0-gpu-py38-cu111-ubuntu20.04, or by setting load_best_model_at_end=False.

Thanks for the help!


You can now upgrade to the latest SageMaker SDK version and then use pytorch_version="1.9" and transformers_version="4.11".
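A minimal sketch of the estimator with those versions (assuming the other arguments, hyperparameters, and security settings stay as in the original snippet):

huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',
    source_dir           = 'scripts',
    instance_type        = 'ml.p3.16xlarge',
    instance_count       = 2,
    role                 = role,
    transformers_version = '4.11',
    pytorch_version      = '1.9',
    py_version           = 'py38',   # the 4.11 containers ship with Python 3.8
    hyperparameters      = hyperparameters,
    distribution         = {'smdistributed': {'dataparallel': {'enabled': True}}},
    sagemaker_session    = sess,
)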