Finetuning sentence embedding model with SageMaker - how to compute loss?

I’m looking for a model that returns an embedding vector I can use in downstream classification tasks. I have been able to deploy the pretrained model sentence-transformers/all-mpnet-base-v2 from the Hugging Face Hub to an endpoint and get embeddings from it. However, when I try to fine-tune the model with huggingface_estimator I get an error: the model does not appear to return a loss, since it has no labels (the data I am supplying is simply more domain-specific text examples). How can I pass a loss when fine-tuning a sentence transformer?

So first I used the HuggingFaceModel class from the SageMaker toolkit:

from sagemaker.huggingface import HuggingFaceModel

hub = {
  'HF_MODEL_ID':'sentence-transformers/all-mpnet-base-v2',
  'HF_TASK':'feature-extraction'
}
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    env=hub,
    role=role,
)

In this example from Amazon (Fine-tune and host Hugging Face BERT models on Amazon SageMaker | AWS Machine Learning Blog), they first fine-tune the model using the huggingface_estimator, then create a new HuggingFaceModel object and pass it the model_data from the fine-tuning job, like this:

from sagemaker.huggingface.model import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    env={'HF_TASK': 'sentiment-analysis'},
    model_data=huggingface_estimator.model_data,
    role=role,                     # IAM role with permissions to create an endpoint
    transformers_version="4.6.1",  # transformers version used
    pytorch_version="1.7.1",       # pytorch version used
    py_version='py36',             # python version
)

My problem is that when I try to fine-tune on my data, since the model is not a classifier, the model output does not contain a loss, and I get a KeyError from compute_loss() in transformers/trainer.py (transformers/trainer.py at 3977b58437b8ce1ea1da6e31747d888efec2419b · huggingface/transformers · GitHub):


KeyError: 'loss'

Does it make sense to fine-tune an embedding model? Is there a way to pass a loss function and have it included in the model output?

I am just passing it additional unlabelled data. How would one do this in SageMaker, given that it is a feature-extraction model and not a classification/prediction model?

Thanks
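
One way to obtain a loss from unlabelled sentences is a contrastive objective with in-batch negatives: two views of the same sentence should be closer to each other than to the other sentences in the batch. The sketch below is plain Python, not the sentence-transformers API; the function names, the toy vectors, and the `scale` value are assumptions for illustration. The sentence-transformers library provides losses of this family (for example `MultipleNegativesRankingLoss`), which would likely be the practical choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i].

    All other positives in the batch act as negatives for that anchor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the row of logits.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy embeddings (hypothetical): each anchor matches the positive at the same index.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_contrastive_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each anchor is nearest to its own positive and grows when the pairs are mismatched, which gives the optimizer a signal without any class labels.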

Hello @kjackson,

How did you try to fine-tune it? The code you shared only shows how to deploy it.
We have a detailed “Getting Started” example with video support to run your first training on Amazon SageMaker: Get started. This might help you get going.

Does it make sense to fine-tune an embedding model? Is there a way to pass a loss function and have it included in the model output?

Yes, it makes sense to further fine-tune a language model to let it “learn” a new context/domain.
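
A common way to adapt a language model to a new domain with unlabelled text is masked language modeling, where the labels are derived from the text itself. The sketch below is a simplified, plain-Python illustration of that idea; the token ids, `mask_id`, and function name are hypothetical. In transformers, `DataCollatorForLanguageModeling` together with a masked-LM head such as `AutoModelForMaskedLM` plays this role and applies a more elaborate masking scheme than shown here.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default


def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with mask_id and build labels.

    Labels hold the original token at masked positions and IGNORE_INDEX
    everywhere else, so only masked positions contribute to the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)
    return inputs, labels


# Hypothetical token ids for one sentence; 103 stands in for the mask token id.
token_ids = [101, 2023, 2003, 1037, 7099, 6251, 102]
inputs, labels = mask_tokens(token_ids, mask_id=103, mask_prob=0.5)
```

Note that this kind of objective adapts the encoder to domain text; whether it improves pooled sentence embeddings for a given downstream task would need to be checked empirically.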

Here is the code. I’m using a p3.2xlarge SageMaker notebook instance. If I run the code with the conda_pytorch_p36 kernel I get this error:

ImportError: Using FP16 with APEX but APEX is not installed, please refer to GitHub - NVIDIA/apex: A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch.

If I run it with the pytorch_latest kernel I get:

> 
> ~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
>    1774         else:
>    1775             # We don't use .loss here since the model may return tuples instead of ModelOutput.
> -> 1776             loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
>    1777 
>    1778         return (loss, outputs) if return_outputs else loss
> 
> ~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/file_utils.py in __getitem__(self, k)
>    1736         if isinstance(k, str):
>    1737             inner_dict = {k: v for (k, v) in self.items()}
> -> 1738             return inner_dict[k]
>    1739         else:
>    1740             return self.to_tuple()[k]
> 
> KeyError: 'loss'

I believe I’m getting the ‘loss’ key error because I’m doing something wrong with the training arguments. I have been trying to pass different loss functions etc, but no luck so far.
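
The traceback points at the line where the Trainer looks up `outputs["loss"]`. A bare encoder loaded with `AutoModel` returns hidden states only, whereas classes with a task head (for example `AutoModelForSequenceClassification` or `AutoModelForMaskedLM`) compute a loss when the batch contains labels. The sketch below reproduces just that lookup with plain dictionaries standing in for model outputs; the dictionaries and values are made up for illustration.

```python
def compute_loss(outputs):
    """Mirror of the lookup shown in the traceback above."""
    return outputs["loss"] if isinstance(outputs, dict) else outputs[0]


# Stand-in for what a bare encoder returns: hidden states, no loss.
base_outputs = {"last_hidden_state": [[0.1, 0.2]]}

# Stand-in for a model with a task head that received labels.
head_outputs = {"loss": 0.73, "logits": [[0.4, 0.6]]}

try:
    compute_loss(base_outputs)
    raised = False
    missing_key = None
except KeyError as err:
    raised = True
    missing_key = err.args[0]

head_loss = compute_loss(head_outputs)
```

So the training arguments are not the cause here; the model class and the absence of labels in the batch are.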

Full code to reproduce on p3.2xlarge below - thanks for the help!


!pip install "sagemaker>=2.48.0" "transformers==4.6.1" "datasets[s3]==1.6.2" --upgrade

# Download some training data

!wget https://github.com/saurabh3949/Text-Classification-Datasets/raw/master/dbpedia_csv.tar.gz
!tar -xzvf dbpedia_csv.tar.gz

import pandas as pd
import json
import os

# Write small train and test files

df = pd.read_csv('dbpedia_csv/train.csv', header = None)

# write as small train input file
with open('train_text.json', 'w') as outfile:
    for desc in df.iloc[:10000, 2]:
        json.dump({"inputs": desc}, outfile)
        outfile.write('\n')
        
with open('test_text.json', 'w') as outfile:
    for desc in df.iloc[10000:15000, 2]:
        json.dump({"inputs": desc}, outfile)
        outfile.write('\n')

!sudo chmod 777 /opt/ml

from transformers import AutoTokenizer, AutoModel
from transformers import BertTokenizer, BertForSequenceClassification
import torch
import torch.nn.functional as F


args = lambda: None

from transformers import (
    AutoModelForSequenceClassification,
    AutoModel,
    Trainer,
    TrainingArguments,
    AutoTokenizer,
    default_data_collator,
)

from datasets import load_dataset
import sys
import os

raw_train_dataset = load_dataset("json", data_files='train_text.json')["train"]
raw_test_dataset = load_dataset("json", data_files='test_text.json')["train"]

args.model_id = 'facebook/bart-large-mnli'

tokenizer = AutoTokenizer.from_pretrained(args.model_id)
model = AutoModel.from_pretrained(args.model_id)


# preprocess function, tokenizes text
def preprocess_function(examples):
    return tokenizer(examples["inputs"], padding = 'max_length', max_length=52, truncation=True)

# preprocess dataset
train_dataset = raw_train_dataset.map(
    preprocess_function,
    batched=True,
)
test_dataset = raw_test_dataset.map(
    preprocess_function,
    batched=True,
)


# define training args
training_args = TrainingArguments(
    output_dir='/opt/ml/model',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    warmup_steps=500,
    fp16=True,
    do_eval = False,
    save_strategy="epoch",
    logging_dir="s3://tmp232/sourcedir.tar.gz/logs",
    learning_rate=float(3e-05)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=default_data_collator
)

# train model
trainer.train()

@kjackson please take a look at the documentation and video I shared.
You are trying to fine-tune the model inside the Notebook Instance and not using the DLC/Integration we have built together with AWS.

If you aren’t that familiar with how to fine-tune Hugging Face Transformers for the specific tasks you should definitely check out our course at Transformer models - Hugging Face Course

Thanks, good point. I was following along the course when I ran into a memory error when running on sagemaker and was trying to debug locally, but then I got sidetracked with other errors. I’ve included the full code that uses the DLC etc.

It gives me the following error:

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 15.78 GiB total capacity; 14.68 GiB already allocated; 130.75 MiB free; 14.74 GiB reserved in total by PyTorch)

Below is my training script, as well as the notebook that sends the job to sagemaker. I was hoping to jump straight in and use the API, but I may have put the cart before the horse :stuck_out_tongue: I’ll continue to debug this on my own and will review all the course material. In the meantime, if you see something obvious that is causing this issue please do let me know. I was told at first that it had to do with batch sizes, but I tried several settings without any luck.

I think my main issue is related to the loss, but I’m not sure whether that ties in to the memory issue. Are there any examples of a train script for a feature extraction/embedding model I can look at? Or is there a specific part of the course that deals with this?

From the evaluation section of the fine tuning section:

The function must take an EvalPrediction object (which is a named tuple with a predictions field and a label_ids field)

So how do I handle this for an unsupervised learning task? This would affect other TrainingArguments like evaluation_strategy, load_best_model_at_end, metric_for_best_model, etc. It’s difficult to figure out the right arguments when I’m also dealing with the CUDA memory error, so any pointers are much appreciated. The examples seem mainly geared toward supervised learning, and I was hoping for an error message that would indicate what needs to be changed about the loss (since my train.py compute_loss uses labels when there are none). The memory error seems to indicate that there is no GPU memory left, yet checking the instance metrics it looks like the GPU is not used:

[image: instance metrics]

Not sure why that would be, as num_gpus is set to 1 in SM_TRAINING_ENV.

Again will continue on with the documentation and course material until I get this sorted out, but any help is appreciated :pray:

train.py

from transformers import (
    AutoModel,
    Trainer,
    TrainingArguments,
    AutoTokenizer,
    default_data_collator,
)

from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_dataset
import random
import logging
import sys
import argparse
import os
import torch

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument("--model_id", type=str)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--train_file", type=str, default="train_text.json")
    parser.add_argument("--test_file", type=str, default="test_text.json")
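    # type=bool would turn any non-empty string, including "False", into True
    parser.add_argument("--fp16", type=lambda s: str(s).lower() in ("true", "1"), default=True)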

    # Data, model, and output directories
    parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

    args, _ = parser.parse_known_args()

    # Set up logging
    logger = logging.getLogger(__name__)

    logging.basicConfig(
        level=logging.getLevelName("INFO"),
        handlers=[logging.StreamHandler(sys.stdout)],
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    
    print('\nWalk:')
    for path, subdirs, files in os.walk('/opt/ml'): 
        for name in files: print(os.path.join(path, name))
          

    # load datasets
    raw_train_dataset = load_dataset("json", data_files=os.path.join(args.training_dir, args.train_file))["train"]
    raw_test_dataset = load_dataset("json", data_files=os.path.join(args.test_dir, args.test_file))["train"]

    # load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # preprocess function, tokenizes text
    def preprocess_function(examples):
        return tokenizer(examples["inputs"], padding="max_length", truncation=True)

    # preprocess dataset
    train_dataset = raw_train_dataset.map(
        preprocess_function,
        batched=True,
    )
    test_dataset = raw_test_dataset.map(
        preprocess_function,
        batched=True,
    )


    # print size
    logger.info(f" loaded train_dataset length is: {len(train_dataset)}")
    logger.info(f" loaded test_dataset length is: {len(test_dataset)}")

    # compute metrics function for binary classification
    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="micro")
        acc = accuracy_score(labels, preds)
        return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

    print('\nargs.model_id', args.model_id)
    # download model from model hub
    model = AutoModel.from_pretrained(args.model_id)

    # define training args
    training_args = TrainingArguments(
        output_dir=args.model_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.eval_batch_size,
        warmup_steps=args.warmup_steps,
        fp16=args.fp16,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_dir=f"{args.output_data_dir}/logs",
        learning_rate=float(args.learning_rate),
        load_best_model_at_end=True,
        metric_for_best_model="f1",
    )

    # create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

    # train model
    trainer.train()

    # evaluate model
    eval_result = trainer.evaluate(eval_dataset=test_dataset)

    # writes eval result to file which can be accessed later in s3 output
    with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer:
        print(f"***** Eval results *****")
        for key, value in sorted(eval_result.items()):
            writer.write(f"{key} = {value}\n")


    # Saves the model to s3
    trainer.save_model(args.model_dir)
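
Because hyperparameters reach the training script as command-line strings, boolean flags need some care: argparse with `type=bool` calls `bool()` on the raw string, and any non-empty string is truthy. A minimal sketch of the difference, where `str2bool` is a hypothetical helper and not part of argparse:

```python
import argparse


def str2bool(value):
    """Interpret common textual spellings of a boolean flag."""
    return str(value).lower() in ("true", "1", "yes")


# Parser that relies on bool(): "False" is a non-empty string, so it is truthy.
naive = argparse.ArgumentParser()
naive.add_argument("--fp16", type=bool, default=True)

# Parser that converts the string explicitly.
safer = argparse.ArgumentParser()
safer.add_argument("--fp16", type=str2bool, default=True)

naive_args = naive.parse_args(["--fp16", "False"])
safer_args = safer.parse_args(["--fp16", "False"])
```

With the first parser, passing `False` still leaves mixed precision switched on; the second parser turns it off as intended.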

Execution notebook

!pip install "sagemaker>=2.48.0"


import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

# Download some training data

!wget https://github.com/saurabh3949/Text-Classification-Datasets/raw/master/dbpedia_csv.tar.gz
!tar -xzvf dbpedia_csv.tar.gz

import pandas as pd
import json
import os

# Write small train and test files

df = pd.read_csv('dbpedia_csv/train.csv', header = None)

# write as small train input file
with open('train_text.json', 'w') as outfile:
    for desc in df.iloc[:10000, 2]:
        json.dump({"inputs": desc}, outfile)
        outfile.write('\n')
        
with open('test_text.json', 'w') as outfile:
    for desc in df.iloc[10000:15000, 2]:
        json.dump({"inputs": desc}, outfile)
        outfile.write('\n')

from sagemaker.s3 import S3Uploader

s3_prefix = 'batch-data'

training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'


# upload datasets
train_remote = S3Uploader.upload('train_text.json',training_input_path)
test_remote = S3Uploader.upload('test_text.json',test_input_path)

print(f"train dataset uploaded to: \n{train_remote}\n{test_remote}")


from sagemaker.huggingface import HuggingFace
import time

mid = 'facebook/bart-large-mnli'



# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,                          # number of training epochs
                 'train_batch_size': 32,               # batch size for training
                 'eval_batch_size': 64,                # batch size for evaluation
                 'learning_rate': 3e-5,                # learning rate used during training
                 'model_id':mid, # pre-trained model
                 'fp16': True,                         # Whether to use 16-bit (mixed) precision training
                 'train_file': 'train_text.json',    # training dataset
                 'test_file': 'test_text.json',      # test dataset
                 }


metric_definitions=[
    {'Name': 'eval_loss',               'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy',           'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1',                 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision',          'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"}]


# define Training Job Name 
job_name = f'hf--{mid.replace("/", "-")}--{time.strftime("%H-%M-%S", time.localtime())}'

instance = 'ml.p3.2xlarge'


# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',        # fine-tuning script used in the training job
    source_dir           = 'scripts',         # directory where the fine-tuning script is stored
    instance_type        = instance,          # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version = '4.6',             # the transformers version used in the training job
    pytorch_version      = '1.7',             # the pytorch_version version used in the training job
    py_version           = 'py36',            # the python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameter used for running the training job
    metric_definitions   = metric_definitions # the metrics regex definitions to extract logs
)

# define a data input dictionary with our uploaded s3 uris
training_data = {
    'train': train_remote,
    'test': test_remote
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(training_data, wait=False)
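
The `metric_definitions` entries are regular expressions that SageMaker applies to the training job's log output to extract metric values. A small check of one of those patterns against a made-up log line (the log line is hypothetical, shaped like the dictionary the Trainer prints during evaluation):

```python
import re

# Same pattern as the 'eval_loss' entry above, written as a raw string.
pattern = r"'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"

# Hypothetical log line for illustration only.
log_line = "{'eval_loss': 0.3456, 'eval_runtime': 12.5, 'epoch': 1.0}"

match = re.search(pattern, log_line)
eval_loss = float(match.group(1)) if match else None

# A line that never mentions the metric yields no match, so no value is recorded.
no_metric = re.search(pattern, "{'loss': 1.2, 'epoch': 0.5}")
```

If a metric such as `eval_f1` is never printed by the training script, its pattern simply never matches and the metric stays empty. Note also that the unescaped `.` matches any character, so the pattern is looser than it looks.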

I went through the documentation/course/videos in more detail and ended up making some updates to train.py. I believe the memory error was possibly due to padding to the maximum length and using the default data collator instead of one with padding. I fixed that and removed all the training args so it is not trying to run evaluation. It looks like I’ve come full circle now, as I’m getting the same ‘loss’ key error in SageMaker as I’m getting locally.
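
To see why the padding strategy matters for memory, it helps to count how many token positions a batch occupies. The sketch below compares padding every example to a fixed maximum with padding each batch to its longest example, which is what `DataCollatorWithPadding` in transformers does. The token counts are hypothetical, and the value 1024 is an assumed model maximum length used only for illustration.

```python
def padded_token_count(lengths, batch_size, pad_to=None):
    """Total token positions after padding, summed over all batches.

    pad_to=None pads each batch to its own longest example (dynamic padding);
    otherwise every example is padded to the fixed width pad_to.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = pad_to if pad_to is not None else max(batch)
        total += width * len(batch)
    return total


# Hypothetical token counts for eight short texts.
lengths = [12, 40, 25, 52, 18, 33, 47, 9]

fixed = padded_token_count(lengths, batch_size=4, pad_to=1024)
dynamic = padded_token_count(lengths, batch_size=4)
ratio = fixed / dynamic
```

For short texts like these, fixed-length padding makes the batches many times larger than they need to be, and activation memory grows with that size.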

See below for the full training script and notebook execution code; the error is the same one I mentioned at the beginning of this thread.

Any idea how I can get around this? I did the full PyTorch implementation from here and got:

AttributeError: 'Seq2SeqModelOutput' object has no attribute 'loss'

So it appears it’s defaulting to the wrong type? I can share the code for that as well, but it’s essentially the same but I’m just using ‘facebook/bart-large-mnli’ and unlabelled data. I’m at a “loss” of what to try next. ha. ha.

train.py

from transformers import (
    AutoModel,
    Trainer,
    TrainingArguments,
    AutoTokenizer,
    AutoFeatureExtractor,
    default_data_collator,
)

from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_dataset
import random
import logging
import sys
import argparse
import os
import torch

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument("--model_id", type=str)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--train_file", type=str, default="train_text.json")
    parser.add_argument("--test_file", type=str, default="test_text.json")
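    # type=bool would turn any non-empty string, including "False", into True
    parser.add_argument("--fp16", type=lambda s: str(s).lower() in ("true", "1"), default=True)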

    # Data, model, and output directories
    parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

    args, _ = parser.parse_known_args()

    # Set up logging
    logger = logging.getLogger(__name__)

    logging.basicConfig(
        level=logging.getLevelName("INFO"),
        handlers=[logging.StreamHandler(sys.stdout)],
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    
    print('\nWalk:')
    for path, subdirs, files in os.walk('/opt/ml'): 
        for name in files: print(os.path.join(path, name))
          

    # load datasets
    raw_train_dataset = load_dataset("json", data_files=os.path.join(args.training_dir, args.train_file))["train"]
    raw_test_dataset = load_dataset("json", data_files=os.path.join(args.test_dir, args.test_file))["train"]

    # load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # preprocess function, tokenizes text
    def preprocess_function(examples):
        return tokenizer(examples["inputs"], truncation=True)

    # preprocess dataset
    train_dataset = raw_train_dataset.map(
        preprocess_function,
        batched=True,
    )
    test_dataset = raw_test_dataset.map(
        preprocess_function,
        batched=True,
    )


    # print size
    logger.info(f" loaded train_dataset length is: {len(train_dataset)}")
    logger.info(f" loaded test_dataset length is: {len(test_dataset)}")

    

    print('\nargs.model_id', args.model_id)
    # download model from model hub
    model = AutoModel.from_pretrained(args.model_id)

    # define training args
    training_args = TrainingArguments(
        output_dir=args.model_dir,
#         num_train_epochs=args.epochs,
#         per_device_train_batch_size=args.train_batch_size,
#         per_device_eval_batch_size=args.eval_batch_size,
#         warmup_steps=args.warmup_steps,
#         fp16=args.fp16,
#         evaluation_strategy="epoch",
#         save_strategy="epoch",
#         logging_dir=f"{args.output_data_dir}/logs",
#         learning_rate=float(args.learning_rate),
#         load_best_model_at_end=True,
#         metric_for_best_model="f1",
    )

    # create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
#         data_collator=default_data_collator,
    )

    # train model
    trainer.train()
    

    # evaluate model
    eval_result = trainer.evaluate(eval_dataset=test_dataset)

    # writes eval result to file which can be accessed later in s3 output
    with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer:
        print(f"***** Eval results *****")
        for key, value in sorted(eval_result.items()):
            writer.write(f"{key} = {value}\n")


    # Saves the model to s3
    trainer.save_model(args.model_dir)

notebook execution:

!pip install "sagemaker>=2.48.0"


import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

# Download some training data

!wget https://github.com/saurabh3949/Text-Classification-Datasets/raw/master/dbpedia_csv.tar.gz
!tar -xzvf dbpedia_csv.tar.gz

import pandas as pd
import json
import os

# Write small train and test files

df = pd.read_csv('dbpedia_csv/train.csv', header = None)

# write as small train input file
with open('train_text.json', 'w') as outfile:
    for desc in df.iloc[:10000, 2]:
        json.dump({"inputs": desc}, outfile)
        outfile.write('\n')
        
with open('test_text.json', 'w') as outfile:
    for desc in df.iloc[10000:15000, 2]:
        json.dump({"inputs": desc}, outfile)
        outfile.write('\n')

from sagemaker.s3 import S3Uploader

s3_prefix = 'batch-data'

training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'


# upload datasets
train_remote = S3Uploader.upload('train_text.json',training_input_path)
test_remote = S3Uploader.upload('test_text.json',test_input_path)

print(f"train dataset uploaded to: \n{train_remote}\n{test_remote}")


from sagemaker.huggingface import HuggingFace
import time

mid = 'facebook/bart-large-mnli'




# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,                          # number of training epochs
                 'train_batch_size': 32,               # batch size for training
                 'eval_batch_size': 64,                # batch size for evaluation
                 'learning_rate': 3e-5,                # learning rate used during training
                 'model_id':mid, # pre-trained model
                 'fp16': True,                         # Whether to use 16-bit (mixed) precision training
                 'train_file': 'train_text.json',    # training dataset
                 'test_file': 'test_text.json',      # test dataset
                 }


metric_definitions=[
    {'Name': 'eval_loss',               'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy',           'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1',                 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision',          'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"}]


# define Training Job Name 
job_name = f'hf--{mid.replace("/", "-")}--{time.strftime("%H-%M-%S", time.localtime())}'

instance = 'ml.p3.2xlarge'


# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',        # fine-tuning script used in the training job
    source_dir           = 'scripts',         # directory where the fine-tuning script is stored
    instance_type        = instance,          # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version = '4.6',             # the transformers version used in the training job
    pytorch_version      = '1.7',             # the pytorch_version version used in the training job
    py_version           = 'py36',            # the python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameter used for running the training job
    metric_definitions   = metric_definitions # the metrics regex definitions to extract logs
)

# define a data input dictionary with our uploaded s3 uris
training_data = {
    'train': train_remote,
    'test': test_remote
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(training_data, wait=True)

From your code I can see you are trying to fine-tune a Seq2Seq model, but you are using the Trainer and AutoModel classes to load it, which isn’t correct.
On which task do you want to fine-tune your model?

I am just trying to get embeddings, so the task I’m looking for is feature extraction, I believe. However, I looked at the Auto Classes and it looks like AutoFeatureExtractor is used for images? I get a not-found error when I try it, so I’m not sure that is the right approach either. In a nutshell, here is what I’m trying to do:

  1. Download a pretrained model for doing pooled embedding extractions
  2. Fine tune this model on my own corpus (transactional data)
  3. Run a batch embedding job on my data to extract the embedding features, that can then be used by a classifier down stream (which takes other inputs in addition to embeddings)
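
For step 1, "pooled" embeddings are typically obtained by averaging the per-token vectors while ignoring padding positions; the model card for sentence-transformers/all-mpnet-base-v2 describes this kind of mean pooling. Below is a minimal plain-Python sketch of the operation, with made-up vectors and a hypothetical function name; a real implementation would operate on tensors from the model's last hidden state.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention mask is 1.

    token_embeddings: list of per-token vectors for one sentence.
    attention_mask:   list of 0/1 flags, 0 marking padding tokens.
    """
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                sums[j] += value
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [s / count for s in sums]


# Two real tokens followed by one padding token (values are made up).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
```

The padding vector is excluded from the average, so the result depends only on the real tokens and has a fixed size regardless of sentence length.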

We are currently using SageMaker’s BlazingText in this manner, but I wanted to give HF a try since we can use it for many different tasks (NER etc.).

The reason I went with ‘facebook/bart-large-mnli’ is that I was able to successfully deploy it to an endpoint and get embeddings just like I want. In addition, it performed well on zero-shot tasks in my domain, and in the deploy example on the model card “feature-extraction” is listed as a task, so I took that to mean it could be trained as such. When I deploy to an endpoint I get to specify “feature-extraction” as the task and it just works. I’m not sure how I can do something similar during fine-tuning, or whether I need to look at a different model.

If I look under ‘feature-extraction’ as a task in the model hub, the most popular model is “sentence-transformers/distilbert-base-nli-mean-tokens”. However, this model gives uniform probabilities for zero-shot (i.e. all predictions are ~ 1/num_classes), and it also gives me the same loss error when I try to fine-tune it, so that’s why I didn’t use it. See below for the zero-shot output for “sentence-transformers/distilbert-base-nli-mean-tokens”; if I use ‘facebook/bart-large-mnli’ instead I get the expected, more reasonable probabilities.

@philschmid I think I was able to figure out the solution for this (I’ve been going through the O’Reilly book and part 2 of the course). I’ll post it here when I have it, but in the meantime you can disregard this thread. Thanks for the help so far :+1: