Finetuning a GPT2 model on AWS Trainium

Hey guys,

I’m working on an AWS Trainium trn1.2xlarge instance, set up by following the instructions here. I have been able to get a finetuning script working for encoder architectures such as BERT, with generally good results. However, when I run the same script with a decoder-architecture model such as gpt2, I run into problems. Here is the script I’m using to do the finetuning on Trainium:

from optimum.neuron import NeuronTrainer, NeuronTrainingArguments
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer
)
import evaluate
import numpy as np
from datasets import load_dataset

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

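# pad every example to the tokenizer's model_max_length
# (512 for bert-base-uncased, 1024 for gpt2)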
def tokenize_function(examples):
    return tokenizer(examples["feature1"], examples["feature2"], padding="max_length")

model_name = "bert-base-uncased"
# model_name = "gpt2"

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)
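# gpt2's tokenizer has no pad token by default, so add one and resize the embeddings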
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
    model.config.pad_token_id = tokenizer.pad_token_id

data_files = {'train': 'small_training_dataset.csv'}
data = load_dataset('csv', data_files=data_files)
tokenized_data = data.map(tokenize_function)

training_args = NeuronTrainingArguments(
        output_dir="trainer_args",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        push_to_hub=False,
        save_strategy="no"
    )

metric = evaluate.load("../accuracy.py")

trainer = NeuronTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

trainer.train()

When model_name is “bert-base-uncased”, this script works just as expected. When model_name is “gpt2”, two things happen:

  1. it takes much longer than bert (over an hour for gpt2 vs. three minutes for bert)
  2. after over an hour the script errors out with a difficult-to-debug exception, the first few lines of which I’ve pasted below:
2023-10-24 11:11:10.000839: INFO ||NCC_WRAPPER||: Compilation failed for /tmp/neuroncc_compile_workdir/c67839cd-ef96-40ad-9d14-586a520818db/model.MODULE_119325758791722276+d41d8cd9.hlo.pb after 0 retries.
2023-10-24 11:11:10.000901: INFO ||NCC_WRAPPER||: Compilation failed after reaching max retries.
2023-10-24 11:11:12.420754: E tensorflow/libtpu/neuron/neuron_compiler.cc:216] NEURONPOC: Unable to delete temp file /tmp/MODULE_1_SyncTensorsGraph.33335_119325758791722276_ip-10-0-0-135-53510ed2-647249-60873a691734b.neff
2023-10-24 11:11:12.432499: E tensorflow/libtpu/neuron/neuron_compiler.cc:371] NEURONPOC: Could not read NEFF from MODULE_1_SyncTensorsGraph.33335_119325758791722276_ip-10-0-0-135-53510ed2-647249-60873a691734b.neff
    Status : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.33335_119325758791722276_ip-10-0-0-135-53510ed2-647249-60873a691734b.neff; No such file or directory
2023-10-24 11:11:33.553516: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.33335_119325758791722276_ip-10-0-0-135-53510ed2-647249-60873a691734b.neff; No such file or directory

Any pointers to resources that would improve my understanding, or suggestions for where to start debugging, would be most welcome! I’m essentially looking to answer two questions: Why is training with bert substantially faster than training with gpt2 on the same hardware? And what is causing the missing .neff file when training with gpt2?
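One difference I’ve noticed, though I haven’t confirmed it matters: padding="max_length" without an explicit max_length pads to the tokenizer’s model_max_length, which is 512 for bert-base-uncased but 1024 for gpt2, so the graphs being compiled for gpt2 are larger. If that turns out to be related, a capped version of my tokenize function might look like this (the 512 below is just a value I picked to match bert, not something I’ve tested):

def tokenize_function(examples):
    # cap the sequence length explicitly instead of padding to gpt2's
    # full 1024-token context (512 is an arbitrary choice to match bert)
    return tokenizer(
        examples["feature1"],
        examples["feature2"],
        padding="max_length",
        truncation=True,
        max_length=512,
    )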

I can also post the full stack trace of the error I run into with the gpt2 model if it would be helpful.