Finetuning GPT2 model on AWS Trainium

Hey guys,

I’m working on an AWS Trainium trn1.2xlarge instance, set up by following the instructions here. I have a finetuning script working for encoder architectures such as BERT, with generally good results. However, when I run the same script with a decoder model such as gpt2, I run into problems. Here is the script I’m using for finetuning on Trainium:

from optimum.neuron import NeuronTrainer, NeuronTrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import evaluate
import numpy as np
from datasets import load_dataset

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def tokenize_function(examples):
    return tokenizer(examples["feature1"], examples["feature2"], padding="max_length")

model_name = "bert-base-uncased"
# model_name = "gpt2"

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.config.pad_token_id = tokenizer.pad_token_id

data_files = {'train': 'small_training_dataset.csv'}
data = load_dataset('csv', data_files=data_files)
tokenized_data =

training_args = NeuronTrainingArguments(

metric = evaluate.load("../")

trainer = NeuronTrainer(
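
One detail in the script above that may be worth ruling out for gpt2: it adds a new '[PAD]' token and sets model.config.pad_token_id, but it never resizes the embedding matrix, so the new token id points one row past the original vocabulary. Whether this is related to the failure below is not confirmed. A minimal sketch of two common ways to handle it, assuming the standard transformers tokenizer and model methods; the helper name ensure_pad_token and the reuse_eos flag are hypothetical:

```python
def ensure_pad_token(tokenizer, model, reuse_eos=True):
    """Give the tokenizer a pad token and keep the model consistent with it.

    Hypothetical helper: expects a transformers-style tokenizer and model.
    """
    if tokenizer.pad_token is None:
        if reuse_eos and tokenizer.eos_token is not None:
            # Reusing EOS as padding leaves the vocabulary size unchanged,
            # so the embedding matrix does not need to grow.
            tokenizer.pad_token = tokenizer.eos_token
        else:
            # A brand-new token gets an id past the end of the original
            # embedding matrix, so the matrix has to be resized to match.
            tokenizer.add_special_tokens({"pad_token": "[PAD]"})
            model.resize_token_embeddings(len(tokenizer))
    # Sequence-classification heads use pad_token_id to find the last real token.
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```

In the script this would replace the three pad-token lines, e.g. tokenizer, model = ensure_pad_token(tokenizer, model). For bert-base-uncased it only copies the existing pad token id into the config, since that tokenizer already defines a pad token.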


When model_name is “bert-base-uncased”, this script works just as expected. When model_name is “gpt2”, two things happen:

  1. It takes far longer than with bert (over an hour for gpt2 vs. three minutes for bert).
  2. After over an hour, the script errors out with a difficult-to-debug exception, the first few lines of which I’ve pasted below:
2023-10-24 11:11:10.000839: INFO ||NCC_WRAPPER||: Compilation failed for /tmp/neuroncc_compile_workdir/c67839cd-ef96-40ad-9d14-586a520818db/model.MODULE_119325758791722276+d41d8cd9.hlo.pb after 0 retries.
2023-10-24 11:11:10.000901: INFO ||NCC_WRAPPER||: Compilation failed after reaching max retries.
2023-10-24 11:11:12.420754: E tensorflow/libtpu/neuron/ NEURONPOC: Unable to delete temp file /tmp/MODULE_1_SyncTensorsGraph.33335_119325758791722276_ip-10-0-0-135-53510ed2-647249-60873a6917
2023-10-24 11:11:12.432499: E tensorflow/libtpu/neuron/neuron_compiler.c:371] NEURONPOC: Could not read NEFF from MODULE_1_SyncTensorsGraph.33335_119325758791722276_ip-10-0-0-135-53510ed2-647249-60873a691734b.neff
Status: NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.33335_119325758791722276_ip-10-0-0-135-53510ed2-647249-60873a691734b.neff; No such file or directory
2023-10-24 11:11:33.553516: W tensorflow/core/framework/op_kernel.c:1830] OP_REQUIRES failed at 266 : NOT_FOUND: /tmp/MODULE_1_SyncTensorsGraph.33335_119325758791722276_ip-10-0-0-135-53510ed2-647249-60873a691734b.neff; No such file or directory
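
Read in order, the log suggests the missing .neff is a follow-on symptom: the compiler wrapper reports that compilation failed first, and only then does the runtime report that it cannot read the NEFF that compilation should have produced. That reading is an inference from the lines above, not something the log states outright. As a starting point for debugging, here is a small sketch that pulls the failed HLO module paths and the missing NEFF paths out of a saved log; the function name and the two regular expressions are assumptions based only on the message wording shown here:

```python
import re

# Captures the HLO module path from the compiler-wrapper failure line.
COMPILE_FAILED = re.compile(r"Compilation failed for (\S+\.hlo\.pb)")
# Captures any .neff path that is reported as NOT_FOUND.
MISSING_NEFF = re.compile(r"NOT_FOUND: (\S+\.neff)")


def summarize_neuron_log(log_text):
    """Return the unique failed HLO modules and missing NEFF files in a log."""
    failed = sorted(set(COMPILE_FAILED.findall(log_text)))
    missing = sorted(set(MISSING_NEFF.findall(log_text)))
    return {"failed_hlo_modules": failed, "missing_neff_files": missing}
```

Calling summarize_neuron_log(open("train.log").read()) on a captured log (the file name is hypothetical) gives the list of modules whose compilation failed, which are the ones worth inspecting in the compiler work directory.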

Any ideas for resources to check out to improve my understanding or suggestions for where to start debugging would be most welcome! I’m looking to essentially answer two questions here: Why is training with bert substantially faster than training with gpt2 on the same hardware? And, what is the cause of the missing .neff file when training with gpt2?
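
One possible factor in the speed gap, offered as an assumption rather than a confirmed cause: tokenize_function uses padding="max_length" without an explicit max_length, so each tokenizer pads to its own model_max_length, which is 512 for bert-base-uncased and 1024 for gpt2. Every gpt2 example is then twice as long, and the self-attention score matrix grows with the square of the sequence length. A minimal sketch of that arithmetic; the helper name is hypothetical:

```python
def relative_attention_cost(short_len, long_len):
    """Ratio of self-attention score-matrix sizes, which grow as length squared."""
    return (long_len / short_len) ** 2


BERT_MAX_LEN = 512   # model_max_length of the bert-base-uncased tokenizer
GPT2_MAX_LEN = 1024  # model_max_length of the gpt2 tokenizer

print(relative_attention_cost(BERT_MAX_LEN, GPT2_MAX_LEN))  # 4.0
```

A factor of four in attention cost does not by itself account for three minutes versus over an hour, so compilation time is probably the larger part. Passing the same explicit max_length together with truncation=True to the tokenizer for both models would at least make the two runs comparable.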

I can also post the full stack trace of the error that I run into with the gpt2 model if it would be helpful.