[Help appreciated] GPT2 Finetuning results in Only Padding output


I am currently trying to fine-tune the gpt2-medium model on a specific lecture book.
The goal is to ask questions about the contents of this book and get answers from gpt2.

For now I would be happy if anything slightly related to the book, not even necessarily correct, were returned.

I used the Colab notebook Easy GPT2 fine-tuning with Hugging Face and PyTorch as the base case with slightly modified code, but I also ran that code as-is on my input, and the result is always the same.

The Problem

After 5 epochs, the final model only returns the input text followed by padding tokens up to the maximum sequence length; nothing is actually generated.
This usually means the input is wrong in some way, not correctly tokenized, or the training is insufficient.
The loss starts at around 1.09 and does not decrease from epoch to epoch, which is a bad sign.
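
To rule out the tokenization side, here is roughly how a single book line can be inspected (a minimal sanity check, assuming the same tokenizer setup as in the code further below; the values are only for illustration):

# minimal sanity check of the tokenization, assuming the same tokenizer setup as in the training code
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2", pad_token='<|pad|>', eos_token='<|endoftext|>')
line = "1.2  Life Situations" + tokenizer.eos_token
enc = tokenizer(line, truncation=True, max_length=768, padding="max_length")
# the first ids should be real tokens, the rest the pad id; the mask should be 1 for real tokens and 0 for padding
print(enc["input_ids"][:10])
print(enc["attention_mask"][:10])
print(tokenizer.decode(enc["input_ids"][:10]))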

Input Text Example

1.2  Life Situations
As mentioned before, health care starts when people are born and ends when people pass away. Sometimes, the relative share of health care in our lives is small, sometimes it becomes higher. This section gives an overview on some typical life situations. 
Health care organization and health-related processes can vary from country to country, however, these life situations seem ubiquitous. We focus on life situations related to health care; insofar this view is limited, as life is much more; but it is useful for our topic: health information systems.

This is a small text example from the book, three lines in total. The whole book is 2777 lines long.

Code with Comments

Below you can find the full code that fine-tunes the model and saves the state_dict.
I cannot find the error and would appreciate a couple more eyes to help me out. What am I doing wrong?

import os
import time
import datetime
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_linear_schedule_with_warmup
import numpy as np
from torch.utils.data import Dataset, DataLoader

print(f"Running from working directory {os.getcwd()}")

# this defines the input file (the book text file, without the .txt extension)
INPUT_FILE = "book"  # placeholder name, the actual file name is not important here
# base model identifier passed to Hugging Face
GPT_MODEL = "gpt2"

# as I have read, large batch sizes lead to memory problems
BATCH_SIZE = 2  # placeholder value, kept small to stay within memory
# as loss does not decrease, I don't see the point in more epochs
EPOCHS = 5
# standard learning rate, behaviour does not change with 5e-4, 3e-5
LEARNING_RATE = 5e-5  # assumed "standard" value
WARMUP_STEPS = 100  # placeholder value
# 768 tokens is used as the maximum sequence length here (gpt2's actual context window is 1024)
# each line will be transformed into a tensor and extended with padding tokens
# until this maximum sequence length is reached
MAX_SEQ_LEN = 768

# seed for torch to get same results with same settings
torch.manual_seed(42)  # arbitrary seed value
# output between training batches to see progress
SAMPLE_EVERY = 100  # placeholder value

# Device is set to CPU locally and to CUDA if available
DEVICE = "cpu"
if torch.cuda.is_available():
    DEVICE = "cuda"
    print("Using CUDA Device")
else:
    print("Using CPU Device")

print("loading models...")
# the default code also defines "bos_token", adding this does not change the result
tokenizer = GPT2Tokenizer.from_pretrained(GPT_MODEL, pad_token='<|pad|>', eos_token='<|endoftext|>')
# gpt2 / gpt2-medium model (depending on GPT_MODEL) with a language modelling head
model: GPT2LMHeadModel = GPT2LMHeadModel.from_pretrained(GPT_MODEL).to(DEVICE)
# since we added a pad token, we need to resize the model's token embeddings
model.resize_token_embeddings(len(tokenizer))


class GPT2DataSet(Dataset):
    def __init__(self, book_dataset_path = 'books/'):
        # read in the book
        book_path = os.path.join(book_dataset_path, INPUT_FILE+".txt")
        self.input_ids = []
        self.attn_masks = []
        self.end_of_sentence_token = tokenizer.eos_token
        with open(book_path, encoding="UTF-8") as reader:
            lines = reader.readlines()
            # for each line create a tensor
            for line in lines:
                # if that line is not empty
                if len(line.strip()) > 0:
                    # append the eos token to the line
                    line_str = f"{line.strip()}{self.end_of_sentence_token}"
                    # this will create input ids and attention mask, filled up until MAX_SEQ_LEN with padding tokens
                    encodings_dict = tokenizer(line_str, truncation=True, max_length = MAX_SEQ_LEN, padding="max_length")
                    # create pytorch tensors from the encodings and store them
                    self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
                    self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)
    def __getitem__(self, index):
        return self.input_ids[index], self.attn_masks[index]

def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

print("Reading in data...")
dataset = GPT2DataSet()
text_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)

# total number of steps to perform
total_steps = len(text_loader)*EPOCHS

# using the huggingface AdamW optimizer does not change the results
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps= total_steps)

# for total training time
total_t0 = time.time()
total_train_loss = 0
# output folder for the model
models_folder = "models"

# start training model
for epoch in range(EPOCHS):
    print(f'======== Epoch {epoch+1} / {EPOCHS} ========')
    # start time of each epoch
    t0 = time.time()
    # reset the loss accumulator for this epoch and put the model into training mode
    total_train_loss = 0
    model.train()
    # for each batch
    for step, batch in enumerate(text_loader):
        # __getitem__() returns input_ids and attention_mask
        # set labels to input_ids and move everything to the training device
        input_tensor = batch[0].to(DEVICE)
        labels = batch[0].to(DEVICE)
        masks = batch[1].to(DEVICE)
        # reset gradients before each step
        optimizer.zero_grad()
        # forward pass through the model
        outputs = model(input_tensor, labels=labels, attention_mask=masks, token_type_ids=None)
        # get the loss of the model
        loss = outputs[0]
        batch_loss = loss.item()
        total_train_loss += batch_loss
        # propagate backwards, step optimizer and scheduler
        loss.backward()
        optimizer.step()
        scheduler.step()
        # output between training for progress tracking
        if step % SAMPLE_EVERY == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print(f'  Batch {step:>5,}  of  {len(text_loader):>5,}. Loss: {batch_loss:>5,}.   Elapsed: {elapsed}.')
    # calculate avg training loss per step
    avg_train_loss = total_train_loss / len(text_loader)
    training_time = format_time(time.time() - t0)
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print(f"  Training epoch took: {training_time}")

# training finished, save model
print("Training complete!")
print(f"Total training took {format_time(time.time()-total_t0)} (h:mm:ss)")      
os.makedirs(models_folder, exist_ok=True)
torch.save(model.state_dict(), os.path.join(models_folder, f"{GPT_MODEL}_{INPUT_FILE}.pt"))
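
The generation step is not shown above. As a sketch of that side (continuing from the variables defined in the script, with a placeholder prompt and sampling parameters, not my exact code), the saved state_dict would be reloaded and queried roughly like this:

# sketch of the generation side, reusing tokenizer, GPT_MODEL, models_folder, INPUT_FILE and DEVICE from above
model = GPT2LMHeadModel.from_pretrained(GPT_MODEL)
# the embeddings must be resized before loading, since the saved state_dict contains the extra pad token
model.resize_token_embeddings(len(tokenizer))
model.load_state_dict(torch.load(os.path.join(models_folder, f"{GPT_MODEL}_{INPUT_FILE}.pt"), map_location=DEVICE))
model.to(DEVICE)
model.eval()

prompt = "What are typical life situations in health care?"  # placeholder question
inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
# passing the attention mask and pad_token_id explicitly matters once a pad token has been added
output_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    pad_token_id=tokenizer.pad_token_id,
    max_length=200,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))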

Hi @Scorix ,
I have just started working with the GPT-2 model for a content removal use case.
I don't have enough knowledge to help you, YET.

But could you let me know how much time your model took to train on your knowledge base?

Thanks in advance!

It took about 5-10 minutes on a T100 GPU.
Fine-tuning the same data with the Hugging Face Trainer API produces an actually usable model, so I am pretty sure this code has an error somewhere, maybe in the way I feed the dataset into the model (or transform the text into the dataset).
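
For reference, the Trainer-based run looks roughly like this (a minimal sketch, not my exact script; TextDataset and DataCollatorForLanguageModeling are one common way to fine-tune a causal LM on a plain text file, and the paths and hyperparameters are just examples):

# rough sketch of a Trainer-based fine-tuning run on the book text
# (file path, output directory and hyperparameters are examples, not the exact values I used)
from transformers import (GPT2Tokenizer, GPT2LMHeadModel, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# chunk the raw text file into fixed-size blocks and collate them for causal language modelling
train_dataset = TextDataset(tokenizer=tokenizer, file_path="books/book.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="models/trainer-gpt2",
    num_train_epochs=5,
    per_device_train_batch_size=2,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("models/trainer-gpt2")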