Fine-tune transformers for language model

I’m trying to follow the Language modeling tutorial with my own dataset. I have already made a few adjustments (including downgrading Python to 3.9.12) to get it running as far as it does right now.

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, pipeline, DataCollatorForLanguageModeling
import torch
from datasets import load_dataset, Dataset

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

dataset = load_dataset('text', data_files='list.txt')
dataset = dataset['train'].train_test_split(0.1)

def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True)

tokenized_lines = dataset.map(preprocess_function, batched=True, num_proc=4)

block_size = 128

def group_texts(examples):
    keys = list(examples.features.keys())
    total_length = len(examples[keys[0]])
    result = {
        k: [examples[k][i : i + block_size] for i in range(0, total_length, block_size)]
        for k in keys
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_lines.map(group_texts, batched=True, num_proc=4)

I get the following error:

...
    769     return self._value
    770 else:
--> 771     raise self._value
IndexError: list index out of range

I cannot figure out what I’m doing wrong.
The list.txt file contains one sentence per line.

Maybe trim off batches that are too large with:

if total_length >= block_size:
    total_length = (total_length // block_size) * block_size

Like this?

def group_texts(examples):
    keys = list(examples.features.keys())
    total_length = len(examples[keys[0]])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [examples[k][i : i + block_size] for i in range(0, total_length, block_size)]
        for k in keys
    }

    result["labels"] = result["input_ids"].copy()

    return result

Sadly, that doesn’t help. Also, the confusing part is that these direct calls

group_texts(tokenized_lines['train'])
group_texts(tokenized_lines['test'])

both work fine without error.
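
For comparison, here is a minimal sketch of the grouping step with each column concatenated first, so that the slices are taken over tokens rather than over rows. It assumes that with batched=True the function receives a plain dict mapping column names to lists (one token list per row), which has no .features attribute, unlike the Dataset objects passed in the direct calls. The toy_batch data is made up for illustration, and this is not confirmed to be the cause of the IndexError above.

```python
from itertools import chain

block_size = 128

def group_texts(examples):
    # Assumption: `examples` is a dict of column name -> list of token lists,
    # which is what a batched map is documented to pass in.
    concatenated = {k: list(chain.from_iterable(examples[k])) for k in examples.keys()}
    total_length = len(concatenated[next(iter(concatenated))])
    # Drop the tail that does not fill a complete block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Cut every column into consecutive blocks of block_size tokens.
    result = {
        k: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

# Hypothetical toy batch: two "sentences" of 100 and 200 tokens.
toy_batch = {
    "input_ids": [list(range(100)), list(range(100, 300))],
    "attention_mask": [[1] * 100, [1] * 200],
}
grouped = group_texts(toy_batch)
```

If this were used with map, the raw text column would presumably have to be dropped first (for example with the remove_columns argument of map in the tokenization step), because the function returns a different number of rows than it receives.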