I’m trying to follow the Language modeling tutorial with my own dataset. I have already made a few adjustments (including downgrading Python to 3.9.12) to get it running as far as it does now.
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, pipeline, DataCollatorForLanguageModeling
import torch
from datasets import load_dataset, Dataset
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
dataset = load_dataset('text', data_files='list.txt')
dataset = dataset['train'].train_test_split(0.1)
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True)
tokenized_lines = dataset.map(preprocess_function, batched=True, num_proc=4)
block_size = 128
def group_texts(examples):
    keys = list(examples.features.keys())
    total_length = len(examples[keys[0]])
    result = {
        k: [examples[k][i : i + block_size] for i in range(0, total_length, block_size)]
        for k in keys
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_dataset = tokenized_lines.map(group_texts, batched=True, num_proc=4)
I get the error
...
769 return self._value
770 else:
--> 771 raise self._value
IndexError: list index out of range
I cannot figure out what I’m doing wrong.
The list.txt file contains one sentence per line.
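For comparison, here is a minimal sketch of the concatenate-then-chunk logic that the grouping step is meant to perform, written against plain Python lists so it can be checked without the libraries. It assumes the batch is a dict-like object mapping each column name to a list of per-line token lists, and that only token columns (such as input_ids and attention_mask) are present. The block_size parameter and the tiny batch in the usage example are illustrative, not taken from my actual data.

```python
from itertools import chain

BLOCK_SIZE = 128  # same value as block_size in the code above


def group_texts(examples, block_size=BLOCK_SIZE):
    """Concatenate every column of a batch, then cut it into fixed-size blocks."""
    # Assumption: `examples` maps column name -> list of per-line token lists.
    concatenated = {k: list(chain.from_iterable(examples[k])) for k in examples.keys()}
    first_column = next(iter(concatenated))
    total_length = len(concatenated[first_column])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }
    # Labels are a copy of the inputs for causal language modeling.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


# Illustrative batch: three short "lines", grouped into blocks of 4 tokens.
batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
grouped = group_texts(batch, block_size=4)
print(grouped["input_ids"])
```

If this is used with dataset.map, the original text column would probably need to be dropped in the tokenization step (for example through the remove_columns argument of map), since the sketch treats every column as a list of token lists.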