The model and tokenizer are both GPT-2. Feel free to ask for more specific details about my code or Transformers setup.
Some parts were taken from the documentation and modified to my liking.
I am making a comment generator based on a list of comments from my files. However, whenever I try to train the model (GPT-2) on my data, it returns this error:
Please note that I’m new to this kind of thing, so the issue is probably clearer to those of you with more experience. Here’s my code, sorry if it’s messy:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset
import evaluate, numpy, json

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2", is_split_into_words=True)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

listOfJSONS = []
with open("aCommentsList.txt", "r") as cmt:
    split = cmt.read().split("\n")
    current = {"text": []}
    stop = 2000
    for i, data in enumerate(split):
        if i == stop:
            stop += 2000
            open(f"input/json{stop / 2000}.json", "w").write(json.dumps(current))
            current = {"text": []}
            listOfJSONS.append(stop / 2000)
        if len(data.strip()) == 0: continue
        current["text"].append(data)

def tokenize(examples):
    if isinstance(examples["text"], list):
        examples["text"] = [str(text) for text in examples["text"]]
    else:
        examples["text"] = str(examples["text"])
    return tokenizer(examples["text"], padding="max_length", truncation=True, return_tensors="pt")

def metrics(_eval):
    logits, labels = _eval
    predictions = numpy.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

metric = evaluate.load("accuracy")
arguments = TrainingArguments(output_dir="AIOutput", eval_strategy="epoch")

def doSomeStuff():
    dataset = load_dataset("json", data_dir="input", split="train").train_test_split(train_size=1, test_size=1)
    name = ["name"] * len(dataset["train"])
    labels = ["label"] * len(dataset["train"])
    dataset["train"].add_column("name", name)
    dataset["train"].add_column("label", labels)
    tokenized = dataset.map(tokenize, batched=True)
    trainDataset = tokenized["train"].shuffle(seed=42).select(range(1))
    evalDataset = tokenized["test"].shuffle(seed=42).select(range(1))
    trainer = Trainer(
        model=model,
        args=arguments,
        train_dataset=trainDataset,
        eval_dataset=evalDataset,
        compute_metrics=metrics
    )
    trainer.train()

doSomeStuff()
As you can see, I tried to work around this issue by adding name and label columns to the dataset, but that only added fuel to the fire. How do I fix this error?
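For reference, here is a minimal sketch of the kind of labels a causal language model such as GPT-2 is normally trained against: token ids rather than strings, usually a copy of `input_ids` with the padded positions set to -100 so the loss skips them. The helper name `make_causal_lm_labels` and the example ids are illustrative assumptions, not part of Transformers, and whether missing labels are actually behind the error here is not confirmed.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def make_causal_lm_labels(input_ids, attention_mask):
    """Copy input_ids into labels, masking padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Made-up example: two real tokens followed by two padding tokens.
# 50257 stands in for the id of the added [PAD] token (an assumption).
example = {
    "input_ids": [464, 3290, 50257, 50257],
    "attention_mask": [1, 1, 0, 0],
}
example["labels"] = make_causal_lm_labels(
    example["input_ids"], example["attention_mask"]
)
print(example["labels"])  # [464, 3290, -100, -100]
```

In a real Trainer setup this is usually left to `DataCollatorForLanguageModeling(tokenizer, mlm=False)`, which builds labels of this shape for each batch.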
Thanks for your support, this issue has been bugging me for hours.
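For anyone reading the chunking part of the code, this is a minimal sketch of what that step appears intended to do: split the comment list into files of 2000 comments each under `input/`. It is not identical to the code above. It assumes blank lines should be dropped before counting, that file names should use whole numbers (`json1.json` rather than `json2.0.json`), and that the final partial chunk should be written too. The function name `write_chunks` is hypothetical.

```python
import json
from pathlib import Path


def write_chunks(lines, out_dir, chunk_size=2000):
    """Write non-empty lines to numbered JSON files, chunk_size lines per file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Drop blank lines first so every file except possibly the last
    # holds exactly chunk_size comments.
    comments = [line for line in lines if line.strip()]

    paths = []
    for start in range(0, len(comments), chunk_size):
        index = start // chunk_size + 1  # integer division gives json1.json, json2.json, ...
        path = out_dir / f"json{index}.json"
        with path.open("w", encoding="utf-8") as handle:
            json.dump({"text": comments[start:start + chunk_size]}, handle)
        paths.append(path)
    return paths


# Possible usage with the file from the post (not run here):
# with open("aCommentsList.txt", "r") as cmt:
#     write_chunks(cmt.read().split("\n"), "input")
```

Each file keeps the same `{"text": [...]}` layout that the original code writes.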