Finetuning a model gives me an error

Hello, I'm trying to fine-tune jjzha/jobbert_skill_extraction with a very short CSV training set.

To create the CSV I used the tokenizer from the JobBERT model to write the tokens into the left column and the labels (as integers) into the right one (plus padding), so it looks exactly like this:

token,labels
[CLS],2
This,2
is,2
a,2
nice,2

[PAD],2
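
Roughly, I built the file with something like this (simplified sketch; in my real script the labels are not all 2, and the output file name here is just a placeholder):

import csv
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jjzha/jobbert_knowledge_extraction")
text = "This is a nice car"
tokens = tokenizer.tokenize(text)  # WordPiece tokens, e.g. ['This', 'is', 'a', 'nice', 'car']

# one row per token; label 2 ("O") is hard-coded here only for the sketch
rows = [("[CLS]", 2)] + [(tok, 2) for tok in tokens] + [("[PAD]", 2)]

with open("tokenized_text_single.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["token", "labels"])
    writer.writerows(rows)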

I got an error when I tried load_dataset with the "csv" argument.
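
What I tried looked roughly like this (reconstructed from memory, so the exact arguments might have been slightly different):

from datasets import load_dataset

# my first attempt, which raised an error for me:
dataset = load_dataset("csv", data_files="./datset/tokenized_text_single.csv")

So I went with: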

df = pd.read_csv(csv_file_path)
train_df = df.sample(frac=0.8, random_state=42)
train_dataset = Dataset.from_pandas(train_df)
train_dataset = train_dataset.map(tokenization, batched=True, batch_size=batch_size)
train_dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])

and

batch_size=8
training_args = TrainingArguments(
    "trained_model", 
    per_gpu_train_batch_size=batch_size, 
    per_gpu_eval_batch_size=batch_size,
    num_train_epochs=3,
    save_steps=100
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer),
    tokenizer=tokenizer
)

I'm getting a TypeError ("len() of a 0-d tensor") with this data collator; if I switch to DataCollatorWithPadding instead, it tells me that the input batch_size of 24 does not match the target batch_size of 8 (presumably the 8 rows times the 3 tokens in each row's input_ids give 24 positions, but there are only 8 labels).

What is wrong here? I would also like to understand a bit better what is going on: this method:

def tokenization(example):
    return tokenizer(example["token"])

creates a lot of tensors in the dataset: input_ids and all the other fields are always something like [0, 1, 1] (three numbers each), but the tensor for the labels field is only tensor(2) (a single number). Since I'm also getting a batch_size error (when I use the other collator), I wonder whether the problem is that labels is not being transformed into a tensor correctly?
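
For comparison, my understanding (which may be wrong) is that one token-classification example usually holds a whole sentence, with one label per token, roughly like this hand-written illustration:

# hand-written illustration of what I think a single training example should look like;
# the labels are aligned with the input_ids, and -100 seems to be the value used to
# ignore the special tokens in the loss
example = {
    "input_ids":      [101, 1188, 1110, 170, 3505, 102],   # [CLS] This is a nice [SEP]
    "attention_mask": [1,   1,    1,    1,   1,    1],
    "labels":         [-100, 2,   2,    2,   2,   -100],
}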

Here is my complete code:

import pandas as pd
from transformers import AutoTokenizer, AutoConfig, BertForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import Dataset, get_dataset_split_names
from torch.nn import CrossEntropyLoss

# load the tokenizer and the model
checkpoint = "jjzha/jobbert_knowledge_extraction"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = BertForTokenClassification.from_pretrained(checkpoint)
max_length = 128

# get the label map from the model config
config = AutoConfig.from_pretrained(checkpoint)
labels = config.id2label

#load the text that you want to use for training
text = "This is a nice car, did you learn Angular or Vue to finance that? I heared that Java is the new hotshot here. Whats the best framework to use within Java to earn money?"

# this turns the token strings into input ids; label ids: 0=B, 2=O
def tokenization(example):
    return tokenizer(example["token"]) #return_tensors="pt", padding="max_length", truncation=True

# load the dataset from the csv file
csv_file_path = "./datset/tokenized_text_single.csv"
df = pd.read_csv(csv_file_path)

# Split the DataFrame into training and validation DataFrames
train_df = df.sample(frac=0.8, random_state=42)
val_df = df.drop(train_df.index)

# Convert the DataFrames to Hugging Face Datasets
batch_size = 8
train_dataset = Dataset.from_pandas(train_df)

train_dataset = train_dataset.map(tokenization, batched=True, batch_size=batch_size)

train_dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])

for d in train_dataset:
    print("DATASET three", d)

val_dataset = Dataset.from_pandas(val_df)
val_dataset = val_dataset.map(tokenization, batched=True, batch_size=batch_size)
val_dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])

training_args = TrainingArguments(
    "trained_model", 
    per_gpu_train_batch_size=batch_size, 
    per_gpu_eval_batch_size=batch_size,
    num_train_epochs=3,
    save_steps=100
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer),
    tokenizer=tokenizer
)

# Train the model
trainer.train()

This is my dataset; is it in the correct format?

DatasetDict({
    train: Dataset({
        features: ['token', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 43
    })
    eval: Dataset({
        features: ['token', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 13
    })
})

{'token': 'This', 'labels': tensor(2), 'input_ids': tensor([ 101, 1188,  102]), 'attention_mask': tensor([1, 1, 1])}
{'token': 'is', 'labels': tensor(2), 'input_ids': tensor([ 101, 1110,  102]), 'attention_mask': tensor([1, 1, 1])}
{'token': 'a', 'labels': tensor(2), 'input_ids': tensor([101, 170, 102]), 'attention_mask': tensor([1, 1, 1])}
{'token': 'nice', 'labels': tensor(2), 'input_ids': tensor([ 101, 3505,  102]), 'attention_mask': tensor([1, 1, 1])}
{'token': 'car', 'labels': tensor(2), 'input_ids': tensor([ 101, 1610,  102]), 'attention_mask': tensor([1, 1, 1])}
{'token': ',', 'labels': tensor(2), 'input_ids': tensor([101, 117, 102]), 'attention_mask': tensor([1, 1, 1])}
{'token': 'did', 'labels': tensor(2), 'input_ids': tensor([ 101, 1225,  102]), 'attention_mask': tensor([1, 1, 1])}
{'token': 'you', 'labels': tensor(2), 'input_ids': tensor([ 101, 1128,  102]), 'attention_mask': tensor([1, 1, 1])}
{'token': 'learn', 'labels': tensor(2), 'input_ids': tensor([ 101, 3858,  102]), 'attention_mask': tensor([1, 1, 1])}
{'token': 'Ang', 'labels': tensor(0), 'input_ids': tensor([  101, 26285,   102]), 'attention_mask': tensor([1, 1, 1])}
{'token': '##ular', 'labels': tensor(0), 'input_ids': tensor([  101,   108,   108, 23449,  1813,   102]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1])}

I'm wondering about 'labels': tensor(2) versus 'input_ids': tensor([ 101, 1188,  102]): shouldn't there be a label for each input_id?