Hello, I'm trying to fine-tune jjzha/jobbert_skill_extraction with a very short CSV training set.
To create the CSV I used the tokenizer from the JobBERT model to write the tokens into the left column and the labels (as integers) into the right one (plus padding), so it looks exactly like:
token,labels
[CLS],2
This,2
is,2
a,2
nice,2
…
[PAD],2
…
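For context, the CSV is generated with roughly this sketch (using tokenizer, text, max_length and csv_file_path as defined in the complete code below; in the real file the labels are assigned by hand, not all set to 2):

import csv
# Tokenize the full text once with padding, then write one token per row.
encoding = tokenizer(text, padding="max_length", max_length=max_length, truncation=True)
with open(csv_file_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["token", "labels"])
    for input_id in encoding["input_ids"]:
        # label 2 (= O) as a placeholder here; the real labels are set manually
        writer.writerow([tokenizer.convert_ids_to_tokens(input_id), 2])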
I got an error when I tried load_dataset with the "csv" param, so I went with:
df = pd.read_csv(csv_file_path)
train_df = df.sample(frac=0.8, random_state=42)
train_dataset = Dataset.from_pandas(train_df)
train_dataset = train_dataset.map(tokenization, batched=True, batch_size=batch_size)
train_dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
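For reference, the load_dataset attempt that errored looked roughly like this (reconstructed from memory, the exact call may have differed):

from datasets import load_dataset
# this raised an error for me, hence the pandas detour above
dataset = load_dataset("csv", data_files=csv_file_path)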
and for the training setup:
batch_size = 8
training_args = TrainingArguments(
    "trained_model",
    per_device_train_batch_size=batch_size,  # per_gpu_* is deprecated
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    save_steps=100
)
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer),
    tokenizer=tokenizer
)
I'm getting a TypeError ("len() of a 0-d tensor") with this data collator; switching to DataCollatorWithPadding, it tells me that an input batch_size of 24 does not match the target batch_size of 8.
What is wrong here? I would also like to understand a bit better what is happening. This method:
def tokenization(example):
    return tokenizer(example["token"])
creates a lot of tensors in the dataset: input_ids and all the other fields are always something like [0,1,1] (with three numbers), but the tensor for the labels field is only tensor(2) (a single number). Since I'm also getting an error about batch_size (when I use the other collator), I wonder if the reason might be that labels is not correctly transformed into a tensor?
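To illustrate, this is the kind of shape mismatch I mean (a minimal check; the printed shapes are examples from my run):

sample = train_dataset[0]
print(sample["input_ids"].shape)  # e.g. torch.Size([3]) - a 1-d tensor per row
print(sample["labels"].shape)     # torch.Size([]) - a 0-d tensor, just tensor(2)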
Here is my complete code:
import pandas as pd
from transformers import AutoTokenizer, AutoConfig, BertForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import Dataset, get_dataset_split_names
from torch.nn import CrossEntropyLoss
# load tokenizer and model
checkpoint = "jjzha/jobbert_knowledge_extraction"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = BertForTokenClassification.from_pretrained(checkpoint)
max_length = 128
# get the labels printed
config = AutoConfig.from_pretrained(checkpoint)
labels = config.id2label
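print(labels)  # id2label mapping, e.g. {0: 'B', 1: 'I', 2: 'O'}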
# load the text that you want to use for training
text = "This is a nice car, did you learn Angular or Vue to finance that? I heared that Java is the new hotshot here. Whats the best framework to use within Java to earn money?"
# tokenize the token column into input_ids; label ids: 0=B, 2=O
def tokenization(example):
    return tokenizer(example["token"])  # return_tensors="pt", padding="max_length", truncation=True
# read the CSV file with pandas (load_dataset with "csv" errored, see above)
csv_file_path = "./datset/tokenized_text_single.csv"
df = pd.read_csv(csv_file_path)
# Split the DataFrame into training and validation DataFrames
train_df = df.sample(frac=0.8, random_state=42)
val_df = df.drop(train_df.index)
batch_size = 8
# Convert the DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
train_dataset = train_dataset.map(tokenization, batched=True, batch_size=batch_size)
train_dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
for d in train_dataset:
    print("DATASET three", d)
val_dataset = Dataset.from_pandas(val_df)
val_dataset = val_dataset.map(tokenization, batched=True, batch_size=batch_size)
val_dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
training_args = TrainingArguments(
    "trained_model",
    per_device_train_batch_size=batch_size,  # per_gpu_* is deprecated
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    save_steps=100
)
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer),
    tokenizer=tokenizer
)
# Train the model
trainer.train()