So I am building my first BERT token classifier. I am using a German Polyglot dataset, meaning tokenised words and lists of NER labels.
A row looks like ['word1', 'word2', …] ['ORG', 'LOC', …].
This is my code:
import torch
from transformers import BertTokenizer, BertForTokenClassification, Trainer, TrainingArguments

tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')
encoded_dataset = [tokenizer(item['words'], is_split_into_words=True, return_tensors="pt",
                             padding='max_length', truncation=True, max_length=128)
                   for item in dataset_1]
model = BertForTokenClassification.from_pretrained('bert-base-german-cased', num_labels=1)

# squeeze away the extra batch dimension that return_tensors="pt" adds
for item in encoded_dataset:
    for key in item:
        item[key] = torch.squeeze(item[key])

train_set = encoded_dataset[:500]
test_set = encoded_dataset[500:]

training_args = TrainingArguments(
    output_dir='./results',
    no_cuda=False,  # defaults to False anyway, just to be explicit
)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_set, eval_dataset=test_set)
trainer.train()
And I am getting a KeyError: 'loss'.
Could you post the full error?
The problem seems to be in the Trainer. How is your data encoded? Can you show the shape, the type, and how it looks before passing it to the Trainer?
Your num_labels = 1. Are you doing single-label classification?
Try setting num_train_epochs to a float, e.g. 1.0, to see if that works, and also check the number of labels: is it really 1 label in your training data?
The num_labels was a mistake; I changed it to 4 since there are 4 types. I didn't do any further encoding of the data beyond this code.
When you change the label count, does it output the same error?
Yes, and the float number of epochs doesn't change it either.
Could you print dataset_1 to see how it looks?
I think maybe you should convert your data to the Dataset type and then rewrite it like this:
tokenized_dataset = dataset_1.map(lambda x: tokenizer(x['words'], is_split_into_words=True, return_tensors="pt", padding='max_length', truncation=True, max_length=128))
I think you don't need to loop over dataset_1; rather pass the words column, dataset_1['words'], directly to the tokenizer, or transform it to the Dataset format (see Datasets — datasets 1.16.1 documentation).
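Something like this, as a minimal sketch: it assumes dataset_1 is a plain Python list of dicts and that the label column is called 'ner' (both column names are taken from your example, adjust as needed).

from datasets import Dataset

# build a datasets.Dataset from the list of dicts (column names assumed)
hf_dataset = Dataset.from_dict({
    'words': [item['words'] for item in dataset_1],
    'ner': [item['ner'] for item in dataset_1],
})

# map the tokenizer over the whole dataset; no return_tensors here,
# the Trainer converts the stored lists to tensors itself
tokenized_dataset = hf_dataset.map(
    lambda x: tokenizer(x['words'], is_split_into_words=True,
                        padding='max_length', truncation=True, max_length=128),
    batched=True,
)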
The tokenized dataset didn't work. I think I need to do some label encoding first for the NER tags, but I am not sure how to go about that.
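For the label encoding, the usual approach looks something like the sketch below. It assumes a tag set of ORG/LOC/PER/MISC (your four types may differ) and uses the fast tokenizer, since word_ids() is only available there. Each string tag is mapped to an integer and aligned to the subword tokens, with -100 on special tokens and padding so the loss ignores them. Once every example carries a 'labels' key, the Trainer can compute a loss and the KeyError: 'loss' should go away.

from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained('bert-base-german-cased')

# assumed tag set -- replace with your actual four NER types
label2id = {'ORG': 0, 'LOC': 1, 'PER': 2, 'MISC': 3}

def tokenize_and_align(example):
    encoding = tokenizer(example['words'], is_split_into_words=True,
                         padding='max_length', truncation=True, max_length=128)
    labels = []
    for word_idx in encoding.word_ids():
        if word_idx is None:
            labels.append(-100)  # special token or padding: ignored by the loss
        else:
            # every subword piece gets its word's tag; labelling only the
            # first piece is a common alternative
            labels.append(label2id[example['ner'][word_idx]])
    encoding['labels'] = labels
    return encoding

tokenized_dataset = hf_dataset.map(tokenize_and_align)

model = BertForTokenClassification.from_pretrained('bert-base-german-cased',
                                                   num_labels=len(label2id))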