So I am building my first BERT token classifier. I am using a German Polyglot dataset, meaning each row is a list of tokenized words with a parallel list of NER labels.
A row looks like `['word1', 'word2', …]` with labels `['ORG', 'LOC', …]`.
This is my code:

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification, TrainingArguments

tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')

encoded_dataset = [
    tokenizer(item['words'], is_split_into_words=True, return_tensors="pt",
              padding='max_length', truncation=True, max_length=128)
    for item in dataset_1
]

model = BertForTokenClassification.from_pretrained('bert-base-german-cased',
                                                   num_labels=1)

# return_tensors="pt" adds a batch dimension of 1; squeeze it back out
for item in encoded_dataset:
    for key in item:
        item[key] = torch.squeeze(item[key])

train_set = encoded_dataset[:500]
test_set = encoded_dataset[500:]

training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False,  # defaults to False anyway, just to be explicit
)
```
I think you do not need to loop over dataset_1; you can pass the words column (dataset_1['words']) directly to the tokenizer, or convert the data to the Dataset format (Datasets — datasets 1.16.1 documentation).
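A minimal sketch of both options, assuming dataset_1 is a list of dicts with a 'words' key and a 'ner' label key (the 'ner' name is a guess; use whatever your label field is actually called):

```python
from datasets import Dataset

# Option 1: tokenize the whole column in one batched call.
# is_split_into_words=True tells the tokenizer each entry is
# already a list of words, not a raw string.
encoded = tokenizer(
    [item['words'] for item in dataset_1],
    is_split_into_words=True,
    padding='max_length',
    truncation=True,
    max_length=128,
)

# Option 2: convert to a datasets.Dataset and tokenize with .map.
ds = Dataset.from_dict({
    'words': [item['words'] for item in dataset_1],
    'ner':   [item['ner'] for item in dataset_1],  # hypothetical label key
})
ds = ds.map(
    lambda batch: tokenizer(
        batch['words'],
        is_split_into_words=True,
        padding='max_length',
        truncation=True,
        max_length=128,
    ),
    batched=True,
)
```

Either way you avoid the per-item return_tensors="pt" call and the squeeze loop, since the batched outputs are plain lists that the Trainer's default collator can handle.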