Create own dataset for NER

Hello all,

I have the following challenge: I want to build a custom NER model with BERT.
Using these instructions (link), I have already been able to fine-tune bert-base-german-cased successfully on the german-ler dataset.
Now, in a second step, I would like to create my own data set and fine-tune the aforementioned BERT model with it.
However, I could not find anything suitable in the documentation for creating my own data set. Only something about Q&A (link). I could not find any other instructions that worked either.

Therefore, I have done the following so far: using BERT’s tokenizer, I split my texts into subwords and labelled each one in IOB2 format in a separate column. I then saved the data as a CSV file.

Alex B-PER
is O
going O
with O
Marty B-PER
A. I-PER

Next, I read the .csv into a DataFrame, converted it into a dataset, and split that dataset into a training set and a validation set.

import pandas as pd
from datasets import Dataset, DatasetDict, ClassLabel

# Read the semicolon-separated CSV; malformed lines are skipped
raw_data = pd.read_csv(r'C:\Users\dataset.csv', encoding='unicode_escape', on_bad_lines='skip', delimiter=';')

# Convert to a Dataset and encode the string labels as class ids
ges_dataset = Dataset.from_pandas(raw_data)
ges_dataset = ges_dataset.class_encode_column("label")

# 80/20 split into training and validation data
train_dataset, validation_dataset = ges_dataset.train_test_split(test_size=0.2).values()
dataDict = DatasetDict({"train": train_dataset, "validation": validation_dataset})

When I now try to read in the data set and proceed according to the instructions for fine-tuning, I unfortunately always get an error:

    label_names_1 = dataset_1["train"].features["tags"].feature.names

AttributeError: 'Value' object has no attribute 'feature'

I suspect that the error arises from the fact that I am not creating the dataset correctly.

The following questions therefore arise for me:

  1. How do I create a dataset from my own text correctly? Does anyone here happen to have good documentation or instructions?
  2. How should the dataset be structured? If I split each record completely, with one token per row, it is no longer clear where one record ends and the next begins. If I don’t, I can no longer distinguish the individual tags: for example, [B-Start, I-Start] would be treated as a single combined label rather than as two separate labels.

I would be very happy if anyone has had the same difficulties and can help me here.
Many thanks and best regards,
Tom

Did you solve this issue yet? I’m having the same issue.

Hi Dnsibu,
unfortunately, I have not yet been able to solve this.

I find this post helpful.

python - Creating HuggingFace Dataset to train an BIO tagger - Stack Overflow