I’m writing a custom pipeline to train a DistilBERT model for token classification on some data that I got in a CoNLL-style file. This file is already split into sentences, for example:
Hello	0
world	0

This	0
is	0
a	0
post	0
Sentences are separated by an empty line, as required by the run_ner.py script. Right now, I’m loading the dataset with the datasets package, like this:
from datasets import load_dataset

dataset = load_dataset('csv', data_files={
    'train': DATA_FOLDER + 'train.csv',
    'test': DATA_FOLDER + 'test.csv',
    'dev': DATA_FOLDER + 'dev.csv'
}, sep="\t", column_names=["tokens", "tags"])
But it looks like the sentence boundaries (the empty lines) are ignored and all the tokens are loaded as one long consecutive sequence.
Is there a way to make load_dataset understand that the data is split into sentences, each made of tokens?
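In case it helps to show what I mean, here is a minimal sketch of the grouping I’m after: parse the file myself, splitting on blank lines so each record holds one sentence’s tokens and tags (the function name parse_conll is just my own placeholder, not a datasets API).

```python
def parse_conll(lines):
    """Yield one {"tokens": [...], "tags": [...]} dict per sentence,
    treating a blank line as a sentence boundary."""
    tokens, tags = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line = end of sentence
            if tokens:
                yield {"tokens": tokens, "tags": tags}
                tokens, tags = [], []
        else:
            token, tag = line.split("\t") # tab-separated token/tag pair
            tokens.append(token)
            tags.append(int(tag))
    if tokens:                            # flush a final sentence with no trailing blank line
        yield {"tokens": tokens, "tags": tags}
```

The resulting records could then be turned into a dataset with something like `Dataset.from_list(list(parse_conll(open(DATA_FOLDER + "train.csv"))))`, but I’d prefer a built-in way if one exists.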