I’m writing a custom pipeline to train a DistilBERT model for token classification on some data that I got in a CoNLL-style file. This file is already split into sentences, for example:
Hello	0
world	0

This	0
is	0
a	0
post	0
Sentences are separated by an empty line, as required by the run_ner.py script. Right now, I’m loading the dataset with the datasets package, like this:
from datasets import load_dataset

dataset = load_dataset('csv', data_files={
    'train': DATA_FOLDER + 'train.csv',
    'test': DATA_FOLDER + 'test.csv',
    'dev': DATA_FOLDER + 'dev.csv'
}, sep="\t", column_names=["tokens", "tags"])
But it looks like the sentence boundaries (the empty lines) are ignored and all the tokens are loaded as one long consecutive sequence.
Is there a way to make load_dataset understand that the data is split into sentences, each made of tokens?
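In case it helps to show what I mean, here is a minimal sketch of the grouping I’m after: parse the file myself, splitting on blank lines so each record holds one sentence’s tokens and tags (the function name parse_conll is just my own placeholder, not a datasets API).

```python
def parse_conll(lines):
    """Yield one {"tokens": [...], "tags": [...]} dict per sentence,
    treating a blank line as a sentence boundary."""
    tokens, tags = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line = end of sentence
            if tokens:
                yield {"tokens": tokens, "tags": tags}
                tokens, tags = [], []
        else:
            token, tag = line.split("\t") # tab-separated token/tag pair
            tokens.append(token)
            tags.append(int(tag))
    if tokens:                            # flush a final sentence with no trailing blank line
        yield {"tokens": tokens, "tags": tags}
```

The resulting records could then be turned into a dataset with something like `Dataset.from_list(list(parse_conll(open(DATA_FOLDER + "train.csv"))))`, but I’d prefer a built-in way if one exists.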