Hi all,
I’m trying to train a language model using longformer-base-4096. I have a folder full of text files that I’d like to train on, some of which are multiple paragraphs long.
When I use load_dataset like this:
dataset = load_dataset('text', data_files={'train': train_files})
It appears to split each document on line breaks and treat each line as a separate datapoint. However, I’d like each entire document to be a single datapoint.
Here’s a very simple reproduction example:
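from datasets import load_dataset
import os

# Create the folder once; exist_ok avoids an error if it is already there
os.makedirs('train', exist_ok=True)

# A single document containing several line breaks
txt = """
Hello world \n\n
This is a test \n\n
"""
with open('train/text.txt', 'w') as f:
    f.write(txt)

# Load the file with the 'text' builder and inspect the result
dataset = load_dataset('text', data_files={'train': ['train/text.txt']})
print(dataset['train'])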
This outputs a dataset with 5 rows.
I’d like a dataset with one row in this case.
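To make the goal concrete, here is a rough sketch of the result I’m after, built by hand with only the standard library. The helper name read_documents is my own invention for illustration, not part of the datasets API, and the temporary folder just stands in for my real train folder.

```python
from pathlib import Path
import tempfile


def read_documents(paths):
    """Hypothetical helper: return a column dict with one entry per file.

    Each file is read in full, so line breaks stay inside the document
    instead of splitting it into several rows.
    """
    return {"text": [Path(p).read_text(encoding="utf-8") for p in paths]}


# Recreate the same single document as in the reproduction example
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "text.txt"
    path.write_text("\nHello world \n\n\nThis is a test \n\n\n", encoding="utf-8")
    columns = read_documents([path])

# One file in, one row out
print(len(columns["text"]))
```

I believe a dict shaped like this could be passed to datasets.Dataset.from_dict as a workaround, but I’d rather have load_dataset do it directly.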
Is this possible?