Can load_dataset load entire text files instead of splitting on newlines?

Hi all,

I’m trying to train a language model using longformer-base-4096. I have a folder full of text files that I’d like to train on, some of which are multiple paragraphs long.

When I use load_dataset like this:

dataset = load_dataset('text', data_files={'train': train_files})

it appears to split each document on line breaks and treat each line as a separate datapoint. However, I’d like to use the entire document as a single datapoint.

Here’s a very simple reproduction example:

from datasets import load_dataset
import os

os.makedirs('train', exist_ok=True)  # create the folder, without erroring if it already exists

txt = """
Hello world \n\n
This is a test \n\n
"""
with open('train/text.txt', 'w') as f:
    f.write(txt)

dataset = load_dataset('text', data_files={'train': ['train/text.txt']})
print(dataset['train'])

This outputs a dataset with 5 rows.

I’d like a dataset with one row in this case.

Is this possible?

Hi! Yes, just set sample_by to "document" in load_dataset:

dataset = load_dataset('text', data_files={'train': ['train/text.txt']}, sample_by="document")
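For reference, here’s a quick check of the result (assuming a reasonably recent datasets release where the text builder accepts the sample_by argument); the whole file comes back as a single row:

dataset = load_dataset('text', data_files={'train': ['train/text.txt']}, sample_by="document")

print(dataset['train'].num_rows)     # 1
print(dataset['train'][0]['text'])   # the full contents of text.txt

If you’d rather split on blank lines instead of whole files, I believe sample_by="paragraph" is also accepted.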