Can load_dataset load entire text files instead of splitting on newlines?

Hi all,

I’m trying to train a language model using longformer-base-4096. I have a folder full of text files that I’d like to train on, some of which are multiple paragraphs long.

When I use load_dataset like this:

dataset = load_dataset('text', data_files={'train': train_files})

it appears to split each document on line breaks and treat each line as a separate datapoint. However, I’d like to use the entire document as a single datapoint.

Here’s a very simple reproduction example:

from datasets import load_dataset
import os

os.makedirs('train', exist_ok=True)  # create the folder, without erroring if it already exists

txt = """
Hello world \n\n
This is a test \n\n
"""
with open('train/text.txt', 'w') as f:
    f.write(txt)

dataset = load_dataset('text', data_files={'train': ['train/text.txt']})
print(dataset['train'])

This outputs a dataset with 5 rows.

I’d like a dataset with one row in this case.

Is this possible?

Hi! Yes, just set sample_by to "document" in load_dataset:

dataset = load_dataset('text', data_files={'train': ['train/text.txt']}, sample_by="document")
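For reference, here’s a quick check of the result (assuming a reasonably recent datasets release where the text builder accepts the sample_by argument); the whole file comes back as a single row:

dataset = load_dataset('text', data_files={'train': ['train/text.txt']}, sample_by="document")

print(dataset['train'].num_rows)     # 1
print(dataset['train'][0]['text'])   # the full contents of text.txt

If you’d rather split on blank lines instead of whole files, I believe sample_by="paragraph" is also accepted.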