Can we download dataset from folder of text file

NahedAbdelgaber · December 30, 2021, 4:43am

Hi,
My dataset consists of many text files. I want to first use all of the text files as a corpus for LM training, and then use the files to map the labels.

Thanks

lhoestq · January 10, 2022, 11:25am

Hi! Do you mean that the labels are included in the file names?

Here is an example of how to load one of the classes using glob patterns:

from datasets import load_dataset
data_files = {"train": "path/to/data/*<class_name>*.txt"}
dataset = load_dataset("text", data_files=data_files, split="train")

Then you can add the column with the label:

dataset = dataset.add_column("label", ["<class_name>"] * len(dataset))

Finally, if you wish to combine the datasets of each class, feel free to take a look at concatenate_datasets or interleave_datasets.

NahedAbdelgaber · January 18, 2022, 6:08am

Thanks lhoestq for your reply! I only wanted to mix all the text files in a folder to get one big text file for training a language model on all the data.
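
For the "one big text file" case, a minimal sketch using only the Python standard library could look like the following. The helper name merge_text_files and the choice of UTF-8 encoding are assumptions made for illustration, not something from the thread or from the datasets library.

```python
from pathlib import Path

def merge_text_files(folder, output_path):
    """Concatenate every .txt file in `folder` into `output_path`; return the file count."""
    output_path = Path(output_path)
    # Sort for a reproducible order and skip the output file if it lives in the same folder.
    sources = [
        p for p in sorted(Path(folder).glob("*.txt"))
        if p.resolve() != output_path.resolve()
    ]
    with output_path.open("w", encoding="utf-8") as out:
        for path in sources:
            text = path.read_text(encoding="utf-8")
            # Guarantee a newline between files so documents do not run together.
            out.write(text if text.endswith("\n") else text + "\n")
    return len(sources)
```

Calling merge_text_files("path/to/data", "corpus.txt") would write the combined corpus to corpus.txt. Alternatively, a single glob such as "path/to/data/*.txt" passed as data_files to the "text" loader shown earlier should load every file into one dataset without writing a merged file first.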
