Can we load a dataset from a folder of text files?

My dataset consists of many text files. I want to first use all of the text files as a corpus for LM training, and then use the files to map the labels.


Hi! Do you mean that the labels are included in the file names?

Here is an example of how to load one of the classes using glob patterns:

from datasets import load_dataset
data_files = {"train": "path/to/data/*<class_name>*.txt"}
dataset = load_dataset("text", data_files=data_files, split="train")

Then you can add the column with the label:

dataset = dataset.add_column("label", ["<class_name>"] * len(dataset))

Finally, if you wish to combine the datasets of each class, feel free to take a look at concatenate_datasets or interleave_datasets.

Thanks lhoestq for your reply! I only wanted to mix all the text files in a folder into one big text file for training a language model on all the data.
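
For the "one big text file" case, a plain-Python sketch along these lines might be enough. It uses only the standard library rather than anything from datasets, and the function name merge_text_files and its parameters are hypothetical. It assumes the files are UTF-8 and that sorting by file name is an acceptable order.

```python
from pathlib import Path


def merge_text_files(folder, output_path):
    """Hypothetical helper: concatenate every .txt file in `folder`.

    Writes the result to `output_path` and returns how many files
    were merged. Assumes UTF-8 input files.
    """
    output_path = Path(output_path)
    # Collect the sources first, skipping the output file itself in case
    # it lives in the same folder from an earlier run.
    sources = sorted(
        p for p in Path(folder).glob("*.txt")
        if p.resolve() != output_path.resolve()
    )
    with output_path.open("w", encoding="utf-8") as out:
        for src in sources:
            text = src.read_text(encoding="utf-8")
            out.write(text)
            # Keep files separated even if one lacks a trailing newline.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(sources)
```

The merged file can then be used as a single training corpus. Reading each file fully into memory keeps the sketch short, so very large files would probably call for copying in chunks instead.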