Hi,
My dataset consists of many text files. I want to first use all the text files as a corpus for LM training. Then I will use the files to map the labels from them.
Thanks
Hi ! Do you mean that the labels are included in the file names ?
Here is an example of how to load one of the classes using glob patterns:
from datasets import load_dataset
data_files = {"train": "path/to/data/*<class_name>*.txt"}
dataset = load_dataset("text", data_files=data_files, split="train")
Then you can add the column with the label:
dataset = dataset.add_column("label", ["<class_name>"] * len(dataset))
Finally, if you wish to combine the datasets of each class, feel free to take a look at concatenate_datasets or interleave_datasets.
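Putting the steps above together, here is a minimal sketch that loads each class, adds its label, and concatenates the results. The helper names `class_pattern` and `load_labelled_dataset` are hypothetical, and the sketch assumes every class name appears in the names of its files and that each pattern matches at least one file.

```python
def class_pattern(data_dir, class_name):
    """Glob pattern matching the .txt files whose name contains `class_name`."""
    return f"{data_dir}/*{class_name}*.txt"


def load_labelled_dataset(data_dir, class_names):
    """Load one dataset per class, add a "label" column, and concatenate them."""
    # Imported here so the pattern helper above works without `datasets` installed.
    from datasets import concatenate_datasets, load_dataset

    per_class = []
    for class_name in class_names:
        data_files = {"train": class_pattern(data_dir, class_name)}
        ds = load_dataset("text", data_files=data_files, split="train")
        # Every row loaded from this pattern gets the same label.
        ds = ds.add_column("label", [class_name] * len(ds))
        per_class.append(ds)
    return concatenate_datasets(per_class)
```

With two hypothetical classes this would be called as load_labelled_dataset("path/to/data", ["pos", "neg"]). concatenate_datasets appends the per-class datasets one after another; interleave_datasets would alternate between them instead.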
Thanks lhoestq for your reply! I only wanted to combine all the text files in a folder into one big text file, so that I can train a language model on all the data.
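For that simpler goal, a minimal sketch using only the Python standard library could look like the following. The function name `merge_text_files` is hypothetical, and the sketch assumes the files are UTF-8 encoded and sit directly in the folder (no subfolders).

```python
from pathlib import Path


def merge_text_files(folder, output_path):
    """Concatenate every .txt file in `folder` into one file, in sorted order.

    Returns the number of files that were merged.
    """
    output_path = Path(output_path)
    # Skip the output file itself in case it lives in the same folder.
    files = sorted(
        p for p in Path(folder).glob("*.txt")
        if p.resolve() != output_path.resolve()
    )
    with output_path.open("w", encoding="utf-8") as out:
        for path in files:
            text = path.read_text(encoding="utf-8")
            out.write(text)
            # Make sure the next file starts on a new line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(files)
```

For example, merge_text_files("path/to/data", "corpus.txt") would write the combined corpus to corpus.txt. If a single merged file is not strictly needed, passing a glob such as "path/to/data/*.txt" as data_files to load_dataset("text", ...) loads all the files as one dataset.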