I am trying to train a transformer on my own text dataset, which requires passing it a Dataset argument. However, I have read through the Datasets guide Overview, and it seems to only discuss loading existing datasets. Maybe I missed something, but how can I create my own Dataset object from a directory of text files?
Hi @yugen2, thanks for the feedback! We're working on updating the tutorials to cover how to create your own dataset.
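In the meantime, you can create a dataset from your own text files with load_dataset and the "text" loader, which turns each line of each file into one example: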
from datasets import load_dataset
# map specific text files to different splits
dataset = load_dataset("text", data_files={"train": ["my_text_1.txt", "my_text_2.txt"], "test": "my_test_file.txt"})
# alternatively, load all the text files found in a directory
dataset = load_dataset("text", data_dir="path/to/text/dataset")
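If you want explicit control over which files end up in which split, one option is to build the data_files mapping yourself and pass it to load_dataset. Here is a minimal sketch, assuming a layout with one subdirectory per split (for example path/to/text/dataset/train/ and path/to/text/dataset/test/). The helper names collect_text_files and load_text_dataset are made up for illustration and are not part of the library:

```python
from pathlib import Path


def collect_text_files(root):
    """Map each split subdirectory (e.g. train/, test/) to its sorted .txt files.

    Hypothetical helper: assumes one subdirectory per split under `root`.
    Subdirectories without any .txt files are skipped.
    """
    root = Path(root)
    data_files = {}
    for split_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(str(f) for f in split_dir.glob("*.txt"))
        if files:
            data_files[split_dir.name] = files
    return data_files


def load_text_dataset(root):
    """Load the collected files with the "text" loader (one example per line)."""
    # Imported here so collect_text_files can be used without datasets installed.
    from datasets import load_dataset

    return load_dataset("text", data_files=collect_text_files(root))
```

You would then call load_text_dataset("path/to/text/dataset") and get back one split per subdirectory, each with a single "text" column.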