Creating my own Dataset

Hello,

I am trying to train a transformer on my own text dataset, which requires passing it a Dataset argument. However, I have read through the Datasets guide Overview, and it seems to only discuss loading existing datasets. Maybe I missed something, but how can I create my own Dataset object from a directory of text files?

Thanks!


Hello, if your dataset is not going to be very large, you can create a CSV file in Excel and upload it to Hugging Face.
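If you would rather build the CSV in code than in Excel, here is a minimal sketch. The helper names `write_text_csv` and `load_csv_dataset`, and the column names `text` and `label`, are my own placeholders, not anything the library requires. The only `datasets` call used is `load_dataset("csv", data_files=...)`.

```python
import csv
from pathlib import Path


def write_text_csv(rows, csv_path):
    """Write (text, label) pairs to a CSV file with a header row.

    The column names "text" and "label" are illustrative assumptions.
    """
    csv_path = Path(csv_path)
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerows(rows)
    return csv_path


def load_csv_dataset(csv_path):
    """Load the CSV as a dataset (requires the `datasets` package)."""
    # Imported lazily so write_text_csv works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("csv", data_files=str(csv_path))
```

Calling `load_csv_dataset(write_text_csv(rows, "my_data.csv"))` should give you a dataset with a single `train` split, since no split mapping is passed.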

Hi @yugen2, thanks for the feedback! We’re working on updating the tutorials to include something on how to create your own dataset.

To create your own dataset, you can do:

from datasets import load_dataset

# Option 1: map specific text files to different splits
dataset = load_dataset("text", data_files={"train": ["my_text_1.txt", "my_text_2.txt"], "test": "my_test_file.txt"})

# Option 2: load every text file found in a directory
dataset = load_dataset("text", data_dir="path/to/text/dataset")

Check out the Load text data guide for more information, and if you’re curious about loading other file formats, take a look at our Local and remote files guide. :smile:
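For the original question (a directory of text files), here is a rough sketch of building the `data_files` mapping automatically. It assumes a layout with one subfolder per split, such as `root/train/*.txt` and `root/test/*.txt`. That layout and the helper names `collect_text_files` and `load_text_dataset` are assumptions of mine, not something the library prescribes.

```python
from pathlib import Path


def collect_text_files(root, splits=("train", "test")):
    """Map each split name to the sorted .txt files under root/<split>/.

    Splits whose folder is missing or empty are left out of the mapping.
    """
    root = Path(root)
    data_files = {}
    for split in splits:
        files = sorted(str(p) for p in (root / split).glob("*.txt"))
        if files:
            data_files[split] = files
    return data_files


def load_text_dataset(root):
    """Load the collected files with the "text" loader (requires `datasets`)."""
    # Imported lazily so collect_text_files works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("text", data_files=collect_text_files(root))
```

By default the `text` loader produces one example per line, stored in a column named `text`, so `load_text_dataset("path/to/text/dataset")["train"][0]` would be the first line of the first training file.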