Load_dataset assumes 'train'

slyle · May 30, 2023, 8:20pm

This is sort of nitpicky, but when using the following command:

load_dataset("audiofolder", data_dir="path/to/data")

it automatically assumes the resulting Dataset object inside the created DatasetDict is the ‘train’ Dataset, by naming it as such.

Why is this? It feels really awkward to write the following:

dataset = dataset["train"].train_test_split(test_size=0.3)

Thanks!

mariosasko · May 31, 2023, 12:44pm

You can pass split="train" to load_dataset to get a Dataset object.
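A minimal sketch of that suggestion, assuming the same "audiofolder" loader and placeholder path from the question. With split="train", load_dataset returns a Dataset instead of a DatasetDict, so train_test_split can be called on it directly. The helper name load_audio_splits and its default test_size are illustrative only, not part of the library.

```python
def load_audio_splits(data_dir, test_size=0.3):
    """Load an audio folder as a single Dataset, then split it.

    Hypothetical helper for illustration; only load_dataset and
    Dataset.train_test_split come from the datasets library.
    """
    # Imported here so the sketch can be read without the library installed.
    from datasets import load_dataset

    # split="train" returns a Dataset rather than a DatasetDict,
    # so no ["train"] indexing is needed afterwards.
    dataset = load_dataset("audiofolder", data_dir=data_dir, split="train")

    # The result is a DatasetDict with "train" and "test" keys.
    return dataset.train_test_split(test_size=test_size)
```

Usage would look like splits = load_audio_splits("path/to/data"), after which splits["train"] and splits["test"] hold the two parts.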

slyle · May 31, 2023, 3:53pm

Thanks for your time and the tip, Mario. I will pass that argument from now on. Good to know I’m not the only one who ran into this.
