load_dataset assumes 'train'

This is sort of nitpicky, but when using the following command:

load_dataset("audiofolder", data_dir="path/to/data")

it automatically treats the single Dataset object inside the resulting DatasetDict as the 'train' split and names it accordingly.

Why is this? It feels really awkward to write the following:

dataset = dataset['train'].train_test_split(test_size=0.3)

Thanks!

You can pass split="train" to load_dataset to get a Dataset object.

I agree this is not the best design, so we will eventually start returning the concatenation of all the splits by default (see issue #5189 in huggingface/datasets on GitHub, "Reduce friction in tabular dataset workflow by eliminating having splits when dataset is loaded").

Thanks for your time and the tip, Mario. I will pass that argument from now on. Good to know I'm not the only one who ran into this :smiley: