Confusion in splitting dataset (from imagefolder) into train, test and validation

Hi! Your code produces (potentially overlapping) splits of incorrect size. This is the fixed code:

from datasets import load_dataset

ds = load_dataset("imagefolder", data_dir="/Documents/DataSetFolder/", split="train")

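# Hold out 15% of the full dataset as the test set.
ds_split_train_test = ds.train_test_split(test_size=0.15)

train_ds, test_ds = ds_split_train_test["train"], ds_split_train_test["test"]

# Hold out another 15% of the full dataset for validation: 0.15 / 0.85 of the remaining 85%.
ds_split_train_val = train_ds.train_test_split(test_size=0.15 / 0.85)

train_ds, val_ds = ds_split_train_val["train"], ds_split_train_val["test"]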
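The second test_size is 0.15/0.85 rather than 0.15 because it is applied to the 85% that remains after the test set is removed, not to the full dataset. A minimal sketch of that arithmetic in plain Python is below; three_way_split_sizes is a hypothetical helper written for illustration, not part of the datasets library, and the library's own rounding may differ by one example.

```python
def three_way_split_sizes(n_total, test_frac=0.15, val_frac=0.15):
    """Hypothetical helper: sizes of (train, val, test) for a two-stage split."""
    # Stage 1: the test fraction is taken from the full dataset.
    n_test = round(n_total * test_frac)
    n_remaining = n_total - n_test

    # Stage 2: the validation fraction must be rescaled, because it is now
    # applied to the remaining (1 - test_frac) share of the data.
    second_stage_frac = val_frac / (1.0 - test_frac)
    n_val = round(n_remaining * second_stage_frac)

    n_train = n_remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = three_way_split_sizes(1000)
print(n_train, n_val, n_test)
```

With 1000 examples this gives a 70/15/15 split, so the validation and test sets end up the same size.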

What is the significance of the split argument in load_dataset?

If specified, this argument makes load_dataset return a single dataset split/subset instead of a dictionary with all the subsets. You can think of it as being equivalent to load_dataset("imagefolder", data_dir="/Documents/DataSetFolder/")[split]. Note that this argument also supports the slicing syntax, but you shouldn’t use it here because slicing doesn’t shuffle the data.
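To see why unshuffled slicing can be a problem, here is a toy sketch in plain Python. It assumes the rows are stored grouped by class (plausible when images are read folder by folder, but an assumption here), and the labels list is made-up illustrative data, not output from load_dataset.

```python
import random

# Toy stand-in for a dataset whose rows are stored class by class (assumption).
labels = ["cat"] * 50 + ["dog"] * 50

# Slicing keeps the stored order: the last 20% contains only one class.
sliced_train = labels[:80]
sliced_test = labels[80:]
print(set(sliced_test))

# Shuffling before cutting mixes the classes across both parts.
rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled = labels[:]
rng.shuffle(shuffled)
shuffled_train, shuffled_test = shuffled[:80], shuffled[80:]
print(len(shuffled_train), len(shuffled_test))
```

In the sliced version every "cat" row lands in the training part and the held-out part is all "dog", which is why a shuffling split such as train_test_split is the safer choice here.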
