Hi! Your code produces (potentially overlapping) splits of incorrect size. This is the fixed code:
from datasets import load_dataset
ds = load_dataset("imagefolder", data_dir="/Documents/DataSetFolder/", split="train")
ds_split_train_test = ds.train_test_split(test_size=0.15)
train_ds, test_ds = ds_split_train_test["train"], ds_split_train_test["test"]
# 0.15/0.85 of the remaining 85% equals 15% of the full dataset
ds_split_train_val = train_ds.train_test_split(test_size=0.15/0.85)
train_ds, val_ds = ds_split_train_val["train"], ds_split_train_val["test"]
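In case the 0.15/0.85 looks odd: the second split is applied to the 85% that remains after the test set is removed, so the fraction has to be rescaled. Here is a small sketch of that arithmetic using exact fractions. The helper name `second_split_fraction` is made up for illustration and is not part of `datasets`, and the actual split sizes may differ by an example or so because of rounding.

```python
from fractions import Fraction


def second_split_fraction(test_frac, val_frac):
    """Fraction of the remaining data that yields val_frac of the full dataset."""
    remaining = 1 - test_frac  # share of the data left after the test split
    return val_frac / remaining


test_frac = Fraction(15, 100)  # 15% of the full dataset for testing
val_frac = Fraction(15, 100)   # 15% of the full dataset for validation

rel = second_split_fraction(test_frac, val_frac)
val_of_total = rel * (1 - test_frac)
train_of_total = 1 - test_frac - val_of_total

print(rel, val_of_total, train_of_total)  # 3/17 3/20 7/10
```

So the result is a 70/15/15 train/validation/test split, measured against the full dataset.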
What is the significance of the split argument in load_dataset?
If specified, this argument makes load_dataset return a concrete dataset split/subset instead of a dictionary with all the subsets. You can think of it as being equivalent to load_dataset("imagefolder", data_dir="/Documents/DataSetFolder/")[split]. Note that this argument also supports the slicing syntax (e.g. split="train[:85%]"), but you shouldn't use it here, as slicing takes the examples in their stored order and doesn't shuffle the data.
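To see why an unshuffled slice can be a problem, here is a toy sketch in plain Python (no `datasets` involved). It assumes the examples are stored grouped by class, which is likely for a folder-per-class layout but is an assumption here. The labels and sizes are made up.

```python
import random

# Hypothetical labels stored in class order: 50 "cat" examples, then 50 "dog"
labels = ["cat"] * 50 + ["dog"] * 50

# Slicing off the last 15% without shuffling picks a single class
n_test = int(0.15 * len(labels))
sliced_test = labels[-n_test:]

# Shuffling first (fixed seed for reproducibility) draws from the whole list
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
shuffled_test = shuffled[-n_test:]

print(set(sliced_test))    # {'dog'}
print(len(shuffled_test))  # 15
```

train_test_split shuffles by default, so it avoids this kind of skew.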