Confusion in splitting dataset (from imagefolder) into train, test and validation

mariosasko · July 13, 2022, 3:02pm

Hi! Your code produces (potentially overlapping) splits of incorrect size. This is the fixed code:

from datasets import load_dataset

# Without split subfolders, imagefolder puts every example in the "train" split
ds = load_dataset("imagefolder", data_dir="/Documents/DataSetFolder/", split="train")

ds_split_train_test = ds.train_test_split(test_size=0.15)

train_ds, test_ds = ds_split_train_test["train"], ds_split_train_test["test"]

# 0.15/0.85 of the remaining 85% equals 15% of the full dataset
ds_split_train_val = train_ds.train_test_split(test_size=0.15/0.85)

train_ds, val_ds = ds_split_train_val["train"], ds_split_train_val["test"]
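The 0.15/0.85 in the second call follows from the first split: once 15% of the data is held out for testing, 85% remains, so taking 15/85 of that remainder gives a validation set that is 15% of the original. Here is a minimal pure-Python sketch of the same two-stage arithmetic on a list of indices. Note that `two_stage_split` is a hypothetical helper written for illustration, and its use of `round()` is an assumption, not necessarily how `train_test_split` rounds internally.

```python
import random


def two_stage_split(n_examples, test_frac=0.15, val_frac=0.15, seed=0):
    """Split range(n_examples) into disjoint train/val/test index lists.

    Hypothetical helper: it mirrors the two-stage idea only and rounds
    sizes with round(); the library's own rounding may differ.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)

    # Stage 1: hold out the test set from the full data.
    n_test = round(n_examples * test_frac)
    test_idx, rest = indices[:n_test], indices[n_test:]

    # Stage 2: the validation fraction is relative to what is left,
    # so 15% of the original becomes 0.15 / 0.85 of the remainder.
    rel_val_frac = val_frac / (1.0 - test_frac)
    n_val = round(len(rest) * rel_val_frac)
    val_idx, train_idx = rest[:n_val], rest[n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = two_stage_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Because each stage only partitions what it is given, the three index lists cannot overlap, which is the property the original code was missing.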

What is the significance of the split argument in load_dataset?

If specified, this argument makes load_dataset return a single split (a Dataset) instead of a dictionary (a DatasetDict) containing all the splits. You can think of it as equivalent to load_dataset("imagefolder", data_dir="/Documents/DataSetFolder/")[split]. Note that this argument also supports slicing syntax, but you shouldn't use it here because slicing takes rows in their stored order and doesn't shuffle the data, whereas train_test_split shuffles by default.
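To see why an unshuffled slice (for example something like split="train[85%:]") can be a problem, here is a small pure-Python sketch. It assumes the examples are stored grouped by class, which is an assumption about the stored order and not something stated in this thread; the `labels` list and the 85/15 cut are made up for illustration.

```python
import random
from collections import Counter

# Hypothetical labels stored grouped by class: 50 "cat" followed by 50 "dog".
labels = ["cat"] * 50 + ["dog"] * 50

# Contiguous slice: take the last 15% of the rows without shuffling.
cut = len(labels) * 85 // 100
sliced_test = labels[cut:]

# Shuffle a copy first, then take the same number of rows.
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
shuffled_test = shuffled[cut:]

print(Counter(sliced_test))    # only "dog" appears in the unshuffled slice
print(Counter(shuffled_test))  # typically a mix of both classes
```

Under that ordering assumption, the sliced test set contains a single class, while the shuffled one is much more likely to resemble the full dataset.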
