I am trying to load images from a dataset for fine-tuning a Vision Transformer model. My dataset has the following structure:
I am quite confused about how to split the dataset into train, test, and validation sets. I have read various similar questions but couldn't understand the process clearly. I have tried the following (a train-test-validation split in a non-random manner, though I would actually like it to be split randomly):
from datasets import load_dataset
ds = load_dataset("imagefolder", data_dir="/Documents/DataSetFolder/", split="test")
# split up data into train + test
splits = ds.train_test_split(test_size=0.3)
train_ds = splits['train']
test_ds = splits['test']
# split up data into val + test
splits = ds.train_test_split(test_size=0.15)
test_ds = splits['test']
val_ds = splits['test']
Is this a correct process for randomly splitting the dataset into 70% training, 15% test, and 15% validation? Also, how can I do a random train-test split, and what is the significance of the split argument in load_dataset? I would really appreciate some guidance, as I am still confused even after reading the documentation at https://huggingface.co/docs/datasets/image_process.
what is the significance of the split argument in load_dataset?
If specified, this argument returns a single concrete dataset split instead of a dictionary with all the splits. You can think of it as equivalent to load_dataset("imagefolder", data_dir="/Documents/DataSetFolder/")[split]. Note that this argument also supports slicing syntax (e.g. split="train[:70%]"), but you shouldn't use it for your splits here, as it doesn't shuffle the data.
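For example (a minimal sketch reusing the data_dir from your question; note that unless your folder contains split subdirectories such as train/ and test/, the imagefolder loader puts everything into a single "train" split, so split="test" may fail):

from datasets import load_dataset

# Without `split`: returns a DatasetDict keyed by split name
ds_dict = load_dataset("imagefolder", data_dir="/Documents/DataSetFolder/")

# With `split`: returns that one Dataset directly
ds = load_dataset("imagefolder", data_dir="/Documents/DataSetFolder/", split="train")

As for the random 70/15/15 split: train_test_split shuffles by default, so your approach is close, but you should split the held-out 30% a second time rather than splitting the full dataset twice (and val_ds = splits['test'] in your code makes validation and test identical). A sketch, with an arbitrary seed for reproducibility:

# 70% train, 30% held out for validation + test
splits = ds.train_test_split(test_size=0.3, seed=42)
train_ds = splits['train']

# Split the 30% hold-out in half: 15% validation, 15% test
holdout = splits['test'].train_test_split(test_size=0.5, seed=42)
val_ds = holdout['train']
test_ds = holdout['test']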