This is my dataset structure:
ROOT
| -- train
| ---- class_1
| ---- class_2
| -- val
| ---- class_1
| ---- class_2
I want to load this into a Dataset. I'm expecting the resulting Dataset to have a train split and a val split, but I only get a 'train' partition from the following:
from datasets import load_dataset

dataset = load_dataset(
    "imagefolder",
    data_dir=ROOT
)
>> dataset
DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 20
    })
})
I can read the two directories separately, as follows, but it is confusing that each resulting object always has a single split named train.
train_dataset = load_dataset(
    "imagefolder",
    data_dir=os.path.join(ROOT, "train")
)
val_dataset = load_dataset(
    "imagefolder",
    data_dir=os.path.join(ROOT, "val")
)
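As a stopgap, the two separately loaded results above could be regrouped by hand. The following is only a minimal sketch: `regroup_splits` and `to_dataset_dict` are hypothetical helper names (not part of `datasets`), and it assumes each separate load returns a mapping with a single 'train' key, as shown in the output above.

```python
def regroup_splits(loaded):
    """Map each desired split name to the lone 'train' partition of its load result.

    `loaded` is a mapping such as {"train": train_dataset, "val": val_dataset},
    where every value is itself a mapping that has a "train" key.
    """
    return {name: result["train"] for name, result in loaded.items()}


def to_dataset_dict(loaded):
    """Wrap the regrouped splits in a DatasetDict (hypothetical helper)."""
    # Imported lazily so regroup_splits stays usable without the library installed.
    from datasets import DatasetDict

    return DatasetDict(regroup_splits(loaded))
```

With the variables above, this would be called as `to_dataset_dict({"train": train_dataset, "val": val_dataset})`. It works around the symptom rather than explaining why `data_dir=ROOT` yields only one split.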
Question: How can I load the data into a single Dataset with two splits that correspond to the original dataset structure?
Here’s the code to generate the dummy data set:
import os

import numpy as np
import cv2

ROOT = "data"

# Create ROOT/<split>/<class> and fill each directory with 10 random images.
for which in ["train", "val"]:
    for class_name in ["class_1", "class_2"]:
        dir_name = os.path.join(ROOT, which, class_name)
        if not os.path.exists(dir_name):
            os.makedirs(dir_name)
        for i in range(10):
            cv2.imwrite(
                os.path.join(dir_name, f"{i}.png"),
                np.random.random((224, 224))
            )
This is similar to the question "Confusion in splitting dataset (from imagefolder) into train, test and validation", but I already have two separate directories for train and val.
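For reference, one approach that may give both splits in a single call is to pass an explicit split-to-pattern mapping through the `data_files` argument of `load_dataset` instead of relying on `data_dir`. The sketch below is untested against any particular `datasets` version: `split_patterns` and `load_image_splits` are hypothetical helper names, and it is an assumption that a `**` glob per split directory is accepted and that labels are still inferred from the class folders.

```python
import os


def split_patterns(root, splits=("train", "val")):
    """Return {split name: glob pattern} for each split directory found under root."""
    patterns = {}
    for split in splits:
        split_dir = os.path.join(root, split)
        if os.path.isdir(split_dir):
            # "**" is meant to match every file below the split directory.
            patterns[split] = os.path.join(split_dir, "**")
    return patterns


def load_image_splits(root):
    """Load all detected splits in one call (assumes the patterns are accepted)."""
    # Imported lazily so split_patterns can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("imagefolder", data_files=split_patterns(root))
```

Calling `load_image_splits(ROOT)` on the dummy data would pass a mapping with the keys "train" and "val". Whether the result then contains both splits with the expected labels should be checked on the installed version.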