Loading an imagenet-style image dataset with train/val directories

This is my dataset structure:

ROOT
| -- train
| ---- class_1
| ---- class_2
| -- val
| ---- class_1
| ---- class_2

I want to load this into a Dataset. I’m expecting the result to have a train and a val split, but I only get a 'train' split from the following:

from datasets import load_dataset

dataset = load_dataset(
    "imagefolder", 
    data_dir=ROOT
)
>> dataset
DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 20
    })
})

I can read the two directories separately, as follows, but each result then contains a single split named 'train', even for the val data, which is confusing.

train_dataset = load_dataset(
    "imagefolder", 
    data_dir=os.path.join(ROOT, "train")
)

val_dataset = load_dataset(
    "imagefolder", 
    data_dir=os.path.join(ROOT, "val")
)

Question: How can I load the data into a single Dataset with two splits that correspond to the original dataset structure?


Here’s the code to generate the dummy data set:

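import os
import numpy as np
import cv2


ROOT = "data"

for which in ["train", "val"]:
    for class_name in ["class_1", "class_2"]:
        dir_name = os.path.join(ROOT, which, class_name)
        # Create the class directory if it does not exist yet
        os.makedirs(dir_name, exist_ok=True)
        for i in range(10):
            # Random grayscale image, scaled to 0-255 and cast to uint8
            # so that it is written as a standard 8-bit PNG
            image = (np.random.random((224, 224)) * 255).astype(np.uint8)
            cv2.imwrite(os.path.join(dir_name, f"{i}.png"), image)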

This is mostly similar to this question: Confusion in splitting dataset (from imagefolder) into train, test and validation, but I have two separate directories for train and val already.

Hi! Only the names from this list (e.g. "valid") are recognized as the validation split directory, so "val" is not picked up. Perhaps we can add "val" to this list. Would you be interested in submitting a PR? Another option is to specify patterns for each split separately:

dataset = load_dataset("imagefolder", data_files={"train": f"{ROOT}/train/**", "val": f"{ROOT}/val/**"})
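
For a layout with more split directories, the same idea can be generalized by building the data_files mapping from whatever subdirectories exist under the root. The following is only a minimal sketch: build_split_patterns and load_with_splits are hypothetical helper names, not part of the datasets library, and the sketch assumes every immediate subdirectory of the root is a split whose name is a valid split name.

```python
import os


def build_split_patterns(root):
    """Map each immediate subdirectory of `root` to a glob pattern.

    Hypothetical helper, not part of the `datasets` library.
    """
    patterns = {}
    for entry in sorted(os.listdir(root)):
        # Skip plain files; only directories are treated as splits.
        if os.path.isdir(os.path.join(root, entry)):
            # Same pattern shape as in the one-liner above: <root>/<split>/**
            patterns[entry] = f"{root}/{entry}/**"
    return patterns


def load_with_splits(root):
    """Load every subdirectory of `root` as its own split (assumption:
    the directory names are acceptable split names)."""
    # Imported lazily so build_split_patterns works without `datasets`.
    from datasets import load_dataset

    return load_dataset("imagefolder", data_files=build_split_patterns(root))
```

With the dummy data from the question, load_with_splits("data") passes the same two patterns as the explicit call above, so the result should contain a train and a val split.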

Hi, thanks a lot for the quick response.

I can send a PR. I also think it would be better if this were mentioned in the docs, because I’ve spent a lot of time on this trivial thing. Do you want to move this conversation to GitHub?

I can send a PR. I also think it would be better if this were mentioned in the docs, because I’ve spent a lot of time on this trivial thing.

Not a bad idea.

Do you want to move this conversation to GitHub?

Yes, let’s do this.

Here’s the GitHub link
