Loading an imagenet-style image dataset with train/val directories

akt42 · August 12, 2022, 9:12am

This is my dataset structure:

ROOT
| -- train
| ---- class_1
| ---- class_2
| -- val
| ---- class_1
| ---- class_2

I want to load this into a Dataset. I’m expecting the resulting DatasetDict to have both a train and a val split, but I only get a 'train' split from the following:

from datasets import load_dataset

dataset = load_dataset(
    "imagefolder", 
    data_dir=ROOT
)

>>> dataset
DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 20
    })
})

I can read the two directories separately, as follows, but then each resulting object has a single split named 'train', which is confusing.

import os

# Each call returns a DatasetDict whose only split is named 'train'
train_dataset = load_dataset(
    "imagefolder",
    data_dir=os.path.join(ROOT, "train")
)

val_dataset = load_dataset(
    "imagefolder",
    data_dir=os.path.join(ROOT, "val")
)

Question: How can I load the data into a single DatasetDict with two splits that correspond to the original directory structure?

Here’s the code to generate the dummy dataset:

import os
import numpy as np
import cv2


ROOT = "data"

for which in ["train", "val"]:
    for class_name in ["class_1", "class_2"]:
        dir_name = os.path.join(ROOT, which, class_name)
        # Create the class directory if it does not exist yet
        os.makedirs(dir_name, exist_ok=True)
        for i in range(10):
            # Scale random floats in [0, 1) to 8-bit grayscale pixel values before writing the PNG
            image = (np.random.random((224, 224)) * 255).astype(np.uint8)
            cv2.imwrite(os.path.join(dir_name, f"{i}.png"), image)

This is mostly similar to this question: Confusion in splitting dataset (from imagefolder) into train, test and validation, but I have two separate directories for train and val already.

mariosasko · August 12, 2022, 12:02pm

Hi! Only the names from this list (e.g. "valid") are allowed for the validation split directory. Perhaps we can add "val" to this list. Would you be interested in submitting a PR? Another option is to specify patterns for each split separately:

dataset = load_dataset("imagefolder", data_files={"train": f"{ROOT}/train/**", "val": f"{ROOT}/val/**"})
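The per-split patterns above can also be built from whichever split directories actually exist. The following is only a minimal sketch: `build_data_files`, `load_image_splits`, and the `split_dirs` parameter are hypothetical names introduced here, not part of the `datasets` API, and it assumes the ROOT/train and ROOT/val layout from the question.

```python
from pathlib import Path


def build_data_files(root, split_dirs=("train", "val")):
    """Map each existing split directory under `root` to a glob pattern.

    `build_data_files` and `split_dirs` are hypothetical names used for this sketch.
    """
    root = Path(root)
    data_files = {}
    for split in split_dirs:
        split_path = root / split
        if split_path.is_dir():
            # "**" matches everything below the split directory, as in the reply above
            data_files[split] = f"{split_path.as_posix()}/**"
    return data_files


def load_image_splits(root):
    """Load every detected split into one DatasetDict (needs the `datasets` package)."""
    # Imported lazily so build_data_files can be used without `datasets` installed
    from datasets import load_dataset

    return load_dataset("imagefolder", data_files=build_data_files(root))
```

With the dummy data from the question, `load_image_splits("data")` should give a DatasetDict with a 'train' and a 'val' split, since the keys of `data_files` become the split names.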

akt42 · August 12, 2022, 12:49pm

Hi, thanks a lot for the quick response.

I can send a PR. I also think it would be better if this were mentioned in the docs, because I’ve spent a lot of time on this trivial thing. Do you want to move this conversation to GitHub?

mariosasko · August 12, 2022, 12:54pm

I can send a PR. I also think it would be better if this were mentioned in the docs, because I’ve spent a lot of time on this trivial thing.

Not a bad idea.

Do you want to move this conversation to GitHub?

Yes, let’s do this.

akt42 · August 12, 2022, 1:26pm

Here’s the GitHub link
