Undesired behavior when using load_dataset

baldesco · March 30, 2023, 5:31pm

Hello,

I have a folder with 6,000 images and a metadata.csv file that contains two columns: file_name and label.

I am trying to create a dataset using this command:

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="data/food_imgs", split='train')

But, as a result, I only get one row:

Dataset({
    features: ['image', 'label'],
    num_rows: 1
})

How can I fix this? I should have 6,000 rows in the dataset.

akuysal · March 30, 2023, 11:06pm

I ran your code on an image folder on my computer. The folder has two subfolders, ants and bees, with 245 images in total across the two categories. After running your script I get the output below, so it seems to be working as expected.

Dataset({
    features: ['image', 'label'],
    num_rows: 245
})

When I add the two lines below to the code, I get the following output, which shows that two different labels are attached to the images.

print(dataset[0])
print(dataset[150])

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=768x512 at 0x1F0AC70A3D0>, 'label': 0}
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x375 at 0x1F0AC70A1C0>, 'label': 1}

baldesco · March 30, 2023, 11:32pm

What version of datasets do you have?

akuysal · April 1, 2023, 8:39pm

My version of datasets is 2.9.0. I installed it using the PyCharm package manager.

baldesco · April 17, 2023, 1:25pm

I finally found the issue. One of the images had the word “training” in its file name. It seems that, in this case, the load_dataset function treats “training” as the split name and assumes I only want to load the images with “training” in their names.

That is why I was only getting one image back. This should be better documented.
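
For anyone hitting the same problem, here is a minimal sketch for spotting file names that might be mistaken for split names. It does not use the datasets library. The keyword list and the token-matching rule are my assumptions about how split inference behaves, not the library's actual implementation, and find_split_like_names is a hypothetical helper name.

```python
import re
from pathlib import PurePosixPath

# Assumed keyword list: the real one lives inside the datasets library
# and may differ between versions.
SPLIT_KEYWORDS = (
    "train", "training",
    "test", "testing", "eval", "evaluation",
    "validation", "valid", "val", "dev",
)


def find_split_like_names(file_names, keywords=SPLIT_KEYWORDS):
    """Return the file names whose stem contains a split keyword as its own token.

    This is an approximation: the stem is lowercased and split on anything
    that is not a letter (digits, "_", "-", ".", spaces), then each token
    is compared against the keyword list.
    """
    flagged = []
    for name in file_names:
        stem = PurePosixPath(name).stem.lower()
        tokens = [tok for tok in re.split(r"[^a-z]+", stem) if tok]
        if any(tok in keywords for tok in tokens):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Example file names (made up for illustration).
    names = ["pizza_001.jpg", "training_meal.jpg", "strain.jpg"]
    print(find_split_like_names(names))
```

Renaming the flagged files is the fix that matches what I found above. Another option, which I have not verified in this thread, may be to name the split yourself with the data_files argument, for example load_dataset("imagefolder", data_files={"train": "data/food_imgs/**"}), so that nothing has to be inferred from file names.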
