Undesired behavior when using load_dataset


I have a folder with 6,000 images and a metadata.csv file that contains two columns: file_name and label.

I am trying to create a dataset using this command:

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="data/food_imgs", split='train')

But, as a result, I only get one row:

    features: ['image', 'label'],
    num_rows: 1

How can I fix this? I should have 6,000 rows in the dataset.

I ran your code against an image folder on my computer. The folder has two subfolders, ants and bees, which together hold 245 images across the two categories. Running your script gives the output below, so it seems to be working as expected.

features: ['image', 'label'],
num_rows: 245

When I add two more lines to the code to print individual samples, I get the following output, which shows that two different labels are attached to the images.


{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=768x512 at 0x1F0AC70A3D0>, 'label': 0}
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x375 at 0x1F0AC70A1C0>, 'label': 1}

What version of datasets do you have?

It seems the version of datasets is 2.9.0. I installed it using the PyCharm package manager.

I finally found the issue. One of the images had the word "training" in its file name. In that case, load_dataset apparently treats the word as a split name and assumes the "train" split consists only of files with "training" in their names.

That is why I was getting only one image back. This should be better documented.
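For anyone who runs into the same thing, here is a minimal sketch of how the offending files could be found before calling load_dataset. It uses only the standard library. The keyword list and the matching rule are assumptions that approximate how split names seem to be inferred from file names; the exact rules in the library may differ between versions. The helper names find_split_like_names and scan_folder are made up for this example.

```python
import re
from pathlib import Path

# ASSUMPTION: approximate list of words that may be read as split names.
# The list used by the installed datasets version may differ.
SPLIT_KEYWORDS = (
    "train", "training",
    "test", "testing", "eval", "evaluation",
    "validation", "valid", "dev", "val",
)

# A keyword counts only when it is not glued to other letters or digits,
# so "pizza_training_01" matches but "constraint" does not.
# Matching is case-insensitive, which is deliberately broad.
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])(" + "|".join(SPLIT_KEYWORDS) + r")(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def find_split_like_names(paths):
    """Return (path, keyword) pairs for file names containing a split keyword."""
    hits = []
    for p in paths:
        match = _PATTERN.search(Path(p).stem)
        if match:
            hits.append((str(p), match.group(1).lower()))
    return hits


def scan_folder(data_dir):
    """Check every file under data_dir, including subfolders."""
    files = (p for p in Path(data_dir).rglob("*") if p.is_file())
    return find_split_like_names(files)


if __name__ == "__main__":
    # Folder path taken from the question above.
    for path, keyword in scan_folder("data/food_imgs"):
        print(f"{path}: contains split-like keyword '{keyword}'")
```

Renaming any file this flags (and updating its file_name entry in metadata.csv) should be enough to get all rows back in a single split.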