Image classification using AutoTrain: dataset preparation, authentication and column mapping

I’ve previously successfully tested using AutoTrain for image classification problems but I am currently running into issues uploading data.

Authentication: most of the datasets I want to test have been uploaded to the hub but cannot be shared publicly. When I try and select these datasets, I get an error.

I didn’t see an obvious way to pass in an auth token. Is this possible?

Column selection: When adding a public dataset (in this case, biglam/encyclopaedia_britannica_illustrated), the mapping options for image/labels differ from the underlying dataset. The original dataset exposes an ‘image’ and a ‘label’ column (plus some other metadata columns). When loaded in AutoTrain, image seems to be expanded to image.src, image.height and image.width.

I’m unsure whether these are internal attributes or are meant to be publicly exposed. Choosing image.src, which I assumed would be the correct image column, in the mapping results in an error when loading.

Under the training tab an error is triggered when format_source is run:

Error type: InvalidColMappingError
Details: Column mapping {'label': 'target', 'image.src': 'image'} is invalid for data with columns ['image', 'label', 'id', 'meta'].
Column 'image.src' not found in data.

I assume this is because the internal loader is looking for image.src in the dataset and not finding it.
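
To make that assumption concrete, here is a minimal sketch of the kind of check that would produce this error. It is not AutoTrain's actual code: validate_column_mapping and this InvalidColMappingError class are hypothetical stand-ins, and the column list is copied from the error message above.

```python
class InvalidColMappingError(ValueError):
    """Hypothetical stand-in for the error AutoTrain reports."""


def validate_column_mapping(mapping, columns):
    """Raise if any source column in `mapping` is missing from `columns`."""
    missing = [src for src in mapping if src not in columns]
    if missing:
        raise InvalidColMappingError(
            f"Column mapping {mapping} is invalid for data with columns {columns}. "
            + " ".join(f"Column {name!r} not found in data." for name in missing)
        )
    return True


# Column names as listed in the error message.
columns = ["image", "label", "id", "meta"]

# The mapping offered by the UI refers to 'image.src', which is not a real column.
try:
    validate_column_mapping({"label": "target", "image.src": "image"}, columns)
except InvalidColMappingError as err:
    print(err)

# Mapping the actual 'image' column passes the check.
validate_column_mapping({"label": "target", "image": "image"}, columns)
```

Under this reading, the mapping UI lists the expanded preview fields (image.src and so on), while the check runs against the dataset's real column names.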

Apologies if this has been addressed before; I dug around for other issues but didn’t see anything related.

Tagging @abhishek, who might be the best person to address this.

See also the related discussion on Discord.


Hi @davanstrien & @osanseviero,
Thank you for reporting these issues!

Re: Authentication

Training on gated or private datasets is not supported in AutoTrain yet.

Re: Column selection

There is indeed an issue with our integration with the datasets server, which AutoTrain uses to fetch the dataset’s first rows. We are currently working on fixing this. I’ll let you know when this is fixed.


Thanks, @sbrandeis!

Thanks for confirming – some of the datasets that I’m working on are small enough to upload via the AutoTrain interface (which allows you to keep them private), so I can work around this.

Thanks for this. The last time I used AutoTrain for image classification, it worked very well, so looking forward to this fix :hugs:

Hello @davanstrien

Better late than never - we shipped a fix for the issue with image datasets today!
You should be able to train on image datasets now.

Let us know if you encounter any other issues!


Hi, I ran into a similar problem in Text Classification (binary). When I select column names, all the choices seem to come from the second row. Could you please fix it?


Can confirm this is happening with Binary Text-Classification.

Thank you for reporting. We are working on a fix.

Hi @CSAle & @VivienTang,
Thanks again for reporting the issue; it should be fixed now.
Apologies for the inconvenience!

Thanks for working on this :slight_smile:

I was successfully able to load the beans dataset for image classification. I am currently trying to upload my own dataset using both the image folder upload and a dataset hosted on the hub; however, both of these seem to get stuck in the processing step for longer than I would expect for the size of these datasets.

It’s possible I’m just being impatient, but it seems like the loading might be stuck in some way.

Update for people who might run across this in the future: the issue, in this case, was that the images in my dataset were reasonably large. Resizing them to 500px with the ImageMagick mogrify command fixed this, i.e. something like:

mogrify -resize 500x500^ top_level_data_folder/**/*.jpg

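This will resize all .jpg images under top_level_data_folder. Note that in bash, the ** pattern only matches nested folders when the globstar option is enabled (shopt -s globstar).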

NOTE: the mogrify command resizes in place (i.e. it overwrites the existing image) by default. If you don’t want this, look into the -path option.
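
For anyone who would rather keep the originals and stay in Python, a rough equivalent could look like the sketch below. It assumes the Pillow library is installed; resize_tree and the folder names are hypothetical. It also differs slightly from the command above: 500x500^ resizes so the image fills the 500px box, whereas thumbnail() shrinks the image to fit inside it.

```python
from pathlib import Path

from PIL import Image  # assumption: Pillow is installed (pip install Pillow)


def resize_tree(src_root, dst_root, max_side=500):
    """Copy every .jpg under src_root into dst_root, shrunk to fit max_side.

    Subfolders (for example one folder per class) are preserved, and the
    original files are left untouched.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    written = []
    # sorted() collects the file list up front, before anything is written.
    for src in sorted(src_root.rglob("*.jpg")):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        with Image.open(src) as img:
            # thumbnail() keeps the aspect ratio and never enlarges the image.
            img.thumbnail((max_side, max_side))
            img.save(dst)
        written.append(dst)
    return written


# Example call (hypothetical folder names):
# resize_tree("top_level_data_folder", "resized_data_folder")
```

Writing to a separate output folder keeps the per-class subfolders intact, which matters if the folder names are used as labels.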
