Image classification using AutoTrain: dataset preparation, authentication and column mapping

davanstrien · November 29, 2022, 11:06am

I’ve previously successfully tested using AutoTrain for image classification problems but I am currently running into issues uploading data.

Authentication: most of the datasets I want to test have been uploaded to the hub but cannot be shared publicly. When I try and select these datasets, I get an error.

I didn’t see an obvious way to pass in an auth token. Is this possible?

Column selection: When adding a public dataset (in this case, biglam/encyclopaedia_britannica_illustrated), the mapping options for image/labels differ from the columns in the underlying dataset. The original dataset exposes an ‘image’ and a ‘label’ column (plus some other metadata columns). When loaded in AutoTrain, the image column seems to be expanded to image.src, image.height and image.width.

I’m unsure whether these are internal attributes or are meant to be publicly exposed. Choosing image.src, which I assumed would be the correct image column, in the mapping results in an error when loading.

Under the training tab an error is triggered when format_source is run:

Error type: InvalidColMappingError
Details: Column mapping {'label': 'target', 'image.src': 'image'} is invalid for data with columns ['image', 'label', 'id', 'meta'].
Column 'image.src' not found in data.

I assume this is because the internal loader is looking for image.src in the dataset and not finding it.
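To make the mismatch concrete, here is a minimal sketch of the kind of check that would produce this error. It is not AutoTrain's actual code; the function name missing_mapped_columns is hypothetical, and the only assumption is that the keys of the mapping are compared against the column names of the loaded data:

```python
def missing_mapped_columns(col_mapping, data_columns):
    """Return the mapped source columns that do not exist in the data.

    Hypothetical helper for illustration only, not part of AutoTrain.
    """
    available = set(data_columns)
    return [source for source in col_mapping if source not in available]


# Values copied from the error message above.
col_mapping = {"label": "target", "image.src": "image"}
data_columns = ["image", "label", "id", "meta"]

print(missing_mapped_columns(col_mapping, data_columns))  # ['image.src']
```

Under that assumption, a mapping keyed on 'image' rather than 'image.src' would pass the check, which is consistent with the UI offering a column name that the loader never sees.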

Apologies if this has been addressed before; I dug around for other issues but didn’t see anything related.

Tagging @abhishek, who might be the best person to address this.

osanseviero · November 30, 2022, 8:35am

There is also a related discussion on Discord.

sbrandeis · November 30, 2022, 11:31am

Hi @davanstrien & @osanseviero,
Thank you for reporting these issues!

Re: Authentication

Training on gated or private datasets is not supported in AutoTrain yet.

Re: Column selection

There is indeed an issue with our integration with the datasets server, which AutoTrain uses to fetch the dataset’s first rows. We are currently working on fixing this. I’ll let you know when this is fixed.

davanstrien · November 30, 2022, 11:46am

Thanks, @sbrandeis!

Thanks for confirming – some of the datasets that I’m working on are small enough to upload via the AutoTrain interface (which allows you to keep them private), so I can work around this.

Thanks for this. The last time I used AutoTrain for image classification, it worked very well, so I’m looking forward to this fix.

sbrandeis · December 7, 2022, 3:24pm

Hello @davanstrien

Better late than never - we shipped a fix for the issue with image datasets today!
You should be able to train on image datasets now.

Let us know if you encounter any other issues!

VivienTang · December 8, 2022, 3:38am

Hi, I ran into a similar problem with Text Classification (binary). When I select column names, it seems that all the choices are taken from the second row. Could you please fix it?

CSAle · December 8, 2022, 6:36am

Can confirm this is happening with Binary Text-Classification.

abhishek · December 8, 2022, 6:54am

Thank you for reporting. We are working on a fix.

sbrandeis · December 8, 2022, 10:28am

Hi @CSAle & @VivienTang,
Thanks again for reporting the issue; it should be fixed now.
Apologies for the inconvenience!

davanstrien · December 9, 2022, 12:57pm

Thanks for working on this.

I was successfully able to load the beans dataset for image classification. I am currently trying to upload my own dataset using both the image folder upload and a dataset hosted on the hub; however, both of these seem to get stuck in the processing step for longer than I would expect for the size of these datasets.

It’s possible I’m just being impatient, but it seems like the loading might be stuck in some way.

davanstrien · January 18, 2023, 6:23pm

Update for people who might run across this in the future: the issue, in this case, was that the images in my dataset were reasonably large. Resizing the images to 500px using the ImageMagick mogrify command fixed this, i.e. something like:

mogrify -resize 500x500^ top_level_data_folder/**/*.jpg

This will resize all .jpg images under top_level_data_folder (the recursive ** glob needs a shell that supports it, e.g. zsh, or bash with globstar enabled).

NOTE: the mogrify command resizes in place (i.e. overwrites the existing image) by default. If you don’t want this, look into the -path option.
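For anyone wondering what the 500x500^ geometry in the command does to image dimensions: the ^ flag treats the given width and height as minimums, so the image is scaled, keeping its aspect ratio, until it fills the 500x500 box. The sketch below works through that arithmetic. The function name fill_size is hypothetical, and the rounding is an assumption that may differ from ImageMagick's by a pixel:

```python
def fill_size(width, height, box_w=500, box_h=500):
    """Estimate the output size for an ImageMagick-style 'WxH^' resize.

    The image is scaled by the larger of the two ratios so that both
    sides end up at least as large as the box. Hypothetical helper;
    rounding here is an assumption, not ImageMagick's exact behaviour.
    """
    scale = max(box_w / width, box_h / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fill_size(4000, 3000))  # (667, 500)
print(fill_size(3000, 4000))  # (500, 667)
print(fill_size(1000, 1000))  # (500, 500)
```

Note that the same arithmetic enlarges images that are smaller than 500px on their shorter side, so it is worth checking the sizes in your dataset before running the command.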
