Creating a vision dataset with images on S3

Hi everyone,

I’m trying to create a :hugs: dataset for an object detection task. The training images are stored on S3, and I would like to eventually use SageMaker and a :hugs: estimator to train the model.

I’m trying to build on the example from @philschmid in Huggingface Sagemaker - Vision Transformer, but with my own dataset and the model from Fine-tuning DETR on a custom dataset for object detection by @nielsr.

If I understand correctly, I need to create a dataset first and then save it to the session bucket on S3, but I’m not sure how to do that when the dataset is too big to pull locally first.

I have found the load_dataset function with the ‘imagefolder’ option, which seems to do what I want for local image files but doesn’t seem to support S3 filepaths. I have also found the load_from_disk function, which can load :hugs: datasets from S3 but doesn’t have an imagefolder option.
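For reference, a minimal sketch of the difference between the two functions (bucket and directory names are placeholders):

```python
from datasets import load_dataset, load_from_disk

# 'imagefolder' builds a dataset from raw image files, but only from
# the local filesystem:
local_ds = load_dataset("imagefolder", data_dir="path/to/local/images")

# load_from_disk reads a dataset previously written with save_to_disk;
# recent versions of datasets can read from S3 via s3fs, but there is
# no 'imagefolder'-style builder for raw image files on S3.
s3_ds = load_from_disk(
    "s3://my-bucket/my-saved-dataset",
    storage_options={"anon": False},  # assumes datasets>=2.8 and s3fs installed
)
```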

What is the best way to prepare my data in this case?

Thanks for the help!

Hello @cotrane,

you don’t need to save the dataset to the session bucket or do the pre-processing in advance. You can do everything inside SageMaker.

Meaning you can either first download your dataset from S3 to local storage and then use load_dataset, or just provide the S3 URI of your bucket when calling .fit(), e.g. huggingface_estimator.fit({'train': 's3://yourbucket/train'}). Those S3 URIs don’t need to point to the session bucket; they can be any bucket.
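A minimal sketch of that second option, following the estimator setup from the Vision Transformer example (entry point, versions, instance type, and bucket name are placeholders to adjust):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # your SageMaker execution role

# Hypothetical estimator config; adapt versions/instance to your setup.
huggingface_estimator = HuggingFace(
    entry_point="train.py",      # your training script
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
)

# Pass an S3 URI per input channel; it can live in any bucket.
huggingface_estimator.fit({"train": "s3://yourbucket/images/train"})
```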

SageMaker will then download it for you and save it to /opt/ml/input/data/train.
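Inside the training script you can then build the dataset from that local copy, e.g. (a sketch, assuming your images follow the imagefolder layout):

```python
import os
from datasets import load_dataset

# SageMaker exposes each input channel as SM_CHANNEL_<NAME>,
# pointing at the local copy it downloaded from S3.
train_dir = os.environ["SM_CHANNEL_TRAIN"]  # /opt/ml/input/data/train
dataset = load_dataset("imagefolder", data_dir=train_dir)
```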

Ah ok - that makes sense! Thanks for the help. I’ll try that today!