Creating a vision dataset with images on S3

Hi everyone,

I’m trying to create a :hugs: dataset for an object detection task. The training images are stored on S3, and I would like to eventually use SageMaker and a :hugs: estimator to train the model.

I’m trying to build on the example from @philschmid in Huggingface Sagemaker - Vision Transformer, but with my own dataset and the model from Fine-tuning DETR on a custom dataset for object detection by @nielsr.

If I understand correctly, I need to create a dataset first and then save it to the session bucket on S3, but I’m not entirely sure how to do that with a dataset that is too big to pull locally first.

I have found the load_dataset function with the ‘imagefolder’ option, which seems to do what I want for local image files but doesn’t seem to support file paths on S3. I have also found the load_from_disk function, which loads :hugs: datasets from S3 but doesn’t have an imagefolder option.
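For reference, these are roughly the two calls I mean (a sketch; the bucket name and paths are just placeholders, and I’m assuming the S3FileSystem helper from datasets):

from datasets import load_dataset, load_from_disk
from datasets.filesystems import S3FileSystem

# Builds a dataset from a local image folder, but doesn't accept S3 paths:
dataset = load_dataset("imagefolder", data_dir="./images/train")

# Loads an already-saved Hugging Face dataset from S3,
# but has no imagefolder option:
s3 = S3FileSystem()
dataset = load_from_disk("s3://my-bucket/my-dataset", fs=s3)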

What is the best way to prepare my data in this case?

Thanks for the help!

Hello @cotrane,

you don’t need to save the dataset to the session bucket or do the pre-processing in advance. You can do everything inside SageMaker.

Meaning you can either first download your dataset from S3 to local storage and then use load_dataset, or just provide the S3 URI of your bucket when calling HuggingFace.fit(). Those S3 URIs don’t need to point to the session bucket; they can point to any bucket.
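For the first option, a minimal sketch of what you could run inside the training script might look like this (bucket name and prefix are placeholders):

import os
import boto3
from datasets import load_dataset

bucket, prefix = "your-bucket", "images/train/"  # placeholders
s3 = boto3.client("s3")

# Download every object under the prefix, preserving the folder layout
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):  # skip folder markers
            continue
        target = os.path.join("train", os.path.relpath(obj["Key"], prefix))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        s3.download_file(bucket, obj["Key"], target)

# imagefolder builds the dataset from the local directory structure
dataset = load_dataset("imagefolder", data_dir="train")

For the second option, you just pass the S3 URI directly: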

huggingface_estimator.fit({'train': 's3://your-bucket/images/train'})
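For context, the estimator behind that call could be set up like this (a sketch; the entry point, instance type, and versions here are assumptions — pick a combination supported by the SageMaker Hugging Face containers):

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

huggingface_estimator = HuggingFace(
    entry_point="train.py",        # your training script (assumption)
    role=role,
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)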

SageMaker will then download the data for you and save it to /opt/ml/input/data/train
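Inside the training script you can read that path back via the SM_CHANNEL_TRAIN environment variable that SageMaker sets for the channel, e.g.:

import os
from datasets import load_dataset

# SageMaker exposes each channel as /opt/ml/input/data/<channel_name>
train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
dataset = load_dataset("imagefolder", data_dir=train_dir)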

Ah ok - that makes sense! Thanks for the help. I’ll try that today!