Creating a vision dataset with images on S3

Hi everyone,

I’m trying to create a :hugs: dataset for an object detection task. The training images are stored on S3 and I would eventually like to use SageMaker and a :hugs: estimator to train the model.

I’m trying to build on the example from @philschmid in Huggingface Sagemaker - Vision Transformer but with my own dataset and the model from Fine-tuning DETR on a custom dataset for object detection by @nielsr .

If I understand correctly, I need to create a dataset first and then save it to the session bucket on S3, but I’m not entirely sure how to do that with a dataset that is too big to pull locally first in order to create it.

I have found the load_dataset function with the ‘imagefolder’ option, which seems to do what I want for local image files but doesn’t seem to support file paths on S3. I have also found the load_from_disk function, which can load :hugs: datasets from S3 but doesn’t have an imagefolder option.
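For context, this is roughly the local layout that the imagefolder loader expects for detection data (a sketch: the annotation fields below are placeholders, and the load_dataset call is commented out since it needs real image files in place):

```python
import json
import os
import tempfile

# Sketch: imagefolder reads images from a directory tree and, optionally,
# per-image annotations from a metadata.jsonl that has a "file_name" column.
data_dir = tempfile.mkdtemp()
train_dir = os.path.join(data_dir, "train")
os.makedirs(train_dir)

rows = [
    # placeholder annotation schema for object detection
    {"file_name": "img_001.jpg",
     "objects": {"bbox": [[10, 10, 50, 50]], "categories": [0]}},
]
with open(os.path.join(train_dir, "metadata.jsonl"), "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# from datasets import load_dataset
# dataset = load_dataset("imagefolder", data_dir=data_dir)  # local paths only
```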

What is the best way to prepare my data in this case?

Thanks for the help!

Hello @cotrane,

you don’t need to save the dataset to the session bucket or do the pre-processing in advance. You can do everything inside SageMaker.

Meaning you can either first download your dataset from S3 to local disk and then use load_dataset, or just provide the S3 URI of your bucket when calling .fit(), e.g. .fit({'train': 'yourbucket'}). Those S3 URIs don’t need to be on the session bucket; they can be on any bucket.

SageMaker will then download it for you and save it to /opt/ml/input/train
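As a rough sketch (the bucket name and prefixes below are placeholders, and huggingface_estimator stands for an already-configured Hugging Face estimator), the input channels can be passed to fit like this:

```python
# Sketch: each key in this dict becomes an input channel; SageMaker downloads
# the S3 objects into the training container before the job starts.
inputs = {
    "train": "s3://my-bucket/datasets/train",  # placeholder URIs; any bucket works
    "test": "s3://my-bucket/datasets/test",
}

# huggingface_estimator.fit(inputs)  # needs an AWS session, so commented here
# Inside the container, each channel's local path is exported as
# SM_CHANNEL_<NAME>, e.g. SM_CHANNEL_TRAIN for the "train" channel.
```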

Ah ok - that makes sense! Thanks for the help. I’ll try that today!

I do have a similar question @philschmid.

Basically I have the same problem setup as cotrane, but I want to use the FastFile input mode because of the size of my dataset. As I understand it, FastFile streams the data from S3 instead of downloading it all at once. Is there any way I can make that work together with the HF estimator/dataset approach?

@johko yes you can. The Hugging Face estimator and DLC support all known SageMaker features. Meaning you can use the File input mode for your training. Documentation can be found here: Access Training Data - Amazon SageMaker
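For reference, enabling FastFile mode looks roughly like this (a sketch, assuming an existing estimator named huggingface_estimator; the bucket path is a placeholder, and the SDK calls are shown as comments since they need an AWS session):

```python
# Sketch (untested): stream the "train" channel with FastFile mode instead of
# copying it to the instance at startup.
#
# from sagemaker.inputs import TrainingInput
#
# train_input = TrainingInput(
#     s3_data="s3://my-bucket/datasets/train",  # placeholder bucket/prefix
#     input_mode="FastFile",                    # stream objects on demand
# )
# huggingface_estimator.fit({"train": train_input})

# The same channel configuration expressed as plain data, for illustration:
train_channel = {
    "s3_data": "s3://my-bucket/datasets/train",
    "input_mode": "FastFile",
}
```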

Thank you.
I thought FastFile mode was different from File mode in terms of where and how the input data is stored, but the figure on the page you linked makes it clearer for me :+1:


Just a little addition on the path of the data: at least for me the path was /opt/ml/input/data/train, so with an additional data folder in between.

But I guess the safest way is to use the environment variables like SM_CHANNEL_TRAIN to get the correct path
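A tiny helper along those lines (the function name is my own invention; SageMaker exports SM_CHANNEL_&lt;NAME&gt; for each channel passed to fit):

```python
import os

def channel_path(name: str, default_root: str = "/opt/ml/input/data") -> str:
    """Return the local path of a SageMaker input channel.

    SageMaker sets SM_CHANNEL_<NAME> for each channel; fall back to the
    conventional /opt/ml/input/data/<name> path when the variable is unset.
    """
    return os.environ.get(f"SM_CHANNEL_{name.upper()}", f"{default_root}/{name}")
```

That way the training script doesn’t need to hard-code whichever directory layout the container happens to use.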


Hi @philschmid. Apologies if this has already been answered and I just misread it, but is it possible to use load_dataset with imagefolder on S3, just like I would locally?

## load images from s3
import boto3
from datasets import load_dataset
from sagemaker import get_execution_role

role = get_execution_role()
data_location = "s3/path/here"
dataset = load_dataset("imagefolder", data_dir=data_location)

I get the following error in SageMaker Studio despite it working locally:
FileNotFoundError: The directory at "s3/path/here" doesn't contain any data file

@cgpeltier-janes that’s currently not possible, but there are some efforts to improve the integration with cloud storage: Download and prepare as Parquet for cloud storage by lhoestq · Pull Request #4724 · huggingface/datasets · GitHub


@philschmid the PR you linked has been merged, but as far as I can tell it does not add support for imagefolder. This is pretty important functionality, since as it stands I have two options:

  1. Download the entire dataset to SageMaker EFS, preprocess it, and save it to S3. This takes a lot of time and is inconvenient code-wise.
  2. Process the data every time in the HuggingFace Estimator script. This is very costly and time-consuming, since e.g. during hyperparameter optimization I would have to do it every time, in every estimator, and on a GPU instance.

Would making a separate Github issue for this make sense in this case?

I basically want something like this, but without downloading everything from S3 manually.