Directly load models from a remote storage like S3

Hi,
Instead of downloading the transformers model to local storage, could we read and write models directly from S3?
I have confirmed that we can read CSV and TXT files directly from S3, but not models. Is there any solution?

Hey @leifan,

Is your question related to inference or training?
If it is related to training, you could download a model from S3 when starting the training job, the same way you would with data.

huggingface_estimator.fit({
    'train': 's3://<my-bucket>/<prefix>/train',  # containing train files
    'test': 's3://<my-bucket>/<prefix>/test',  # containing test files
    'model':  's3://<another-bucket>/<prefix>/model',  # containing model files (config.json, pytorch_model.bin, etc.)
})

When the training job starts, SageMaker will download all of these files into your container.
The path of the files can be read from the env var SM_CHANNEL_XXXX, e.g. SM_CHANNEL_TRAIN or SM_CHANNEL_MODEL, or accessed directly, e.g. /opt/ml/input/data/train

And then you can load your model in your training script with

AutoModelForXXX.from_pretrained(os.environ.get('SM_CHANNEL_MODEL',None))
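
A minimal sketch of how the training script could resolve the channel directory, assuming the channel is named model as in the fit() call above. The helper channel_dir is hypothetical, not part of SageMaker; it reads SM_CHANNEL_<NAME> and falls back to /opt/ml/input/data/<name>.

```python
import os


def channel_dir(name, environ=None):
    """Return the local directory that the input channel `name` was downloaded to."""
    environ = os.environ if environ is None else environ
    # SageMaker exposes each input channel as SM_CHANNEL_<NAME IN UPPER CASE>.
    env_key = "SM_CHANNEL_" + name.upper()
    # Fall back to the default location of input channels inside the container.
    return environ.get(env_key, f"/opt/ml/input/data/{name}")


model_dir = channel_dir("model")
# model = AutoModelForXXX.from_pretrained(model_dir)  # same placeholder class as above
```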

Hi, @philschmid,

Thank you so much for your reply.
My question is related to the training process. I know Hugging Face has really nice functions for model deployment on SageMaker.

Let me clarify my use-case.

Currently I’m training transformer models (Hugging Face) on SageMaker (AWS). I have to copy the model files from S3 buckets to SageMaker and copy the trained models back to S3 after training. This takes a lot of time, especially when I run many hyperparameter-tuning experiments and the models are large. For example, if I have 10 trials and save 10 checkpoints in each trial, the total size for this single model is 10 × 10 × 4 GB = 400 GB (each model is 3 to 5 GB).

As you can see, it takes a long time to transfer model files back and forth. So I’m wondering whether PyTorch allows reading and writing models directly from S3, so that we can skip the step of storing the files locally. This would greatly reduce the overall time.

I heard from my company's tech support that PyTorch does not support loading models from remote storage and only allows loading models locally. I would like to confirm that, and if it is true, could we build a new feature for remote storage and loading?

Thanks.

Best,

Lei

Yes, there are at least three options on how to improve this.

Option 1: Use EFS/FSx instead of S3

Amazon SageMaker supports using Amazon Elastic File System (EFS) and FSx for Lustre as data sources to use during training.

https://sagemaker.readthedocs.io/en/stable/overview.html?highlight=efs#use-file-systems-as-training-inputs

That way you can continuously save your checkpoints and log files to the file system rather than uploading them to S3 at the end.
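
As a rough sketch of what this could look like with the SageMaker Python SDK: FileSystemInput from sagemaker.inputs describes a file system channel that can be passed to fit(). The file system ID and directory path below are made-up placeholders, and the estimator would also need subnets and security_group_ids that can reach the file system, which are not shown here.

```python
# Arguments for sagemaker.inputs.FileSystemInput; all values are placeholders.
fsx_channel_args = {
    "file_system_id": "fs-0123456789abcdef0",  # hypothetical file system ID
    "file_system_type": "FSxLustre",           # or "EFS"
    "directory_path": "/fsx/models",           # hypothetical path on the file system
    "file_system_access_mode": "rw",           # the default is "ro" (read-only)
}


def make_model_channel(args=fsx_channel_args):
    """Build the channel; the import is local so the snippet loads without the SDK."""
    from sagemaker.inputs import FileSystemInput

    return FileSystemInput(**args)


# huggingface_estimator.fit({"model": make_model_channel()})
```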

Option 2: Use S3 Checkpointing for uploads

After you enable checkpointing, SageMaker saves checkpoints to Amazon S3 and syncs your training job with the checkpoint S3 bucket.

https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html#model-checkpoints-enable

When checkpointing is enabled, SageMaker automatically and asynchronously uploads every artifact written to checkpoint_local_path during training.
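
Because the checkpoint directory is kept in sync with S3, a restarted job can pick up where it left off. Here is a minimal sketch of the resume side in the training script, assuming checkpoints are saved as checkpoint-<step> subfolders (the naming the Trainer uses) under /opt/ml/checkpoints, the default local checkpoint path. latest_checkpoint is a hypothetical helper, not a SageMaker or transformers function.

```python
import os
import re

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(checkpoint_dir="/opt/ml/checkpoints"):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    best_step, best_path = -1, None
    for entry in os.listdir(checkpoint_dir):
        match = _CHECKPOINT_RE.match(entry)
        path = os.path.join(checkpoint_dir, entry)
        # Only directories named checkpoint-<number> count as checkpoints.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Passing None starts training from scratch, so this works on the first run too:
# trainer.train(resume_from_checkpoint=latest_checkpoint())
```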

Option 3: Use the Hugging Face Hub

You can use the push_to_hub method to asynchronously save your artifacts to a repository on the Hugging Face Hub.
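
When training with the Trainer, one way to set this up is through TrainingArguments. The sketch below only collects the relevant keyword arguments; the repository name is a placeholder, the save strategy is an assumption, and authentication (for example a token passed as hub_token) is not shown.

```python
# Keyword arguments for transformers.TrainingArguments; values are placeholders.
hub_args = {
    "output_dir": "/opt/ml/model",    # SageMaker's model output directory
    "save_strategy": "epoch",         # assumption: save one checkpoint per epoch
    "push_to_hub": True,              # let the Trainer upload to the Hub
    "hub_strategy": "every_save",     # push each time a checkpoint is saved
    "hub_model_id": "<user>/<repo>",  # placeholder repository name
}

# In the training script (requires transformers):
# training_args = TrainingArguments(**hub_args)
```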