Directly load models from a remote storage like S3

leifan · September 9, 2021, 6:29pm

Hi,
Instead of downloading the transformers model to local storage, could we read and write models directly from S3?
I have tested that we can read CSV and TXT files directly from S3, but not models. Is there any solution?

philschmid · September 10, 2021, 6:22am

Hey @leifan,

Is your question related to inference or training?
If it is related to training, you could download a model from S3 when starting the training job, the same way you would with data.

huggingface_estimator.fit({
    'train': 's3://<my-bucket>/<prefix>/train',  # containing train files
    'test': 's3://<my-bucket>/<prefix>/test',  # containing test files
    'model':  's3://<another-bucket>/<prefix>/model',  # containing model files (config.json, pytorch_model.bin, etc.)
})

When the training job starts, SageMaker downloads all of these files into your container.
The path of the files can be read from the env var SM_CHANNEL_XXXX, e.g. SM_CHANNEL_TRAIN or SM_CHANNEL_MODEL, or used directly, e.g. /opt/ml/input/data/train

And then you can load your model in your training script with

AutoModelForXXX.from_pretrained(os.environ.get('SM_CHANNEL_MODEL',None))
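A minimal sketch of that lookup, assuming the script may also be run outside SageMaker where the env var is not set. The helper name resolve_channel_dir is hypothetical, and the fallback assumes SageMaker's usual /opt/ml/input/data/<channel> layout.

```python
import os


def resolve_channel_dir(channel: str, default_root: str = "/opt/ml/input/data") -> str:
    """Return the local directory that holds the files of an input channel.

    Prefers the SM_CHANNEL_<NAME> env var set by SageMaker and falls back to
    the conventional path (an assumption) when the variable is missing.
    """
    env_var = f"SM_CHANNEL_{channel.upper()}"
    return os.environ.get(env_var, f"{default_root}/{channel}")


# Usage inside train.py (requires the transformers package):
#   from transformers import AutoModelForSequenceClassification
#   model = AutoModelForSequenceClassification.from_pretrained(resolve_channel_dir("model"))
```

The channel name matches the key passed to fit(), so 'model' maps to SM_CHANNEL_MODEL.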

leifan · September 10, 2021, 4:31pm

Hi, @philschmid,

Thank you so much for your reply.
My question is related to the training process. I know huggingface has really nice functions for model deployment on SageMaker.

Let me clarify my use-case.

Currently I’m training transformer models (Hugging Face) on SageMaker (AWS). I have to copy the model files from S3 buckets to SageMaker and copy the trained models back to S3 after training. This takes a lot of time, especially when I run many hyperparameter-tuning experiments and the models are large. For example, if I have 10 trials and save 10 checkpoints in each trial, the total size for a single model is 10 × 10 × 4 GB = 400 GB (each model is 3 to 5 GB).
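The estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic. The function name is hypothetical, and it assumes every checkpoint of every trial is kept.

```python
def total_checkpoint_gb(trials: int, checkpoints_per_trial: int, gb_per_checkpoint: float) -> float:
    """Total volume written if every checkpoint of every trial is kept."""
    return trials * checkpoints_per_trial * gb_per_checkpoint


# 10 trials x 10 checkpoints, at the low, middle, and high end of the 3 to 5 GB range
for size_gb in (3, 4, 5):
    print(size_gb, "GB per checkpoint ->", total_checkpoint_gb(10, 10, size_gb), "GB total")
```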

As you can see, it takes a long time to transfer model files back and forth. So I’m wondering whether PyTorch allows reading and writing models directly from S3, so that we can skip the step of storing the files locally. This would greatly reduce the overall time.

I heard from my company's tech support that PyTorch does not support loading models from remote storage and only allows loading models locally. I would like to confirm that. If it is true, could we build a new feature for storing to and loading from remote storage?

Thanks.

Best,

Lei

philschmid · September 13, 2021, 7:46am

Yes, there are at least three options on how to improve this.

Option 1: Use EFS/FSx instead of S3

Amazon SageMaker supports using Amazon Elastic File System (EFS) and FSx for Lustre as data sources to use during training.

https://sagemaker.readthedocs.io/en/stable/overview.html?highlight=efs#use-file-systems-as-training-inputs

That way you can continuously save your checkpoints and log files to the file system rather than uploading them to S3 at the end.

Option 2: Use S3 Checkpointing for uploads

After you enable checkpointing, SageMaker saves checkpoints to Amazon S3 and syncs your training job with the checkpoint S3 bucket.

When checkpointing is enabled, SageMaker automatically and asynchronously uploads every artifact written to checkpoint_local_path during training.
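On the training-script side, a rough sketch of how a restarted job might pick up where it left off. It assumes the default checkpoint_local_path of /opt/ml/checkpoints and the checkpoint-<step> folder naming used by the transformers Trainer. The helper latest_checkpoint is hypothetical.

```python
import os
import re
from typing import Optional

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # assumed default checkpoint_local_path
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")  # assumed Trainer folder naming


def latest_checkpoint(root: str = CHECKPOINT_DIR) -> Optional[str]:
    """Return the checkpoint folder with the highest step number, or None."""
    if not os.path.isdir(root):
        return None
    found = []
    for name in os.listdir(root):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(root, name)):
            # Compare by integer step so checkpoint-1000 beats checkpoint-500
            found.append((int(match.group(1)), name))
    if not found:
        return None
    return os.path.join(root, max(found)[1])
```

In train.py this could be combined with TrainingArguments(output_dir=CHECKPOINT_DIR, ...) and trainer.train(resume_from_checkpoint=latest_checkpoint()), so that new checkpoints land in the synced folder and an empty folder simply starts training from scratch.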

Option 3: Use the Hugging Face Hub

You can use the push_to_hub method to save your artifacts asynchronously to a repository on the Hugging Face Hub.
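With the Trainer API, one way to set this up is through the hub-related TrainingArguments. The sketch below only builds the keyword arguments, so it runs without transformers installed. The helper name and the repository id are hypothetical, and hub_strategy assumes a reasonably recent transformers release.

```python
from typing import Any, Dict, Optional


def hub_training_kwargs(repo_id: str, token: Optional[str] = None) -> Dict[str, Any]:
    """Build TrainingArguments keyword arguments that push every saved checkpoint."""
    kwargs: Dict[str, Any] = {
        "push_to_hub": True,           # enable uploads to the Hub
        "hub_model_id": repo_id,       # target repository, e.g. "<user>/<repo>"
        "hub_strategy": "every_save",  # push each time a checkpoint is saved
    }
    if token is not None:
        kwargs["hub_token"] = token
    return kwargs


# Usage inside train.py (requires the transformers package):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="/opt/ml/model", **hub_training_kwargs("my-user/my-model"))
```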

alvations · November 18, 2022, 3:13pm

Is there an example of how to use checkpointing with the Hugging Face estimator?

Would it look something like:

huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='./scripts',
        instance_type='ml.p3.2xlarge',
        checkpoint_s3_uri="s3://mybucket/",
        instance_count=1,
        role=role,
        transformers_version='4.4',
        pytorch_version='1.6',
        py_version='py36',
        hyperparameters = hyperparameters
)

marshmellow77 · November 18, 2022, 4:20pm

Yes, that’s right. The HF Estimator inherits from the Estimator class, so the usage would be the same.
