SageMaker Endpoint Not Using GPU for PygmalionAI


I created a SageMaker endpoint for Pygmalion AI by uploading a .tar.gz archive which contains all of the files in the repo, plus a new folder called “Code” which contains a requirements.txt file with the following.


I then run the following to create the inference endpoint:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

MODEL_LOCATION = 's3:{file_location}/pygmalion-6b.tar.gz'

hub = {
    'HF_TASK': 'text-generation'  # (hub contents were truncated in the original post)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=MODEL_LOCATION,
    env=hub,
    role=role,
    transformers_version='4.26',  # (version arguments truncated in the original post)
    pytorch_version='1.13',
    py_version='py39',
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1, # number of instances
    instance_type='ml.g4dn.xlarge' # ec2 instance type
)

msg = {
    "inputs": "Hey, what is your favorite color?",
}

response = predictor.predict(msg)
This pipeline works fine when using the Pygmalion 350m model, but times out on execution for the 6b model. The instance type I selected has a T4 GPU which should be powerful enough to run this model, but I always time out. When I look inside the endpoint monitor that I created, I see a spike in CPU usage but GPU usage stays at 0 for all executions.
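(For reference, one way to confirm where the model actually lands is to override `model_fn` in a custom code/inference.py — that hook name comes from the Hugging Face SageMaker inference toolkit. The `pick_device` helper below is a hypothetical sketch, not from the original post:)

```python
import logging

def pick_device():
    # Prefer the GPU when torch can see one; fall back to CPU otherwise.
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda:0"
    except ImportError:
        pass
    return "cpu"

def model_fn(model_dir):
    # model_fn is the load hook used by the Hugging Face inference toolkit.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    device = pick_device()
    logging.warning("Loading model from %s onto %s", model_dir, device)  # visible in CloudWatch
    model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto").to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model, tokenizer, device
```

If CloudWatch shows `cpu` here on a g4dn instance, torch inside the container cannot see CUDA, which would match the flat GPU metric.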

Is there something that I am doing wrong here? I saw this thread, which was resolved by using a newer version of transformers. I have tried multiple versions of transformers, both in my requirements.txt file and via the transformers_version argument of the HuggingFaceModel class, and yet I cannot get SageMaker to utilize the GPU on the instance.

Any help is greatly appreciated. Thank You!

Any ideas?

Can you share the logs? It might be that the model is not loaded correctly or runs out of memory, and then requests always time out.
Here is an example on how to deploy GPT-J on SageMaker → Deploy GPT-J 6B for inference using Hugging Face Transformers and Amazon SageMaker

Hi Phil,

Thank you for the response! Here are the logs. Running out of memory was also my suspicion, given the error “Load model failed: model, error: Worker died”, but I was surprised to see it, since the recommended amount of VRAM from the Pygmalion devs in their Cloud Deployment quickstart guide is 16 GB.

Thanks a lot for sending over that guide for deploying GPT-J! That will be extremely helpful, and I am going through the process now. By default, I believe Pygmalion is float16 (since that’s what is listed in its config.json), but the point about loading the model with pickle saving 12x the loading time seems extremely useful to know, so I will try setting that up now. Quick question on that point: does the SageMaker 60s timeout include the loading time on the first request? In other words, could inference take, say, 5s but the first load take 70s, so the endpoint times out on load even though inference itself is fast enough — and after that initial load, there would be no problems?

Thanks Again!

Yes, it seems that your model dies on loading:

2023-04-25T16:52:06,760 [WARN ] W-9000-model - Load model failed: model, error: Worker died.

Hi @philschmid,

I tried the method from your GPT-J guide and the model still times out, with the worker dying. Do you know why that is? The Pygmalion model is based on GPT-J, so I can’t see why the GPT-J model works fine but the Pygmalion model fails. Additionally, 16 GB of VRAM should be more than sufficient for this model, so I can’t figure out why it is dying on load. Any advice would be appreciated.
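(For what it’s worth, a back-of-the-envelope check of the weight memory — assuming 6B parameters and counting only the weights, not activations or the CUDA context — suggests fp16 fits a 16 GB T4 but a transient fp32 load would not:)

```python
PARAMS = 6_000_000_000  # approximate parameter count of pygmalion-6b

def weight_gb(params, bytes_per_param):
    # Raw memory needed just to hold the weights, in decimal GB.
    return params * bytes_per_param / 1e9

fp32_gb = weight_gb(PARAMS, 4)  # full precision
fp16_gb = weight_gb(PARAMS, 2)  # half precision
print(fp32_gb, fp16_gb)  # 24.0 12.0
```

So if the checkpoint is ever materialized in fp32 during loading — before any .half() cast — it needs ~24 GB and would die on a 16 GB card, even though the fp16 weights only need ~12 GB.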

Here is the Python script that I used for setting up the model, adapted from the GPT-J script. I load pygmalion-6b and save it with model.eval().half() to get float16 precision (I also loaded it back separately and verified that it is float16). Then I load the tokenizer, zip everything as required, manually upload it to S3, and run the SageMaker Studio code as shown.
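(The script itself didn’t survive in the thread, so the following is only a sketch of the steps described above; the model id and output directory are assumptions:)

```python
def to_half_for_inference(model):
    # Matches the post: eval mode, then cast the weights to fp16.
    return model.eval().half()

def export_pygmalion_fp16(model_id="PygmalionAI/pygmalion-6b", out_dir="model"):
    # Heavy step -- downloads the 6B checkpoint; requires transformers + torch.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained(model_id)
    to_half_for_inference(model).save_pretrained(out_dir)   # writes config.json + weights
    AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)  # tokenizer files alongside
```

After export_pygmalion_fp16() finishes, the contents of out_dir (plus the folder holding requirements.txt) are what get archived into the model .tar.gz for S3.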

Thank You!