I believe that you have to provide your own inference script if you want to leverage the GPU. This inference script needs to check if a GPU is available, i.e. it needs to contain a line like this:
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```
Once you have detected the GPU, you can pass it to the Pipeline API via the `device` parameter.
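For example, a minimal sketch of such an inference script (using the same `deepset/roberta-base-squad2` model as in the snippet below; note that `pipeline()` takes a device *index* rather than a `torch.device` object):

```python
import torch
from transformers import pipeline

# pipeline() expects an int: 0 selects the first GPU, -1 (the default) runs on CPU
device = 0 if torch.cuda.is_available() else -1

qa = pipeline(
    'question-answering',
    model='deepset/roberta-base-squad2',
    device=device,
)

print(qa(question='Where do I live?', context='My name is Clara and I live in Berkeley.'))
```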
We haven't seen any other customers report this issue. Could you please update the transformers version and test again?
P.S. You can go with ml.g4dn.xlarge instead of ml.g4dn.4xlarge to save cost, since both use only 1 GPU.
You can find a list of available versions here: Reference
Below is your shared snippet, updated to a higher version:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

hub = {
    'HF_MODEL_ID': 'deepset/roberta-base-squad2',
    'HF_TASK': 'question-answering'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.12',
    pytorch_version='1.9',
    py_version='py38',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.g4dn.xlarge'  # ec2 instance type
)
```
In addition, note that modern GPUs have large parallel compute capacity (2,500+ cores in the T4 of the g4dn family), and it is hard to keep them busy. The compute workload of a few single-record inferences is thousands of times smaller than training (which also runs backprop in addition to the forward pass, and batches its compute), so running a few inferences by hand often does not generate enough activity to show up in CloudWatch, which produces 1-minute aggregates.