How do I deploy a hub model to SageMaker and give it a GPU (not Elastic Inference)?

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

hub = {
    'HF_MODEL_ID': 'deepset/roberta-base-squad2',
    'HF_TASK': 'question-answering'
}

huggingface_model = HuggingFaceModel(
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,        # number of instances
    instance_type='ml.g4dn.4xlarge', # EC2 instance type
)

No matter how many requests I throw at it, GPU utilization in the SageMaker monitoring dashboard never goes above 0.

Any help or troubleshooting tips would be greatly appreciated. 🙂

I believe that you have to provide your own inference script if you want to leverage the GPU. This inference script needs to check if a GPU is available, i.e. it needs to contain a line like this:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Once you have detected the GPU, you can call the pipeline API with the device parameter, e.g.:

pipeline("question-answering", model=model, tokenizer=tokenizer, device=device)

This notebook shows how to deploy a HF model using your own inference script by providing the entry_point and the source_dir parameters when calling HuggingFaceModel(): text-summarisation-project/4a_model_testing_deployed.ipynb at main · marshmellow77/text-summarisation-project · GitHub
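
Roughly, wiring such a script into the deployment could look like this (a sketch only: the code/ directory, the inference.py name, and the S3 path for model_data are placeholders for your own artifacts):

from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data='s3://your-bucket/model/model.tar.gz',  # placeholder S3 path to your model archive
    entry_point='inference.py',  # your custom inference script
    source_dir='code',           # directory containing the script
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',
)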

Hello @bobloki,

Normally the Inference Toolkit identifies if a GPU is available and uses it. See here: sagemaker-huggingface-inference-toolkit/transformers_utils.py at 7cb5009fef6566199ef47ed9ca2a3de4f81c0844 · aws/sagemaker-huggingface-inference-toolkit · GitHub
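
Paraphrasing (not the toolkit's exact code), the behaviour is roughly: pick the first GPU if CUDA is visible, otherwise fall back to CPU.

import torch
from transformers import pipeline

# Rough paraphrase of the toolkit's device selection, not its exact code
device = 0 if torch.cuda.is_available() else -1  # -1 = CPU, 0 = first GPU
qa = pipeline('question-answering', model='deepset/roberta-base-squad2', device=device)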

And we haven't seen any other customers report this issue.

Could you please try updating the transformers version and test again?
P.S. You can go with ml.g4dn.xlarge instead of ml.g4dn.4xlarge to save cost, since both instance types have only 1 GPU.

You can find a list of available versions here: Reference
Below is your shared snippet using higher versions:

hub = {
	'HF_MODEL_ID':'deepset/roberta-base-squad2',
	'HF_TASK':'question-answering'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
	transformers_version='4.12',
	pytorch_version='1.9',
	py_version='py36',
	env=hub,
	role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
	initial_instance_count=1, # number of instances
	instance_type='ml.g4dn.xlarge' # ec2 instance type
)
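
Once the endpoint is up, you can send a request like this to sanity-check it (the question/context text is just an example):

data = {
    'inputs': {
        'question': 'What is used for inference?',
        'context': 'The endpoint runs on an ml.g4dn.xlarge instance with one GPU.'
    }
}
print(predictor.predict(data))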

See the answer below from @philschmid.

In addition, note that modern GPUs have large parallel compute capabilities (2,500+ cores in the T4 of the g4dn family), and it is tough to keep them busy. The compute workload of a few single-record inferences is thousands of times smaller than that of training (which also runs backprop in addition to the forward pass, and batches its compute), so running a few inferences manually on the GPU may not make it busy enough to show activity in CloudWatch, which produces 1-minute aggregates.
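
If you want to see the metric move, you would need to keep the endpoint busy for a sustained period, for example with a loop like this (illustrative only; adjust the payload and duration to your case):

import time

data = {
    'inputs': {
        'question': 'What is used for inference?',
        'context': 'The endpoint runs on an ml.g4dn.xlarge instance with one GPU.'
    }
}

# Keep the endpoint busy for ~2 minutes so the 1-minute CloudWatch
# aggregates have something to show.
end = time.time() + 120
while time.time() < end:
    predictor.predict(data)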


@philschmid thank you!! This totally worked. The only thing I changed was the versions.

@OlivierCR I hear ya but I was sending quite a few requests. I can see a jump now.
