I believe that you have to provide your own inference script if you want to leverage the GPU. This inference script needs to check if a GPU is available, i.e. it needs to contain a line like this:
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```
Once you have detected the GPU, you can pass it to the Pipeline API via the `device` parameter.
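For example, a minimal sketch of such an inference script (using the same `deepset/roberta-base-squad2` model as in the snippet below; note that `pipeline()` takes a device *index* rather than a `torch.device` object):

```python
import torch
from transformers import pipeline

# pipeline() expects an int: 0 selects the first GPU, -1 (the default) runs on CPU
device = 0 if torch.cuda.is_available() else -1

qa = pipeline(
    'question-answering',
    model='deepset/roberta-base-squad2',
    device=device,
)

print(qa(question='Where do I live?', context='My name is Clara and I live in Berkeley.'))
```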
We haven't seen any other customers report this issue. Could you please update the transformers version and test again?
P.S. You can go with ml.g4dn.xlarge instead of ml.g4dn.4xlarge to save cost, since both use only 1 GPU.
You can find a list of available versions here: Reference
Below is your shared snippet, updated to a higher version:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

hub = {
    'HF_MODEL_ID': 'deepset/roberta-base-squad2',
    'HF_TASK': 'question-answering'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.12',
    pytorch_version='1.9',
    py_version='py38',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.g4dn.xlarge'  # ec2 instance type
)
```
In addition, note that modern GPUs have large parallel compute capacity (2,500+ cores in the T4 of the g4dn family), and it is hard to keep them busy. The compute workload of a few single-record inferences is thousands of times smaller than training (which also runs backprop in addition to the forward pass, and batches its compute), so running a few inferences by hand often does not generate enough activity to show up in CloudWatch, which produces 1-minute aggregates.