I believe you have to provide your own inference script if you want to leverage the GPU. The script needs to check whether a GPU is available, e.g. with a line like this:
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```
Once you have detected the GPU you can call the Pipeline API with the device parameter, e.g.:
pipeline("question-answering", model=model, tokenizer=tokenizer, device=device)
This notebook shows how to deploy a HF model using your own inference script by providing the entry_point and source_dir parameters when calling HuggingFaceModel(): text-summarisation-project/4a_model_testing_deployed.ipynb (marshmellow77/text-summarisation-project on GitHub)
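For completeness, the deployment call follows this general pattern; the S3 path, container versions, and instance type below are placeholders, so check the notebook and the SageMaker docs for a supported version combination:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",  # placeholder: your model artifact in S3
    role=sagemaker.get_execution_role(),       # your SageMaker execution role
    transformers_version="4.26",               # example versions -- pick a supported combo
    pytorch_version="1.13",
    py_version="py39",
    entry_point="inference.py",                # the custom inference script from above
    source_dir="code",                         # local dir holding the script (and requirements.txt, if any)
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # a GPU instance, so torch.cuda.is_available() is True
)
```

Once deployed, calling predictor.predict() with your payload routes the request through the custom predict_fn in the script.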