Issues using GPU with HuggingFace (TensorFlow) model deployed to SageMaker endpoint

I successfully fine-tuned a HuggingFace model (distilbert-base-uncased) for a multi-class classification problem using a SageMaker training job. After training, I deployed the model to a real-time SageMaker endpoint with instance type ml.g5.2xlarge (I also tried ml.p3.2xlarge). Initial tests showed that the model worked fine after deployment. However, GPU utilization for the endpoint never rose above 0%, and the endpoint creation logs (below) show TensorFlow skipping GPU registration because it could not load several CUDA libraries.

Any ideas on how to fix this? Thanks!

I am using the following package versions (these are what I pass to the SageMaker SDK; a sketch of my deploy code is below):
transformers_version="4.17",
pytorch_version="1.10.2",
tensorflow_version="2.6",
py_version="py38"
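For reference, this is roughly how I deploy. A minimal sketch, not my exact code: the model_data path and role are placeholders, and since this is a TensorFlow model I pass only tensorflow_version (the HuggingFaceModel constructor takes a single framework version).

```python
from sagemaker.huggingface import HuggingFaceModel

# Placeholders -- my actual S3 path and IAM role differ.
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/model/model.tar.gz",  # placeholder path
    role="my-sagemaker-execution-role",              # placeholder role
    transformers_version="4.17",
    tensorflow_version="2.6",
    py_version="py38",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # also tried ml.p3.2xlarge
)
```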

Here are the logs from endpoint creation:

2023-12-12 21:39:14.926292: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2023-12-12 21:39:14.926404: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2023-12-12 21:39:14.969866: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
Warning: MMS is using non-default JVM parameters: -XX:-UseContainerSupport
2023-12-12T21:39:17,674 [INFO ] main com.amazonaws.ml.mms.ModelServer - 
MMS Home: /opt/conda/lib/python3.8/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 1
Number of CPUs: 8
Max heap size: 7068 M
Python executable: /opt/conda/bin/python3.8
Config file: /etc/sagemaker-mms.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Model Store: /.sagemaker/mms/models
Initial Models: ALL
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2023-12-12T21:39:17,713 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-model
2023-12-12T21:39:17,764 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /home/model-server/tmp/.mms.sock.9000 --handler sagemaker_huggingface_inference_toolkit.handler_service --model-path /.sagemaker/mms/models/model --model-name model --preload-model false --tmp-dir /home/model-server/tmp
2023-12-12T21:39:17,764 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9000
2023-12-12T21:39:17,764 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 54
2023-12-12T21:39:17,765 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2023-12-12T21:39:17,765 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.8.10
2023-12-12T21:39:17,765 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model model loaded.
2023-12-12T21:39:17,768 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-12-12T21:39:17,774 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-12-12T21:39:17,814 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
Model server started.
2023-12-12T21:39:17,816 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-12-12T21:39:17,817 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2023-12-12T21:39:18,302 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:18.301838: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2023-12-12T21:39:18,302 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:18.301947: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2023-12-12T21:39:18,343 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:18.343318: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2023-12-12T21:39:18,520 [INFO ] pool-2-thread-3 ACCESS_LOG - /169.254.178.2:37850 "GET /ping HTTP/1.1" 200 8
2023-12-12T21:39:20,511 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:20.511326: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-12T21:39:20,512 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:20.512388: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib:/home/.openmpi/lib/
2023-12-12T21:39:20,513 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:20.512475: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib:/home/.openmpi/lib/
2023-12-12T21:39:20,513 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:20.512538: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib:/home/.openmpi/lib/
2023-12-12T21:39:20,513 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:20.512601: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib:/home/.openmpi/lib/
2023-12-12T21:39:20,514 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:20.512662: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib:/home/.openmpi/lib/
2023-12-12T21:39:20,514 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:20.512745: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib:/home/.openmpi/lib/
2023-12-12T21:39:20,515 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:20.512849: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib:/home/.openmpi/lib/
2023-12-12T21:39:20,515 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2023-12-12 21:39:20.512864: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
2023-12-12T21:39:20,515 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Skipping registering GPU devices...
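
The "Could not load dynamic library 'libcublas.so.11'" warnings followed by "Skipping registering GPU devices..." at the end suggest that TensorFlow inside the inference container cannot find the CUDA 11 libraries, so it silently falls back to CPU. To confirm this from inside the endpoint, I can log GPU visibility in a custom code/inference.py (a minimal sketch, assuming the HuggingFace inference toolkit picks up a model_fn override from my model archive; the "text-classification" task is specific to my setup):

```python
# code/inference.py -- diagnostic sketch, not production code.
import logging

import tensorflow as tf
from transformers import pipeline

logger = logging.getLogger(__name__)


def model_fn(model_dir):
    # If the CUDA libraries are missing (as the warnings above suggest),
    # this should log an empty list.
    logger.warning("Visible GPUs: %s", tf.config.list_physical_devices("GPU"))
    # Load the fine-tuned model with the TF backend.
    return pipeline("text-classification", model=model_dir, framework="tf")
```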