CUDA error for inference on GPU instance

Hi,

We deploy an endpoint like this:

        hub = {
            'HF_MODEL_ID':  'xlm-roberta-large-finetuned-conll03-english',
            'HF_TASK': 'token-classification',
            'MMS_JOB_QUEUE_SIZE': '400',
        }

        # create Hugging Face Model Class
        huggingface_model = HuggingFaceModel(
            transformers_version='4.26.0',
            pytorch_version='1.13.1',
            py_version='py39',
            env=hub,
            role=self.role,
        )

        # deploy model to SageMaker Inference
        predictor = huggingface_model.deploy(
            initial_instance_count=1,
            instance_type='ml.g4dn.xlarge'
        )

We then typically run inference against that endpoint for days, sometimes weeks, without any issues, until all of a sudden we start receiving the error below. From that point on, every request fails the same way and the endpoint never recovers; we have to create a new endpoint and destroy the old one. We haven’t been able to link the onset of the error to any particular request payload.
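For context, we send JSON payloads to the endpoint roughly like this (a simplified sketch; the endpoint name and example text are placeholders, and the actual `invoke_endpoint` call is commented out since it needs live AWS credentials):

```python
import json

def build_request(text):
    """Serialize a token-classification request body for the /invocations API."""
    return json.dumps({"inputs": text})

body = build_request("Wolfgang lives in Berlin")

# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="my-token-classification-endpoint",  # placeholder name
#     ContentType="application/json",
#     Body=body,
# )
```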

2023-05-11T04:35:05,418 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error
2023-05-11T04:35:05,418 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2023-05-11T04:35:05,418 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 234, in handle
2023-05-11T04:35:05,418 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     response = self.transform_fn(self.model, input_data, content_type, accept)
2023-05-11T04:35:05,418 [INFO ] W-9000-xlm-roberta-large-finetun com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 8
2023-05-11T04:35:05,418 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 190, in transform_fn
2023-05-11T04:35:05,418 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     predictions = self.predict(processed_data, model)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 156, in predict
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     prediction = model(inputs, **parameters)
2023-05-11T04:35:05,419 [INFO ] W-9000-xlm-roberta-large-finetun ACCESS_LOG - /169.254.178.2:33918 "POST /invocations HTTP/1.1" 400 9
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 216, in __call__
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return super().__call__(inputs, **kwargs)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1084, in __call__
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1091, in run_single
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     model_outputs = self.forward(model_inputs, **forward_params)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/base.py", line 992, in forward
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     model_outputs = self._forward(model_inputs, **forward_params)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 242, in _forward
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     logits = self.model(**model_inputs)[0]
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return forward_call(*input, **kwargs)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 1418, in forward
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     outputs = self.roberta(
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return forward_call(*input, **kwargs)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 853, in forward
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     encoder_outputs = self.encoder(
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return forward_call(*input, **kwargs)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 527, in forward
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     layer_outputs = layer_module(
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return forward_call(*input, **kwargs)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 412, in forward
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self_attention_outputs = self.attention(
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return forward_call(*input, **kwargs)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 339, in forward
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self_outputs = self.self(
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return forward_call(*input, **kwargs)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 196, in forward
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     mixed_query_layer = self.query(hidden_states)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return forward_call(*input, **kwargs)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return F.linear(input, self.weight, self.bias)
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2023-05-11T04:35:05,419 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/mms/service.py", line 108, in predict
2023-05-11T04:35:05,420 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
2023-05-11T04:35:05,420 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.9/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 243, in handle
2023-05-11T04:35:05,420 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     raise PredictionException(str(e), 400)
2023-05-11T04:35:05,420 [INFO ] W-xlm-roberta-large-finetun-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())` : 400

Does anyone have any advice?

Regards,
Steliyan

Hello @steliyan,

This looks like a CUDA/cuBLAS error. Did it happen only once? It might be related to the queue size (MMS_JOB_QUEUE_SIZE) you defined.

Hi @philschmid

Thank you for your reply!
No, it has happened 7 or 8 times for a period of about 4 months. It happened a couple of times even before we defined MMS_JOB_QUEUE_SIZE so I don’t think it is the reason.

Kind regards,
Steliyan