CUDA error when deploying model with custom inference

Hi everyone! I am trying to deploy OpenChat 3.5 (7B) with a custom inference script, but I have been running into CUDA issues. I followed @philschmid's guide on creating the custom inference script: Creating document embeddings with Hugging Face's Transformers & Amazon SageMaker.

My model_fn function looks like this:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def model_fn(model_dir, arg2=None):
    global device
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # load tokenizer and model from the unpacked model artifact
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, cache_dir="/tmp/model_cache/").to(device)
    model.eval()

    return model, tokenizer

The transform_fn function just tokenizes the input, calls model.generate with output_scores=True, and finally runs a softmax over the relevant logits.
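For completeness, here is roughly what transform_fn does (a simplified sketch, not my exact code; the max_new_tokens value and the token_ids request field are placeholders for illustration):

import json
import torch

def transform_fn(model_and_tokenizer, input_data, content_type, accept):
    model, tokenizer = model_and_tokenizer
    data = json.loads(input_data)

    # tokenize the prompt and generate, keeping the per-step logits
    inputs = tokenizer(data["inputs"], return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=1,
            output_scores=True,
            return_dict_in_generate=True,
        )

    # softmax over the logits of the first generated token,
    # then pick out the probabilities for the tokens I care about
    probs = torch.softmax(output.scores[0][0], dim=-1)
    return json.dumps({"probs": probs[data["token_ids"]].tolist()})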

I have tried instances all the way up to ml.g5.12xlarge, but it keeps running out of CUDA memory. According to the logs, the error happens in AutoModelForCausalLM.from_pretrained. Does anyone have an idea why this is happening? The model is only 7B, so I would expect it to fit on this instance. I am not sure what other parameters or optimizations I need to add, or how.
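One thing I was wondering about: from_pretrained loads the weights in float32 by default, which for a 7B model is roughly 28 GB, more than the ~24 GB on a single A10G (and the error shows everything landing on GPU 0, even though the g5.12xlarge has four GPUs). Would loading in half precision, along the lines of the sketch below, be the right kind of fix, or do I also need something like device_map="auto" to shard across the GPUs? This is just a sketch of what I mean, not something I have verified:

import torch
from transformers import AutoModelForCausalLM

# sketch only: load the 7B weights in float16 (~14 GB) instead of the default float32 (~28 GB)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    cache_dir="/tmp/model_cache/",
    torch_dtype=torch.float16,  # half precision
).to(device)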

Error:

An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacty of 22.20 GiB of which 89.12 MiB is free. Process 12483 has 22.11 GiB memory in use. Of the allocated memory 21.38 GiB is allocated by PyTorch, and 7.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
}

Model deployment:

huggingface_model = HuggingFaceModel(
    model_data="s3://FOLDERt/model.tar.gz",
    role=role,
    transformers_version="4.37",  # transformers version used
    pytorch_version="2.1",        # pytorch version used
    py_version="py310",
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    role=role,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=300,
    tags=tags,
)

Thanks in advance!!