CUDA error when deploying model with custom inference

Hi everyone! I am trying to deploy OpenChat 3.5 (7B) with a custom inference script, but I have been running into CUDA issues. I followed @philschmid's guide on creating the custom inference script: Creating document embeddings with Hugging Face's Transformers & Amazon SageMaker.

My model_fn function looks like this:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def model_fn(model_dir, arg2=None):
    global device
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # load tokenizer and model from the unpacked model artifact
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, cache_dir="/tmp/model_cache/").to(device)
    model.eval()

    return model, tokenizer

The transform_fn function just tokenizes the input, calls model.generate with output_scores=True, and finally runs a softmax over the relevant logits.
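For completeness, here is roughly what transform_fn does (a simplified sketch, not my exact code; the max_new_tokens value and the token_ids request field are placeholders for illustration):

import json
import torch

def transform_fn(model_and_tokenizer, input_data, content_type, accept):
    model, tokenizer = model_and_tokenizer
    data = json.loads(input_data)

    # tokenize the prompt and generate, keeping the per-step logits
    inputs = tokenizer(data["inputs"], return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=1,
            output_scores=True,
            return_dict_in_generate=True,
        )

    # softmax over the logits of the first generated token,
    # then pick out the probabilities for the tokens I care about
    probs = torch.softmax(output.scores[0][0], dim=-1)
    return json.dumps({"probs": probs[data["token_ids"]].tolist()})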

I have tried instances all the way up to ml.g5.12xlarge, but it keeps running out of CUDA memory. According to the logs, the error happens in AutoModelForCausalLM.from_pretrained. Does anyone have an idea why this is happening? The model is only 7B, so I would expect it to fit on this instance. I am not sure what other parameters or optimizations I need to add, or how.
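One thing I was wondering about: from_pretrained loads the weights in float32 by default, which for a 7B model is roughly 28 GB, more than the ~24 GB on a single A10G (and the error shows everything landing on GPU 0, even though the g5.12xlarge has four GPUs). Would loading in half precision, along the lines of the sketch below, be the right kind of fix, or do I also need something like device_map="auto" to shard across the GPUs? This is just a sketch of what I mean, not something I have verified:

import torch
from transformers import AutoModelForCausalLM

# sketch only: load the 7B weights in float16 (~14 GB) instead of the default float32 (~28 GB)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    cache_dir="/tmp/model_cache/",
    torch_dtype=torch.float16,  # half precision
).to(device)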

Error:

An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacty of 22.20 GiB of which 89.12 MiB is free. Process 12483 has 22.11 GiB memory in use. Of the allocated memory 21.38 GiB is allocated by PyTorch, and 7.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
}

Model deployment:

huggingface_model = HuggingFaceModel(
    model_data="s3://FOLDERt/model.tar.gz",
    role=role,
    transformers_version="4.37",  # transformers version used
    pytorch_version="2.1",        # pytorch version used
    py_version="py310",
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    role=role,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=300,
    tags=tags,
)

Thanks in advance!!