LLM with 1048k context hosted on SageMaker

I downloaded the Llama 3 70B model (gradientai/Llama-3-70B-Instruct-Gradient-1048k · Hugging Face), which can handle a 1048k-token context length.

I want to deploy this model to a SageMaker real-time endpoint on an ml.g5.48xlarge instance.

I succeeded with the default 4096 tokens, but when I go above 10,000 tokens I get an out-of-memory error like this:

“torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.40 GiB. GPU #004 has a total capacity of 22.20 GiB of which 1.94 GiB is free. Process 62462 has 20.26 GiB memory in use. Of the allocated memory 17.89 GiB is allocated by PyTorch, and 431.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.4 documentation)”
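For what it's worth, I assume the allocator hint from the traceback would have to be passed to the TGI container as an extra environment variable (I haven't verified whether it actually helps with the KV-cache growth at long context), something like:

```python
# Hypothetical extra container environment variable, taken from the
# traceback's own suggestion; it would be merged into the endpoint's
# env dict shown further below.
extra_env = {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}
```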

What can I do on the SageMaker endpoint side to use the full 1,048,000-token context?

PS: these are the environment variables I'm using for this model:
HF_MODEL_ID: "/opt/ml/model"
SM_NUM_GPUS: "8"
MESSAGES_API_ENABLED: "true"
MAX_INPUT_TOKENS: "4000"
MAX_TOTAL_TOKENS: "4096"
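For context, the deployment itself is roughly the standard Hugging Face TGI pattern on SageMaker; a sketch along these lines (the execution role, S3 model path, and TGI image version below are placeholders, not my exact values) is how these variables get passed to the container:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Placeholders: execution role, S3 location of the packed model, TGI image version.
role = sagemaker.get_execution_role()
model_data = "s3://my-bucket/llama-3-70b-gradient-1048k/model.tar.gz"
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.0.2")

# The same container environment variables listed above.
env = {
    "HF_MODEL_ID": "/opt/ml/model",
    "SM_NUM_GPUS": "8",
    "MESSAGES_API_ENABLED": "true",
    "MAX_INPUT_TOKENS": "4000",
    "MAX_TOTAL_TOKENS": "4096",
}

model = HuggingFaceModel(
    role=role,
    model_data=model_data,
    image_uri=image_uri,
    env=env,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    container_startup_health_check_timeout=1800,
)
```

Raising the context (presumably via MAX_INPUT_TOKENS and MAX_TOTAL_TOKENS in this env dict) past roughly 10,000 tokens is what triggers the OOM above.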