LLM with 1048k context hosted on SageMaker

I downloaded the Llama 3 70B model (gradientai/Llama-3-70B-Instruct-Gradient-1048k · Hugging Face), which can handle a 1048k-token context length.

I want to deploy this model to a SageMaker real-time endpoint on an ml.g5.48xlarge instance.

I succeeded with the default 4096 tokens, but when I go above 10,000 tokens I get an out-of-memory error like this:

“torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.40 GiB. GPU #004 has a total capacity of 22.20 GiB of which 1.94 GiB is free. Process 62462 has 20.26 GiB memory in use. Of the allocated memory 17.89 GiB is allocated by PyTorch, and 431.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.4 documentation)”
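For what it's worth, I assume the allocator hint from the traceback would have to be passed to the TGI container as an extra environment variable (I haven't verified whether it actually helps with the KV-cache growth at long context), something like:

```python
# Hypothetical extra container environment variable, taken from the
# traceback's own suggestion; it would be merged into the endpoint's
# env dict shown further below.
extra_env = {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}
```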

What can I do on the SageMaker endpoint side to use the full 1,048,000-token context?

PS: these are the environment variables I'm using for this model:
HF_MODEL_ID: "/opt/ml/model"
SM_NUM_GPUS: "8"
MESSAGES_API_ENABLED: "true"
MAX_INPUT_TOKENS: "4000"
MAX_TOTAL_TOKENS: "4096"
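For context, the deployment itself is roughly the standard Hugging Face TGI pattern on SageMaker; a sketch along these lines (the execution role, S3 model path, and TGI image version below are placeholders, not my exact values) is how these variables get passed to the container:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Placeholders: execution role, S3 location of the packed model, TGI image version.
role = sagemaker.get_execution_role()
model_data = "s3://my-bucket/llama-3-70b-gradient-1048k/model.tar.gz"
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.0.2")

# The same container environment variables listed above.
env = {
    "HF_MODEL_ID": "/opt/ml/model",
    "SM_NUM_GPUS": "8",
    "MESSAGES_API_ENABLED": "true",
    "MAX_INPUT_TOKENS": "4000",
    "MAX_TOTAL_TOKENS": "4096",
}

model = HuggingFaceModel(
    role=role,
    model_data=model_data,
    image_uri=image_uri,
    env=env,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    container_startup_health_check_timeout=1800,
)
```

Raising the context (presumably via MAX_INPUT_TOKENS and MAX_TOTAL_TOKENS in this env dict) past roughly 10,000 tokens is what triggers the OOM above.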