Hello! I wanted to run inference with the codegemma model from Hugging Face, but when I call the model.generate(**inputs) method, peak GPU memory usage (measured with the torch profiler) jumps from 39 GB to 49 GB, no matter what max_token_len I set. I understand that we need to keep the model's activations during inference and a context of around 4096 input tokens, but I can't believe that this alone increases inference memory usage by 10 GB. Can someone explain how this could happen? Thank you in advance.
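For reference, this is roughly the kind of measurement setup I mean; a minimal sketch, not my exact script (the checkpoint name, prompt length, and max_new_tokens value are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the actual model and dtype may differ.
model_id = "google/codegemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

# A long prompt standing in for the ~330 / 2700 / 8600-token inputs.
prompt = "def fibonacci(n):\n" * 500
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Track peak allocated memory around the generate call.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)

print(f"Peak GPU memory during generate: "
      f"{torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```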
P.S. I attached the profiler traces for inputs of 330, 2700, and 8600 tokens, respectively, below.