How is memory managed when loading a model?

Hi,

I've used the transformers library in the past, but I never really understood how a model is loaded into memory. Recently, I tried running a larger model than usual on my computer using only the CPU, as I don't have a proper GPU yet.

I used the following code to load the model and test inference time:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# ask input in terminal
prompt = input("Model prompt >>> ")

# encode input
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# generate output
output = model.generate(input_ids, max_length=200, do_sample=True, top_p=0.95, top_k=60)

# decode output
print("Output >>> " + tokenizer.decode(output[0], skip_special_tokens=True))

I expected the model to use approximately 24 GB of memory based on the number of parameters (about 6 billion parameters × 4 bytes per fp32 parameter), but as the model started loading, I noticed it used far more, peaking at about 40 GB. Once the model finished loading, however, memory usage came back down to around 24 GB as expected.
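For reference, this is roughly how I arrived at the 24 GB estimate (a quick sketch that reuses the model object from the snippet above and assumes the default fp32 weights):

# continuing from the code above, with `model` already loaded
n_params = sum(p.numel() for p in model.parameters())  # ~6 billion for GPT-J-6B
bytes_per_param = 4                                     # fp32 = 4 bytes per parameter
print(f"{n_params / 1e9:.2f}B parameters")
print(f"~{n_params * bytes_per_param / 1e9:.1f} GB just for the weights")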

I don't really understand how memory is handled while a model is loading. I know it has something to do with how transformers uses PyTorch or TensorFlow, but I can't find a detailed description of the loading process that would explain what I'm seeing.

Does anyone know how model weights are loaded into memory, or have some resources on this topic?
Thanks!

Hi,

See this guide: Handling big models for inference. Basically, we first create the model with randomly initialized weights, then load the pre-trained weights into it. During loading, the randomly initialized model and the checkpoint weights both sit in memory at the same time, which is why you see a temporary peak well above the final ~24 GB.
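To make that concrete, here is a simplified sketch of the idea in plain PyTorch. This is not the exact transformers implementation, and the checkpoint path is just a placeholder:

import torch
from transformers import AutoConfig, AutoModelForCausalLM

# 1. Build the model skeleton with randomly initialized weights (~24 GB for 6B fp32 params)
config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_config(config)

# 2. Read the checkpoint from disk into a second, separate copy of the weights (~24 GB more)
state_dict = torch.load("path/to/pytorch_model.bin", map_location="cpu")

# 3. Copy the checkpoint tensors into the model, then free the extra copy (back to ~24 GB)
model.load_state_dict(state_dict)
del state_dict

If the peak is a problem, recent versions of transformers also accept a low_cpu_mem_usage flag and a torch_dtype argument in from_pretrained, which avoid materializing the random weights first and can load the checkpoint in half precision; my understanding is that this keeps peak usage close to the final footprint:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    low_cpu_mem_usage=True,     # skip the separate random initialization pass
    torch_dtype=torch.float16,  # optional: ~12 GB of weights instead of ~24 GB
)

Note that fp16 inference on CPU can be slow or unsupported for some operations, so you may prefer to keep fp32 there and use low_cpu_mem_usage on its own.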

I see, thanks for the quick reply!