How is memory managed when loading a model?


I have used the transformers library in the past, but I never really understood how a model is loaded into memory. Recently, I tried a larger model than usual on my computer, using only the CPU since I don’t have a proper GPU yet.

I used the following code to load the model and test inference time:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# ask input in terminal
prompt = input("Model prompt >>> ")

# encode input
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# generate output
output = model.generate(input_ids, max_length=200, do_sample=True, top_p=0.95, top_k=60)

# decode output
print("Output >>> " + tokenizer.decode(output[0], skip_special_tokens=True))

I expected the model to use approximately 24 GB of memory based on the number of parameters, but as the model started loading, I noticed that it used far more, peaking at about 40 GB. Once the model finished loading, however, memory usage came back down to around 24 GB as expected.
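(For context, the 24 GB figure is just a back-of-envelope estimate, assuming fp32 weights at 4 bytes per parameter:)

num_params = 6_000_000_000   # GPT-J-6B parameter count, approximately
bytes_per_param = 4          # fp32 stores each parameter in 4 bytes
print(num_params * bytes_per_param / 1e9, "GB")  # → 24.0 GB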

I don’t really understand how memory is handled while a model is loading. I know it has something to do with the way transformers uses PyTorch or TensorFlow underneath, but I can’t find a detailed description of the loading process that would explain my findings.

Does anyone know how model weights are loaded into memory, and maybe have some resources on this topic?


See this guide: Handling big models for inference. Basically, we first create the model with randomly initialized weights, then load the pre-trained weights into it. While the checkpoint is being read, both the randomly initialized weights and the checkpoint tensors sit in memory at the same time, which is the peak you observed; once the pre-trained weights have been copied into the model, the checkpoint copy is freed and usage falls back to the size of the model.
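A minimal sketch of the trick the guide describes, using PyTorch's "meta" device (this assumes a recent PyTorch; the layer size is just an illustration, not anything GPT-J-specific):

import torch

# Tensors on the "meta" device carry shape and dtype but no storage, so
# building even a very large model this way allocates essentially no
# memory. Accelerate instantiates the model like this first, then
# streams the real checkpoint weights in, avoiding the ~2x peak.
with torch.device("meta"):
    layer = torch.nn.Linear(4096, 4096)

print(layer.weight.device)   # meta
print(layer.weight.shape)    # torch.Size([4096, 4096])

In transformers you can get this behavior directly by passing `low_cpu_mem_usage=True` to `from_pretrained` (it requires accelerate to be installed).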

I see, thanks for the quick reply!