Hi,
I have used the transformers library in the past, but I never really understood how a model is loaded into memory. Recently, I tried running a larger model than usual on my machine, using only the CPU since I don't have a proper GPU yet.
I used the following code to load the model and test inference time:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
# read the prompt from the terminal
prompt = input("Model prompt >>> ")
# encode input
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# generate output
output = model.generate(input_ids, max_length=200, do_sample=True, top_p=0.95, top_k=60)
# decode output
print("Output >>> " + tokenizer.decode(output[0], skip_special_tokens=True))
I expected the model to use approximately 24 GB of memory based on the number of parameters, but as the model was loading I noticed it used far more, peaking at about 40 GB. Once loading finished, however, memory usage came back down to around 24 GB as expected.
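(The 24 GB figure is just 6 billion parameters × 4 bytes per float32 weight. I read the peak from my system monitor while from_pretrained was running; the before/after numbers can be checked with something like the snippet below, assuming the psutil package is installed.)

import os
import psutil
from transformers import AutoModelForCausalLM

def rss_gb():
    # resident memory of this process, in GB
    return psutil.Process(os.getpid()).memory_info().rss / 1024**3

print(f"before loading: {rss_gb():.1f} GB")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
print(f"after loading:  {rss_gb():.1f} GB")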
I don’t really understand how memory is handled while a model is being loaded. I know it has something to do with how transformers uses PyTorch (or TensorFlow) under the hood, but I can’t find a detailed description of the loading process that explains what I observed.
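My rough guess, and please correct me if this is wrong, is that from_pretrained follows the standard PyTorch pattern sketched below, in which case the randomly initialized model and the checkpoint's state dict both sit in memory until the copy finishes (the local filename here is just illustrative):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

# 1) build the model with randomly initialized weights (~24 GB for 6B fp32 parameters)
config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_config(config)

# 2) load the checkpoint weights into a separate state dict (roughly another 24 GB)
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# 3) copy the weights into the model; only then can the state dict be freed
model.load_state_dict(state_dict)
del state_dict

If that is what happens, it would explain why memory temporarily jumps well above the final 24 GB and drops back once loading is done, but I'd love a pointer to where this actually happens in the code.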
Does anyone know how model data gets loaded into memory, and maybe have some resources on this topic?
Thanks!