Is the model stored in free RAM or available RAM?

Hi all,

I’m trying to get a handle on where the Hugging Face transformers library stores a model when I load it and use it for inference. I’m on Ubuntu 22.04 LTS, Python 3.10, and transformers 4.41.2.

I’ve observed the following. First, to monitor memory usage, in a terminal I run:

watch -n 5 free -m

As you’d typically expect at this point, free memory is a small number and available memory is a large number. Next, to drop the caches, I run:

echo 3 | sudo tee /proc/sys/vm/drop_caches

Now free memory and available memory are both (roughly) the same large number.
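
(Side note, in case anyone wants to reproduce this without eyeballing the watch output: the same numbers that free -m reports can also be read from inside Python with psutil. This is just a sketch, assuming psutil is installed; it isn’t something I used in my original session.)

import psutil

vm = psutil.virtual_memory()
print(f"free:      {vm.free / 2**20:.0f} MiB")
print(f"available: {vm.available / 2**20:.0f} MiB")
print(f"cached:    {vm.cached / 2**20:.0f} MiB")  # Linux page cache
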
Next I open python3 in a terminal and, in Python, I run:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
model_save_path = "/home/colin/LLMs/phi-3"
tokenizer = AutoTokenizer.from_pretrained(model_save_path, local_files_only=True, device_map="cpu")
model = AutoModelForCausalLM.from_pretrained(model_save_path, local_files_only=True, device_map="cpu")

to load Phi-3 from a locally stored copy. (To clarify, this is Phi-3-mini-4k-instruct, which I was expecting to take somewhere around 15 gigabytes.)
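
For what it’s worth, the 15 gigabyte figure is just back-of-the-envelope: Phi-3-mini has roughly 3.8 billion parameters, and as far as I understand, from_pretrained loads them as float32 by default, i.e. 4 bytes each. A quick sketch along these lines (which I didn’t actually run at the time) would confirm the size and dtype of the loaded weights:

# Sanity check (sketch): ~3.8e9 parameters * 4 bytes (float32) is roughly 15 GB.
n_params = sum(p.numel() for p in model.parameters())
n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"{n_params / 1e9:.1f}B parameters, {n_bytes / 1e9:.1f} GB of weights")
print(next(model.parameters()).dtype)  # expecting torch.float32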

Free memory decreases by 100 megabytes or so, nowhere near the size of the model. Okay, fair enough; my guess at this point is that the model’s structure has been initialized but the weights haven’t actually been materialized yet, even though, based on the docs, I was expecting it to be initialized with random weights and then populated with the actual weights. But all good. Next I decide to use the model for inference:

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
generation_args = {"max_new_tokens": 20}
x1 = pipe("What is the biggest public financial institution in Australia?", **generation_args)

Okay, it takes about 10 seconds to execute, and during that time free memory drops by about 15 gigabytes, but available memory does not change at all. It seems the OS never explicitly allocates RAM for the parameter values, or at least not in a way that shows up in the available figure.
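
In case the distinction matters, here is another sketch (again assuming psutil, and again not something I ran at the time) that would compare the Python process’s own resident memory against the system-wide numbers right after the pipe call, and also check whether the weights are real tensors by then:

import os
import psutil

# Is the first parameter a real CPU tensor, or still an empty "meta" placeholder?
p = next(model.parameters())
print(p.device, p.dtype, p.is_meta)

# Compare the process's own resident memory with the system-wide numbers.
proc = psutil.Process(os.getpid())
vm = psutil.virtual_memory()
print(f"process RSS: {proc.memory_info().rss / 2**30:.1f} GiB")
print(f"free:        {vm.free / 2**30:.1f} GiB")
print(f"available:   {vm.available / 2**30:.1f} GiB")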

Next, I repeat all the above steps, but this time I set device_map="auto". Now the from_pretrained line causes free memory to drop by about 8 gigabytes. I think I know what happened here: I also have an RTX 3080 in this computer, and nvidia-smi shows its VRAM is close to full after this command. So it appears from_pretrained loaded about 8 gigabytes into RAM and 8 gigabytes into VRAM. Next I run the pipe command, and free memory drops by another 8 gigabytes. It feels like, under the hood, it decided at the last minute not to use the GPU after all and switched to the CPU for everything. Oh, and doing it this way takes about twice as long to run the pipe command.
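
To make that second run easier to reason about, something like the following sketch (not part of my original session, and assuming the model object from the device_map="auto" load) should show how the layers were split between devices and how much VRAM is actually in use:

import torch

# hf_device_map is set when from_pretrained is given a device_map; it maps
# module names to "cpu", "disk", or a GPU index.
print(getattr(model, "hf_device_map", None))
print(f"CUDA allocated: {torch.cuda.memory_allocated() / 2**30:.1f} GiB")
print(f"CUDA reserved:  {torch.cuda.memory_reserved() / 2**30:.1f} GiB")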

So my question is: does anyone have any idea what is actually happening when I run these commands? My goal is just to gain a basic understanding of how these various commands decide where to allocate memory, and why.

Thanks in advance to any responders.

Colin