General question about large model loading

Hi,

My question is about model loading: how is it possible that one model can be loaded onto the GPU while another one with the same number of parameters (even slightly lighter on disk) cannot?
As a concrete case, I'm trying to load a "Text Generation" model on free Google Colab (12.7 GB RAM and 15 GB GPU RAM) using:

import torch
from transformers import AutoTokenizer
import transformers


model = "tiiuae/falcon-7b-instruct"
# model = "bigscience/bloomz-7b1-mt"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",
)

While this works for this model (9.95 + 4.48 = 14.43 GB of checkpoints on disk), it does not work for model = "bigscience/bloomz-7b1-mt" (a single 14.10 GB checkpoint on disk), which exceeds the 12.7 GB of RAM. What is the root cause? Is it because the first model is split into two shards?
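For context on the sharding question, here is a small sketch (not part of the original post) that lists the checkpoint files and their sizes for both repos; it assumes the huggingface_hub client is installed and that the repo IDs are the ones above. Peak RAM while loading is roughly driven by the largest single shard, not by the total checkpoint size.

# Sketch (assumption: huggingface_hub installed): compare shard sizes of the two repos.
from huggingface_hub import HfApi

api = HfApi()
for repo in ["tiiuae/falcon-7b-instruct", "bigscience/bloomz-7b1-mt"]:
    info = api.model_info(repo, files_metadata=True)
    for f in info.siblings:
        # only look at weight files
        if f.rfilename.endswith((".bin", ".safetensors")):
            print(repo, f.rfilename, round(f.size / 1e9, 2), "GB")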

So I moved to a CPU offload strategy for the second model using:

import torch
from transformers import BitsAndBytesConfig, AutoTokenizer
import transformers


model = "tiiuae/falcon-7b-instruct"
# model = "bigscience/bloomz-7b1-mt"

quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={"quantization_config": quantization_config},
)

Here, loading either model runs out of memory, which puzzles me even more because I expected the ~14-15 GB of model weights to be dispatched between the 12.7 GB of RAM and the 15 GB of GPU RAM… Is it because device_map="auto" does not dispatch model components properly between CPU and GPU?
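One thing I found while digging: the dispatcher can be given explicit per-device limits through max_memory, and anything that still does not fit can be spilled to disk with offload_folder. Below is a minimal sketch of that (the memory limits are illustrative assumptions for the Colab setup above, not measured values):

# Sketch (assumption: limits below leave headroom for activations and the tokenizer).
import torch
import transformers
from transformers import AutoTokenizer

model = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={
        # cap what may be placed on GPU 0 and on the CPU (illustrative values)
        "max_memory": {0: "13GiB", "cpu": "10GiB"},
        # spill any remaining layers to disk
        "offload_folder": "offload",
    },
)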

Thanks a lot for any link or discussion to help me understand what happens under the hood.

Regards

Yes, because we need the model to be sharded so that only one chunk of it has to be loaded into RAM at a time.
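If you can load the single checkpoint once on a machine with enough CPU RAM, you can re-save it in sharded form so the low-RAM machine only ever holds one shard in memory while loading. A sketch (the output directory name and shard size are arbitrary choices):

# Sketch (assumption: run on a machine with enough RAM to load the full model once).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1-mt")
# split the weights into ~2 GB shards plus an index file
model.save_pretrained("bloomz-7b1-mt-sharded", max_shard_size="2GB")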