General question about large model loading

Hi,

My question is about model loading: how is it possible that one model can be loaded onto the GPU while another one with the same number of parameters (even slightly lighter on disk) cannot?
As a concrete case, I'm trying to load a "Text Generation" model on free Google Colab (12.7 GB RAM and 15 GB GPU RAM) using:

import torch
from transformers import AutoTokenizer
import transformers


model = "tiiuae/falcon-7b-instruct"
# model = "bigscience/bloomz-7b1-mt"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",
)

While this works for this model (9.95 + 4.48 = 14.43 GB of checkpoints on disk), it does not work for model = "bigscience/bloomz-7b1-mt" (a single 14.10 GB checkpoint on disk), which exceeds the 12.7 GB of RAM. What is the root cause? Is it because the first model is split into two shards?
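For context on the sharding question, here is a small sketch (not part of the original post) that lists the checkpoint files and their sizes for both repos; it assumes the huggingface_hub client is installed and that the repo IDs are the ones above. Peak RAM while loading is roughly driven by the largest single shard, not by the total checkpoint size.

# Sketch (assumption: huggingface_hub installed): compare shard sizes of the two repos.
from huggingface_hub import HfApi

api = HfApi()
for repo in ["tiiuae/falcon-7b-instruct", "bigscience/bloomz-7b1-mt"]:
    info = api.model_info(repo, files_metadata=True)
    for f in info.siblings:
        # only look at weight files
        if f.rfilename.endswith((".bin", ".safetensors")):
            print(repo, f.rfilename, round(f.size / 1e9, 2), "GB")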

So I moved to a CPU offload strategy for the second model using:

import torch
from transformers import BitsAndBytesConfig, AutoTokenizer
import transformers


model = "tiiuae/falcon-7b-instruct"
# model = "bigscience/bloomz-7b1-mt"

quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={"quantization_config": quantization_config},
)

Here, loading either model runs out of memory, which puzzles me even more because I expected the ~14-15 GB of model weights to be dispatched between the 12.7 GB of RAM and the 15 GB of GPU RAM… Is it because device_map="auto" does not dispatch model components properly between CPU and GPU?
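One thing I found while digging: the dispatcher can be given explicit per-device limits through max_memory, and anything that still does not fit can be spilled to disk with offload_folder. Below is a minimal sketch of that (the memory limits are illustrative assumptions for the Colab setup above, not measured values):

# Sketch (assumption: limits below leave headroom for activations and the tokenizer).
import torch
import transformers
from transformers import AutoTokenizer

model = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={
        # cap what may be placed on GPU 0 and on the CPU (illustrative values)
        "max_memory": {0: "13GiB", "cpu": "10GiB"},
        # spill any remaining layers to disk
        "offload_folder": "offload",
    },
)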

Thanks a lot for any link or discussion to help me understand what happens under the hood.

Regards

Yes, because we need the model to be sharded so that only one chunk of it has to be loaded into RAM at a time.
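If you can load the single checkpoint once on a machine with enough CPU RAM, you can re-save it in sharded form so the low-RAM machine only ever holds one shard in memory while loading. A sketch (the output directory name and shard size are arbitrary choices):

# Sketch (assumption: run on a machine with enough RAM to load the full model once).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1-mt")
# split the weights into ~2 GB shards plus an index file
model.save_pretrained("bloomz-7b1-mt-sharded", max_shard_size="2GB")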