Hi,
My question is about model loading: how is it possible that one model can be loaded on the GPU while another one with the same number of parameters (and even slightly lighter on disk) cannot?
For a concrete case, I’m trying to load a “text-generation” model on free Google Colab (12.7 GB of RAM and a 15 GB GPU) using:
import torch
from transformers import AutoTokenizer
import transformers
model = "tiiuae/falcon-7b-instruct"
# model = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",
)
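For reference, when a load does succeed I inspect the resulting placement like this (my understanding is that accelerate records it in hf_device_map whenever device_map="auto" is used, and get_memory_footprint reports the size of the loaded weights in bytes):

# hf_device_map maps each module name to "cpu", "disk", or a GPU index.
print(pipeline.model.hf_device_map)
# Approximate size of the loaded weights, in bytes.
print(pipeline.model.get_memory_footprint())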
While it works for this model (9.95 + 4.48 = 14.43 GB of checkpoints on disk), it does not work for model = "bigscience/bloomz-7b1-mt" (a single 14.10 GB checkpoint on disk): loading exceeds the 12.7 GB of RAM. What is the root cause? Is it because the first model is sharded into two checkpoints?
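To compare the two checkpoints, I listed the per-file sizes on the Hub (a quick sketch; my assumption is that with a sharded checkpoint the weights are loaded one shard at a time, so peak CPU RAM is closer to the size of the largest shard than to the total):

from huggingface_hub import HfApi

api = HfApi()
for repo_id in ["tiiuae/falcon-7b-instruct", "bigscience/bloomz-7b1-mt"]:
    info = api.model_info(repo_id, files_metadata=True)
    print(repo_id)
    for f in info.siblings:
        # Only the weight files matter for the RAM question.
        if f.rfilename.endswith((".bin", ".safetensors")):
            print(f"  {f.rfilename}: {f.size / 1e9:.2f} GB")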
So I moved to a CPU offload strategy for the second model, using:
import torch
from transformers import BitsAndBytesConfig, AutoTokenizer
import transformers
model = "tiiuae/falcon-7b-instruct"
# model = "bigscience/bloomz-7b1-mt"
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={"quantization_config": quantization_config},
)
Here, loading either model exceeds memory, which puzzles me even more because I expected the ~14-15 GB of model weights to be dispatched between the 12.7 GB of CPU RAM and the 15 GB of GPU RAM… Is it because device_map="auto" does not dispatch the model components properly between CPU and GPU?
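If it helps the discussion, here is the variant I was planning to try next, giving device_map="auto" explicit caps through max_memory (a sketch only: the GiB values are my guesses for the Colab limits, leaving some headroom for the CUDA context and activations, not measured figures):

# Hypothetical caps for free Colab; 0 is the single GPU, "cpu" is system RAM.
max_memory = {0: "13GiB", "cpu": "10GiB"}

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={
        "quantization_config": quantization_config,
        "max_memory": max_memory,
    },
)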
Thanks a lot for any link or discussion to help me understand what happens under the hood.
Regards