How much memory is required to load T0pp?

Hi, I’m trying to load the T0pp model (49 GB). However, after quite a while, the system threw a read error. I suppose my machine doesn’t have enough memory to load it. Does anyone know how much memory is required to load the model? Or is there any trick to work around it? Thank you very much.


Hi,

Looking at the model repo, it seems to be 41.5 GB. However, you actually need twice as much CPU RAM to load the model: when calling .from_pretrained(), the model effectively gets loaded twice, once with randomly initialized weights and once with the pretrained weights. @stas has added a new (experimental) argument called low_cpu_mem_usage, which can be set to True in order to load the model only once into CPU memory (directly with the pretrained weights); see this PR. Using that argument, it requires at least 41.5 GB of CPU RAM.
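A minimal sketch of what that call looks like (the guard flag and the choice of AutoModelForSeq2SeqLM are my own illustration; the real load downloads ~41.5 GB of weights, so it is gated behind a flag here):

```python
# Sketch: load T0pp only once into CPU RAM via the experimental flag.
# The actual load is gated behind a flag because it fetches ~41.5 GB of weights.
RUN_HEAVY_LOAD = False  # set to True on a machine with >= 42 GB of free CPU RAM

if RUN_HEAVY_LOAD:
    from transformers import AutoModelForSeq2SeqLM

    model = AutoModelForSeq2SeqLM.from_pretrained(
        "bigscience/T0pp",
        low_cpu_mem_usage=True,  # load the weights once, skipping the random-init copy
    )
```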

Next, if you want to perform inference on GPU, you also need at least the same amount of GPU RAM (41.5 GB) to put the model on it, plus some extra space for the data you feed it and for the activations (i.e. the logits).
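A quick back-of-the-envelope check of those numbers (pure arithmetic, assuming fp32 weights and the ~11B parameter count reported further down by the DeepSpeed estimator):

```python
# Rough memory math for T0pp: ~11B parameters stored in fp32.
params = 11_003_000_000      # total parameter count (from the DeepSpeed estimate below)
bytes_per_param = 4          # fp32 = 4 bytes per parameter
weights_gb = params * bytes_per_param / 2**30

print(f"weights alone:           {weights_gb:.1f} GB")        # ~41 GB, matching the repo size
print(f"plain from_pretrained(): {2 * weights_gb:.1f} GB")    # peak CPU RAM with two copies
```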


Additionally, consider using DeepSpeed with offload enabled:

This model is the same size as t5-11b, so the same setup applies to T0. E.g., here is info on how to load it on a single 40GB GPU for fine-tuning:

The techniques in Performance and Scalability: How To Fit a Bigger Model and Train It Faster (transformers 4.12.0.dev0 documentation) will further reduce memory usage.

Re: DeepSpeed usage for this model, here is the breakdown for 1 GPU (everything but the activation memory):

python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("bigscience/T0pp"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)'
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 11003M total params, 131M largest layer params.
  per CPU  |  per GPU |   Options
  276.70GB |   0.49GB | cpu_offload=1, cpu_offload_params=1, zero_init=1
  276.70GB |   0.49GB | cpu_offload=1, cpu_offload_params=1, zero_init=0
  245.95GB |  20.99GB | cpu_offload=1, cpu_offload_params=0, zero_init=1
  245.95GB |  20.99GB | cpu_offload=1, cpu_offload_params=0, zero_init=0
    0.74GB | 184.95GB | cpu_offload=0, cpu_offload_params=0, zero_init=1
   61.49GB | 184.95GB | cpu_offload=0, cpu_offload_params=0, zero_init=0

So you can see that either you need a huge amount of CPU RAM (you can also use NVMe for offloading!), in which case any tiny GPU will do, or you can run it on a single GPU with 40GB or more.
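For reference, a minimal ZeRO-3 config fragment with CPU offload for both optimizer states and parameters might look like this (a sketch based on the DeepSpeed docs; tune the values for your setup, and swap `"device": "cpu"` for `"device": "nvme"` plus an `nvme_path` to offload to NVMe instead):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  },
  "fp16": { "enabled": "auto" },
  "train_batch_size": "auto"
}
```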

Change num_gpus_per_node=1 to your number of gpus to get the estimate for your setup.

And remember that additional memory will be needed for activations, which depends on the batch size and sequence length.


Thanks, nielsr. That’s very helpful. I’ll give it a try.

Thanks sir, very helpful. I’ll try running DeepSpeed to see if it works.