Prerequisites to run BLOOM locally?

Can anyone tell me how much RAM, GPU RAM, and disk space is required to run BLOOM locally? I have tried to run it, and it has downloaded 180 GB of data so far with the download still in progress. If it finishes, what are the chances of running it locally?
I have an RTX 3070.

I haven’t had a chance to try it yet, but I’ve read in the official Slack channel that it requires something like 8×80GB A100s or 16×40GB A100s to perform inference locally.
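Those figures line up with a back-of-the-envelope estimate: BLOOM has 176B parameters, so at 2 bytes per parameter (fp16/bf16) the weights alone need roughly 352 GB, before counting activations or the KV cache. A quick sketch (pure arithmetic, no library calls):

```python
# Rough memory estimate for the BLOOM-176B weights only
# (activations, KV cache, and framework overhead come on top).
PARAMS = 176e9

def weights_gb(bytes_per_param: float) -> float:
    """Gigabytes needed just to hold the weights."""
    return PARAMS * bytes_per_param / 1e9

fp16 = weights_gb(2)  # half precision
int8 = weights_gb(1)  # 8-bit quantized

print(f"fp16 weights: {fp16:.0f} GB")  # 352 GB -> ~8x80GB or 16x40GB A100s
print(f"int8 weights: {int8:.0f} GB")  # 176 GB -> still far beyond one RTX 3070
```

Even the 8-bit quantized weights are an order of magnitude larger than the 8 GB of VRAM on an RTX 3070, which is why offloading (discussed below) is the only way to run it on consumer hardware.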

And according to the “how to use” section of the model card, you should not only have transformers installed but also the accelerate library.

PS: Check out this quantized version of BLOOM if the original model doesn’t fit on your hardware.

You can run it on less than this, as long as you have enough disk space (and plenty of time to wait): Accelerate automatically offloads weights to the CPU if there is no more space on the GPU, and then to disk if there is no more CPU RAM.

For your reference, I can run it on 8×48GB A6000 GPUs to perform inference locally, using the Accelerate package. I am also wondering if there is a way to distribute the model layers across two machines.

In case it helps, I wrote a blog post that shows how to run BLOOM (the largest, 176B version) on a desktop computer, even if you don’t have a GPU. On my computer (i5 11th gen, 16GB RAM, 1TB Samsung 980 Pro SSD), generation takes about 3 minutes per token using only the CPU, which is a little slow but manageable. See the blog post link below.


That’s really nice that Accelerate enables that!
But I wonder how compatible it is with Slurm.
I tried to load BLOOM on a single node with 1 GPU on the Jean Zay cluster, but model loading was killed because of an out-of-memory error:
“slurmstepd: error: Detected 1 oom-kill event(s) in StepId=992739.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.”

I was expecting Accelerate to offload the model weights to disk when there’s not enough RAM, but Slurm’s OOM detector killed it first!

Any idea how to make Accelerate work with Slurm?
Thank you!

Sorry, my bad: on this shared cluster, Accelerate looks at the full RAM available, but the Jean Zay environment kills the job when one process on a shared node reaches some limit (I don’t know exactly how much). So I had to set a memory limit for Accelerate with:

from transformers import AutoModelForCausalLM

max_memory_mapping = {"cpu": "24GB", 0: "14GB"}
mod = AutoModelForCausalLM.from_pretrained(
    "/pathto/model",
    low_cpu_mem_usage=True,
    device_map="auto",
    offload_folder=offload_dir,
    max_memory=max_memory_mapping,
)

Then it works (although very slowly obviously…)

If I add enough RAM to my system to load the whole 330 GB into memory, how long do you think a token would take to generate? A couple of seconds, or would I just be completely CPU-bound at that point? I’d be building a system from scratch primarily for this purpose, but more than one higher-end RTX card is out of my budget.

Having enough RAM to hold the entire model would reduce the execution time; however, you would still be CPU-bound. I did a quick test and, once a BLOOM block is in RAM, my CPU (i5 11th gen) takes on average 0.45 s to run a forward pass on a single BLOOM block. Therefore, assuming all 70 blocks are already in RAM, you could expect around 70 × 0.45 s = 31.5 s per token.
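That estimate generalizes to a tiny helper (the block count and 0.45 s per-block latency are the figures from the test above; your CPU's numbers will differ):

```python
def seconds_per_token(n_blocks: int = 70, sec_per_block: float = 0.45) -> float:
    """Estimated wall time per generated token, assuming all transformer
    blocks are already resident in RAM and run sequentially on the CPU."""
    return n_blocks * sec_per_block

per_token = seconds_per_token()
print(f"{per_token:.1f} s per token")                     # 31.5 s
print(f"{100 * per_token / 60:.1f} min for 100 tokens")   # 52.5 min
```

So even with the whole model in RAM, generating a 100-token reply would take on the order of an hour on a CPU of that class.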
