Multi-GPU Machine Setup Guide and QnA

boris · April 30, 2021, 7:51pm

This is a WIKI post, so if you feel you can contribute, please answer a few questions, improve existing answers, add an alternative answer, or add new questions.

This thread is to discuss Multi-GPU machine setup for ML.

Basic Recommendations

Q. What are basic recommendations on how to design a multi-GPU machine?
It would be great to factor in price vs. performance (so we know how much we save vs. a pre-built machine).

A. See the links to the guides in the Resources section below.

Critical decisions to make

Q. What are the smartest decisions to make it future proof (mine is already obsolete)?

A. Computers are black holes that suck everything in and give little out (other than some RGB colors). There is no such thing as future proofing in modern computers, other than mechanical parts like your PC tower.

Q. Can we do it at all or is it necessary to redesign it every 1-2 years?

A. Ideally you just upgrade parts as they need upgrading, rather than replacing the whole PC. I still use a 10-year-old tower.

In-house vs. cloud

Q. Is it worth building a good local machine or should you just learn how to leverage the cloud?

A. Typically, for small setups (up to several consumer GPUs), a local machine is almost always worth it over the cloud, unless you find some upstart cloud provider that underprices its cost-per-hour for a while.

Pros:

Of course, it depends on your usage patterns. If you are going to use it once in a blue moon, cloud it is. If you use it a lot then local will be cheaper. You can calculate your costs to purchase the machine vs. renting it.
Not needing to worry about forgetting to turn the instance off and having the $$ counter running might be another plus.
Heat is good. Heat is bad. In cold countries a home-based ML server is a great help in keeping your working space warm. Not so much if you live in the tropics.

Cons:

If you want a lot of large GPUs you might not be able to build the machine on consumer-level hardware, or the cost might be prohibitive.
Electricity cost is another factor. Some cities have very expensive electricity, especially if you go over the “normal” usage quota that some electric companies have.
Hardware gets outdated fast, so your needs may quickly become larger than what you have. You may or may not be able to recover some of the investment when trying to sell your old hardware.
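
The purchase vs. rent calculation mentioned above can be sketched roughly as follows. This is only a minimal sketch: the function name and every number in the example (hardware price, resale value, cloud rate, power draw, electricity price) are made-up placeholders, not measurements or quotes, so plug in your own figures.

```python
def breakeven_hours(hardware_cost, resale_value, cloud_rate_per_hour,
                    power_draw_kw, electricity_price_per_kwh):
    """Hours of use after which owning becomes cheaper than renting.

    All inputs are assumptions you supply yourself:
    - hardware_cost / resale_value: what you pay now and hope to recover later
    - cloud_rate_per_hour: price of a comparable cloud instance
    - power_draw_kw * electricity_price_per_kwh: running cost of the local box
    """
    local_rate_per_hour = power_draw_kw * electricity_price_per_kwh
    saving_per_hour = cloud_rate_per_hour - local_rate_per_hour
    if saving_per_hour <= 0:
        # Electricity alone costs as much as the cloud: owning never pays off.
        return float("inf")
    return (hardware_cost - resale_value) / saving_per_hour


# Hypothetical figures, for illustration only.
hours = breakeven_hours(
    hardware_cost=7000,
    resale_value=2000,
    cloud_rate_per_hour=2.25,
    power_draw_kw=1.0,
    electricity_price_per_kwh=0.25,
)
print(f"Break-even after about {hours:.0f} hours of use")
```

If you only train once in a blue moon you may never reach the break-even point, which is the "cloud it is" case; with heavy daily use you reach it much sooner.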

Key components

Q. What are the main components to look for?
Q. Sample setups would be great too (and why they are great).

A.

Make sure your CPU has enough PCIe lanes to support all the cards you plan to use.
Make sure your motherboard has enough PCIe slots and that they are spaced far enough apart to fit modern GPUs, which take up 2 slots.
Research your PSU, so that it has enough spare power to handle those power-hungry GPUs.
Plan to have a lot of RAM, so ideally buy as large a single RAM stick as possible, i.e. try not to fill all the RAM slots from the get-go unless you are buying something like 256GB up front.
One or more NVMe slots are going to be super-important. Try to keep your OS on a different drive (e.g. an SSD); you don’t want to share your data NVMe with your OS operations.
Does the case have enough space for cooling, be it water cooling or lots of fans?
Definitely don’t buy those pre-packaged PCs from large retailers; you can’t mod those. Buy your own components and plan for expansion.
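
The PCIe lane and PSU checks from the list above can be turned into a quick sanity check before you buy anything. This is a rough sketch under stated assumptions: the `Build` fields, the 20% PSU headroom, and the example numbers are all hypothetical, and it simplifies by counting every NVMe drive as 4 CPU lanes (on real boards some lanes come from the chipset). Take the actual figures from your CPU, motherboard, GPU, and PSU spec sheets.

```python
from dataclasses import dataclass

LANES_PER_NVME = 4  # M.2 NVMe drives typically use a x4 link


@dataclass
class Build:
    cpu_pcie_lanes: int  # lanes the CPU provides (from its spec sheet)
    lanes_per_gpu: int   # lanes you plan to give each GPU (an assumption, e.g. 8 or 16)
    gpu_count: int
    nvme_drives: int
    gpu_power_w: int     # peak draw per GPU, in watts
    other_power_w: int   # CPU, drives, fans, etc., in watts
    psu_watts: int       # rated PSU output


def lanes_needed(build: Build) -> int:
    """Total PCIe lanes wanted by the GPUs and NVMe drives."""
    return build.gpu_count * build.lanes_per_gpu + build.nvme_drives * LANES_PER_NVME


def psu_watts_needed(build: Build, headroom_pct: int = 20) -> int:
    """Peak load plus a safety margin (the 20% default is an assumption), rounded up."""
    load = build.gpu_count * build.gpu_power_w + build.other_power_w
    return (load * (100 + headroom_pct) + 99) // 100


def check_build(build: Build) -> list[str]:
    """Return a list of human-readable problems; an empty list means no red flags."""
    problems = []
    if lanes_needed(build) > build.cpu_pcie_lanes:
        problems.append(
            f"need {lanes_needed(build)} PCIe lanes, CPU has {build.cpu_pcie_lanes}"
        )
    if psu_watts_needed(build) > build.psu_watts:
        problems.append(
            f"need about {psu_watts_needed(build)} W, PSU is rated {build.psu_watts} W"
        )
    return problems


# Hypothetical 4-GPU build, for illustration only.
example = Build(cpu_pcie_lanes=24, lanes_per_gpu=8, gpu_count=4, nvme_drives=1,
                gpu_power_w=250, other_power_w=300, psu_watts=1600)
for problem in check_build(example):
    print(problem)
```

A check like this only catches obvious mismatches; it says nothing about slot spacing, cooling, or PSU quality, which you still need to research separately.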

Purchase Timing

Q. Is now a good time to buy a GPU, and how do you know when there are good deals (prices seem a bit high right now)?

A. Black Friday in North America gives you by far the best deals. But don’t buy just because it’s BF; do your research, since some companies raise their prices instead of lowering them.

Resources

Blogs focusing on ML Hardware:

The Best 4-GPU Deep Learning Rig only costs $7000 not $11,000
Tim Dettmers’ great posts about choosing GPUs for deep learning and Hardware Guide to Deep Learning. The guides do not focus on distributed setup, but there are suggestions on multi-GPU machines and how to select a GPU for your task and budget.

dropout05 · April 30, 2021, 8:23pm

I would recommend checking out Tim Dettmers’ great posts about choosing GPUs for deep learning and Hardware Guide to Deep Learning. The guides do not focus on distributed setup, but there are suggestions on multi-GPU machines and how to select a GPU for your task and budget.

stas · April 30, 2021, 8:26pm

Thank you! merged it into the OP.

Please feel free to put your notes directly in there and we will progressively massage it into a readable/organized doc.

Sanyam · May 1, 2021, 4:41am

I’ve answered all of these Qs along with some tips on how to best air cool these in my recent video:

lewtun · May 1, 2021, 8:01am

thanks @Sanyam! i’ve added your video to the OP

radek · May 1, 2021, 12:22pm

I really liked this blog post by Emil Wallner. It has lots of good information, including some good insights on current hardware options (which will probably change in a couple of months).

Emil makes a very good point why a home rig is the way to go:

The main reason to own hardware is workflow. To not waste time on cloud savings and encourage robust experimentation.

I would also recommend this hardware guide by Tim Dettmers. It is the definitive resource, with timeless answers to many questions.

Two observations from Tim Dettmers’ guide worth highlighting:

the number of PCIe lanes is not as important as it seems
RAM timings are not important

Both of these points can save you a lot of money.

radek · May 1, 2021, 12:23pm

(had to split the post in two as new users can post max 2 links)

Other than that, the quality of PSUs really differs, so it matters which PSU you go for (the wattage stated by the manufacturer is next to meaningless). I did a bit of an investigation on this here.
