This is a WIKI post, so if you feel you can contribute, please answer a few questions, improve on existing answers, add an alternative answer, or add new questions.
This thread is to discuss Multi-GPU machine setup for ML.
Basic Recommendations
Q. What are basic recommendations on how to design a multi-GPU machine?
Would be great to factor in price vs performance (so we can know how much we save vs pre-built)?
A. See the links to the guides in the Resources section below.
Critical decisions to make
Q. What are the smartest decisions to make it future proof (mine is already obsolete)?
A. Computers are black holes that suck everything in and give little out (other than some RGB colors). There is no such thing as future proofing in modern computers, other than mechanical parts like your PC tower.
Q. Can we do it at all or is it necessary to redesign it every 1-2 years?
A. Ideally you just upgrade parts as they need upgrading, rather than replacing the whole PC. I still use a 10-year-old tower.
In-house vs. cloud
Q. Is it worth building a good local machine or should you just learn how to leverage the cloud?
A. Typically, for small setups of up to several consumer GPUs, a local machine is almost always a better deal than the cloud, unless you find some upstart cloud provider that underprices its cost-per-hour for a while.
Pros:
- Of course, it depends on your usage patterns. If you are going to use it once in a blue moon, cloud it is. If you use it a lot then local will be cheaper. You can calculate your costs to purchase the machine vs. renting it.
- Not needing to worry about forgetting to turn the instance off and having the $$ counter running might be another plus.
- Heat is good. Heat is bad. In cold countries a home-based ML server is a great adjunct to keeping your working space warm. Not so much if you live in the tropics.
Cons:
- If you want a lot of large GPUs you might not be able to build it on consumer-level hardware, or the cost might be prohibitively expensive.
- Electricity cost is another factor. Some cities have very expensive electricity, especially if you go over the “normal” usage quota that some electric companies have.
- Hardware gets outdated fast, so your needs may quickly become larger than what you have. You may or may not be able to recover some of the investment when trying to sell your old hardware.
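To make the purchase-vs-rent calculation mentioned above concrete, here is a minimal break-even sketch. It only accounts for hardware price, resale value, electricity and the cloud hourly rate, and it ignores things like storage fees or your own maintenance time. The function name and every number in the example are assumptions for illustration, not real quotes (the $7000 figure simply echoes the 4-GPU rig post listed under Resources).

```python
def breakeven_hours(hardware_cost, cloud_rate_per_hour,
                    power_watts, electricity_per_kwh, resale_value=0.0):
    """Return the hours of use after which owning becomes cheaper than renting.

    Returns None if the local electricity cost per hour is not lower than
    the cloud rate, in which case the local machine never pays for itself.
    """
    # Cost of running the local machine for one hour, electricity only.
    local_rate = (power_watts / 1000.0) * electricity_per_kwh
    saving_per_hour = cloud_rate_per_hour - local_rate
    if saving_per_hour <= 0:
        return None
    # Net up-front cost, assuming you recover resale_value when you sell later.
    return (hardware_cost - resale_value) / saving_per_hour


# Hypothetical inputs: a $7000 rig drawing 1500 W, electricity at $0.20/kWh,
# compared against a cloud instance billed at $3.00/hour.
hours = breakeven_hours(7000, 3.00, 1500, 0.20)
print(f"Break-even after about {hours:.0f} hours of use")
```

If you only train a few hours a month, the break-even point can be years away, which is the "once in a blue moon" case where the cloud wins.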
Key components
Q. What are the main components to look for?
Q. Sample setups would be great too (and why they are great).
A.
- Make sure your CPU has enough PCIe lanes to support all the cards you plan to use
- Make sure your motherboard has enough PCIe slots and that they are spaced far enough apart to fit modern GPUs that take up 2 slots.
- Research your PSU so that it has enough extra power to handle those power-hungry GPUs.
- Plan to have a lot of RAM, so ideally buy the largest individual RAM sticks you can, i.e. try not to fill all the RAM slots from the get-go, unless you are buying something like 256GB up front.
- NVMe slot or a few are going to be super-important. Try to have your OS on a different drive (e.g. SSD) - you don’t want to share your data NVMe with your OS operations.
- Does the box have enough space for cooling? Be it water cooling or lots of fans.
- Definitely don’t buy those pre-packaged PCs by large retailers, you can’t mod those. Buy your own components and plan for expansion.
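The PCIe-lane and PSU items in the checklist above are simple arithmetic, so it can help to write them down before ordering parts. The sketch below is one rough way to do that. The `Build` class, the field names, the 4 lanes per NVMe drive and the 20% PSU headroom are all assumptions made for this example. Real boards may route some slots through the chipset rather than the CPU, so always check the motherboard manual.

```python
from dataclasses import dataclass


@dataclass
class Build:
    cpu_pcie_lanes: int   # lanes provided by the CPU (from its spec sheet)
    gpu_count: int
    lanes_per_gpu: int    # e.g. 16 or 8, depending on how you plan to run the cards
    nvme_count: int
    gpu_watts: int        # peak power draw of one GPU
    other_watts: int      # CPU, drives, fans, etc. (rough estimate)
    psu_watts: int


def check_build(build, lanes_per_nvme=4, psu_headroom=1.2):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # Simplification: assume every GPU and NVMe drive uses CPU lanes.
    lanes_needed = (build.gpu_count * build.lanes_per_gpu
                    + build.nvme_count * lanes_per_nvme)
    if lanes_needed > build.cpu_pcie_lanes:
        problems.append(
            f"PCIe: need {lanes_needed} lanes, CPU has {build.cpu_pcie_lanes}")

    # Assumed rule of thumb: keep some margin above the estimated peak draw.
    watts_needed = (build.gpu_count * build.gpu_watts
                    + build.other_watts) * psu_headroom
    if watts_needed > build.psu_watts:
        problems.append(
            f"PSU: want about {watts_needed:.0f} W, PSU is {build.psu_watts} W")

    return problems


# Hypothetical 4-GPU plan on a CPU with too few lanes and an undersized PSU.
plan = Build(cpu_pcie_lanes=24, gpu_count=4, lanes_per_gpu=8, nvme_count=1,
             gpu_watts=350, other_watts=300, psu_watts=1600)
for problem in check_build(plan):
    print(problem)
```

This does not cover slot spacing or cooling, which you still have to verify against the physical dimensions of the case and cards.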
Purchase Timing
Q. Is now a good time to buy a GPU, and how do you know when there are good deals (prices seem a bit high right now)?
A. Black Friday in North America gives you by far the best deals. But don’t buy just because it’s BF. Do your research, since some companies raise their prices instead of lowering them.
Resources
Blogs focusing on ML Hardware:
- The Best 4-GPU Deep Learning Rig only costs $7000 not $11,000
- Tim Dettmers’ great posts about choosing GPUs for deep learning and Hardware Guide to Deep Learning. The guides do not focus on distributed setup, but there are suggestions on multi-GPU machines and how to select a GPU for your task and budget.