In any market or industry, quality work at scale requires purpose-built high-performance tools.
In a previous article (The Eight Challenges You'll Face With On-Premise Artificial Intelligence), we promised to cover this topic in more depth in a separate post. Well, we like to keep our promises, so here we are.
It turns out that three of the challenges we listed (Servers, Networking, and Storage) all come down to the quality of the hardware, and to how the wrong hardware can significantly diminish or even eliminate the performance gains you hoped to achieve with it.
The biggest appeal of consumer-grade hardware is the price. Almost everything is cheaper: CPU, memory, motherboard, GPU, PSU, storage, etc. And it is cheaper even when the consumer-grade component's specs are almost the same as those of its server-grade counterpart.
The second appeal is availability. You can walk into any computer hardware store and come out with the parts you need to put together an exemplary configuration capable of some serious computing right out of the box, instead of going through the purchase process for server-grade hardware and waiting for its delivery.
Finally, the knowledge base available for consumer-grade hardware is more abundant. If you do a quick search on Google or YouTube for tutorials on how to put together a consumer-grade piece of hardware, the odds are that you’ll find several tutorials and reviews. One of the most common reasons why people are hesitant when ordering server-grade hardware is the fear of making a mistake and ordering something costly that won’t work for that application. Unfortunately, companies selling server-grade hardware online aren’t exactly known for making the process any easier.
Those are all positive points going for consumer-grade solutions, but in our experience, the negative points far outweigh the positives.
Server-grade hardware is designed to run 24x7 in mission-critical operations without hitches. This is not to say that it is fail-proof, but it is a lot more resilient than its consumer-grade counterparts. Aside from the non-stop operational requirement, these systems run in a different environment and under a significantly higher, sustained load.
Server-grade components are installed and operated in tight spaces to increase computational density. In order to avoid overheating, manufacturers have to agree on standards that consumer-grade hardware does not need to follow. One good example is that neither GPUs nor CPUs use a dedicated fan in server environments. Instead, they rely on the system’s fans and a well-designed heatsink that is positioned to promote the most heat exchange possible with the air circulating inside the chassis. These standards are designed to reduce turbulent airflow and the trapping of hot air pockets inside the system. And, by the way, those system fans are very powerful. If you have ever stepped into a server room, you have undoubtedly noticed them!
If you had to run these fans inside your desktop, you’d need baffles to keep your hearing (and sanity) when working next to it for hours every day. Instead of turbine-like fans, desktop computers rely on the fact that they don’t see much load on average and don’t need to run as many hours. They also enjoy more generous space inside the case, which allows for oversized heatsinks and fans that keep the system cool quietly. Desktops don’t pay much attention to preventing turbulence or hot air pockets inside the case because, again, there is not enough load to require that kind of optimization.
When you place these components in a high-density environment, they’ll start to fail sooner than expected. And in some cases, they can even pose a fire hazard. Here is a neat [example] to illustrate my point: an EVGA GTX 1080 catching fire during system boot on camera due to poor engineering and lack of standards by EVGA.
Similar to Formula 1, server-grade hardware typically enjoys early access to better and more performant standards and technology. Some of this technology makes it to consumers, and some doesn’t. Take Nvidia’s SXM4 socket. It allows GPUs to communicate at up to ten times the speed of PCIe and offers many other optimizations that are not available to consumers and probably never will be.
Flagship gaming GPUs enjoy a generous amount of RAM, but none of them comes close to their current server-grade counterparts. Whereas Nvidia’s GeForce RTX 3090 and Nvidia’s Titan RTX each boast 24 GB, Nvidia’s A100 is available in 40 GB and 80 GB versions, with the highest memory bandwidth on the market at the time of writing.
When it comes to system RAM, server-grade hardware also has the advantage. Flagship consumer-grade motherboards today can host up to 256 GB of RAM, while most server-grade motherboards can run with up to 2 TB of RAM in their banks.
Networking is the other field in which servers have access to much faster, lower-latency technologies that consumers will likely never need. The current baseline speed for any respectable AI cluster, for example, is 100 Gbps. When operating with the proper hardware, you can even have RDMA at your disposal, which allows your GPU to essentially bypass the rest of the system and load its memory with data coming straight from the network, significantly increasing the performance you can get out of your hardware.
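To put the link speed in perspective, here is a minimal back-of-the-envelope sketch. It assumes an ideal link with no protocol overhead, and the 1 TB dataset size and the 1 and 10 Gbps comparison points are our own illustrative assumptions, not measurements.

```python
def transfer_seconds(size_gigabytes: float, link_gbps: float) -> float:
    """Ideal time to move size_gigabytes over a link rated at link_gbps.

    Uses decimal units (1 gigabyte = 8 gigabits) and ignores protocol
    overhead, so real transfers will take longer than this estimate.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return (size_gigabytes * 8) / link_gbps


# Hypothetical dataset size, chosen only for illustration.
DATASET_GB = 1000

for gbps in (1, 10, 100):
    minutes = transfer_seconds(DATASET_GB, gbps) / 60
    print(f"{gbps:>3} Gbps: {minutes:.1f} minutes")
# Prints:
#   1 Gbps: 133.3 minutes
#  10 Gbps: 13.3 minutes
# 100 Gbps: 1.3 minutes
```

Even under these idealized assumptions, the gap between a typical consumer link and a 100 Gbps fabric is the difference between waiting over two hours and waiting about a minute every time the data has to move.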
All hardware can be sold on eBay, but server-grade hardware enjoys a unique market, with several companies ready to bid on your used servers, take them off your hands, refurbish them, and resell them. When the time comes to upgrade your infrastructure, this is one more problem you won’t have to deal with.
The ability to provision and control computers remotely is standardized and readily available with server-grade hardware. This means that instead of having to make a trip to the server room when a machine fails, you can monitor and access the machine remotely from wherever you are and solve most issues even if the device is turned off. That is because almost all server-grade hardware contains a baseboard management controller (BMC) that complies with the IPMI specification. Virtually no consumer-grade motherboard has this capability, so every hardware failure or health check means a trip to the server room.
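As a rough illustration of what that remote control looks like in practice, the sketch below builds a command line for `ipmitool`, a widely used open-source IPMI client, to query or change a machine's power state through its BMC. The host address and user name are hypothetical placeholders, and the helper function is our own; it assumes the BMC is reachable over the network and that the password is supplied through the `IPMI_PASSWORD` environment variable (which is what `ipmitool`'s `-E` flag reads).

```python
from typing import List

# Power actions accepted by "ipmitool chassis power".
ALLOWED_ACTIONS = {"status", "on", "off", "cycle", "reset", "soft"}


def ipmi_power_command(host: str, user: str, action: str = "status") -> List[str]:
    """Build (but do not run) an ipmitool command for a remote BMC.

    host and user are placeholders supplied by the caller; the password
    is expected in the IPMI_PASSWORD environment variable (-E flag).
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return [
        "ipmitool", "-I", "lanplus",  # IPMI v2.0 over LAN
        "-H", host, "-U", user, "-E",
        "chassis", "power", action,
    ]


# Hypothetical BMC address and account, for illustration only.
cmd = ipmi_power_command("10.0.0.42", "admin")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True, capture_output=True)
```

Because the BMC runs independently of the host operating system, the same command works whether the server is up, hung, or powered off, which is exactly what a consumer motherboard cannot offer.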
Here at Amalgam, we’ve learned several lessons from our attempts to use off-the-shelf solutions to do serious business. After a lot of suffering and compromises, we achieved a certain level of success with them. In the end, though, this experience taught us that the losses caused by using consumer-grade hardware in a server environment exceed the savings gained from the price delta between the two categories of equipment. Yes, you’ll incur losses from hardware failure, but your most significant losses are going to be in maintenance and productivity.
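One way to reason about this trade-off for your own situation is a simple total-cost comparison. The sketch below is only a toy model: every figure in it (prices, failure rates, hours, hourly cost) is a made-up placeholder, not data from our deployments, and the outcome depends entirely on the numbers you plug in.

```python
from dataclasses import dataclass


@dataclass
class HardwareOption:
    name: str
    purchase_price: float              # upfront cost
    failures_per_year: float           # expected hardware failures
    replacement_cost: float            # parts cost per failure
    downtime_hours_per_failure: float  # lost productive hours per failure
    maintenance_hours_per_year: float  # routine hands-on work


def total_cost(option: HardwareOption, years: float, hourly_cost: float) -> float:
    """Purchase price plus failure, downtime, and maintenance costs."""
    failures = option.failures_per_year * years
    hardware = failures * option.replacement_cost
    downtime = failures * option.downtime_hours_per_failure * hourly_cost
    maintenance = option.maintenance_hours_per_year * years * hourly_cost
    return option.purchase_price + hardware + downtime + maintenance


# All values below are hypothetical placeholders for illustration.
CONSUMER = HardwareOption("consumer-grade", 6000, 1.0, 500, 16, 40)
SERVER = HardwareOption("server-grade", 15000, 0.5, 1000, 4, 10)

for option in (CONSUMER, SERVER):
    cost = total_cost(option, years=3, hourly_cost=100)
    print(f"{option.name}: ${cost:,.0f} over 3 years")
```

The point of the model is its structure rather than its output: the purchase price is a one-time term, while downtime and maintenance scale with both time and the cost of your team's hours.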