Archive

The Dulin Report

Browsable archive from the WordPress export.

2015

On Managing Stress, Multitasking and Other New Year's Resolutions (Jan 1, 2015)
Configuring Master-Slave Replication With PostgreSQL (Jan 31, 2015)
Trying to Replace Cassandra with DynamoDB? Not so fast (Feb 2, 2015)
On apprenticeship (Feb 13, 2015)
Where AWS Elastic BeanStalk Could be Better (Mar 3, 2015)
Finding Unused Elastic Load Balancers (Mar 24, 2015)
Do not apply data science methods without understanding them (Mar 25, 2015)
Microsoft and Apple Have Everything to Lose if Chromebooks Succeed (Mar 31, 2015)
Two developers choose to take a class (Apr 1, 2015)
What can Evernote Teach Us About Enterprise App Architecture (Apr 2, 2015)
Exploration of the Software Engineering as a Profession (Apr 8, 2015)
Ordered Sets and Logs in Cassandra vs SQL (Apr 8, 2015)
Building a Supercomputer in AWS: Is it even worth it? (Apr 13, 2015)
Apple is (or was) the Biggest User of Apache Cassandra (Apr 23, 2015)
My Brief Affair With Android (Apr 25, 2015)
Why I am not Getting an Apple Watch For Now: Or Ever (Apr 26, 2015)
The Clarkson School Class of 2015 Commencement (May 5, 2015)
The Clarkson School Class of 2015 Commencement speech (May 5, 2015)
We Need a Cloud Version of Cassandra (May 7, 2015)
Guaranteeing Delivery of Messages with AWS SQS (May 9, 2015)
Smart IT Departments Own Their Business API and Take Ownership of Data Governance (May 13, 2015)
Big Data is not all about Hadoop (May 30, 2015)
The longer the chain of responsibility the less likely there is anyone in the hierarchy who can actually accept it (Jun 7, 2015)
Your IT Department's Kodak Moment (Jun 17, 2015)
Attracting STEM Graduates to Traditional Enterprise IT (Jul 4, 2015)
Book Review: "Shop Class As Soulcraft" By Matthew B. Crawford (Jul 5, 2015)
The Three Myths About JavaScript Simplicity (Jul 10, 2015)
Social Media Detox (Jul 11, 2015)
Big Data Should Be Used To Make Ads More Relevant (Jul 29, 2015)
On Maintaining Personal Brand as a Software Engineer (Aug 2, 2015)
Ten Questions to Consider Before Choosing Cassandra (Aug 8, 2015)
What Every College Computer Science Freshman Should Know (Aug 14, 2015)
We Live in a Mobile Device Notification Hell (Aug 22, 2015)
Top Ten Differences Between ActiveMQ and Amazon SQS (Sep 5, 2015)
Setting Up Cross-Region Replication of AWS RDS for PostgreSQL (Sep 12, 2015)
I Stand With Ahmed (Sep 19, 2015)
Banking Technology is in Dire Need of Standardization and Openness (Sep 28, 2015)
IT departments must transform in the face of the cloud revolution (Nov 9, 2015)
Operations costs are the Achilles' heel of NoSQL (Nov 23, 2015)
Our civilization has a single point of failure (Dec 16, 2015)

Building a Supercomputer in AWS: Is it even worth it?

April 13, 2015

Columbia Supercomputer. Photo credit: Scott Beale.

The fact that Cray is still around is mind-boggling. You'd think commodity hardware and networking would have long since made supercomputing affordable for anyone interested. And yet, Cray Sells One of the World's Fastest Systems:
“This, to IDC’s knowledge, is the largest supercomputer sold into the O&G sector and will be one of the biggest in any commercial market,” the report stated. “The system would have ranked in the top dozen on the November 2014 list of the world’s Top500 supercomputers.”

Building one of the dozen fastest supercomputers isn’t new for Cray – they’ve got three in the current top 12 now. But what is unique is that most of those 12 belong to government research labs or universities, not private companies. This may be starting to change, however. For example, IDC notes that overall supercomputing spending in the oil and gas sector alone is expected to reach $2 billion in the period from 2013-2018.

Supercomputers come with astronomical costs:
So, you’re in the market for a top-of-the-line supercomputer. Aside from the $6 to $7 million in annual energy costs, you can expect to pay anywhere from $100 million to $250 million for design and assembly, not to mention the maintenance costs.

In the 1990s I was involved in a student project to build a Linux Beowulf cluster out of commodity components. It involved half a dozen quad-processor servers with something like a gigabyte of RAM each. It cost a fortune, and it required us to obtain NSF funding for the project. I don't recall the exact details.

I know that a similarly configured modern cluster in AWS would cost a few hundred bucks a month if it were used continuously. But even the cluster we built at Clarkson was not used 24/7, so if done right the same cluster would cost a fraction of that in AWS.
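To put a rough number on that intuition, here is a back-of-the-envelope sketch in Python. The instance type, hourly rate, and utilization figures are assumptions for illustration, not actual AWS prices:

```python
# Back-of-the-envelope cluster cost. All rates below are assumptions
# for illustration; check current AWS pricing before trusting them.
HOURS_PER_MONTH = 732  # the figure the AWS calculator used at the time

def monthly_cost(nodes, hourly_rate, utilization=1.0):
    """Monthly cost of `nodes` on-demand instances at a given duty cycle."""
    return nodes * hourly_rate * HOURS_PER_MONTH * utilization

# Six modest nodes at an assumed ~$0.10/hour each:
print(monthly_cost(6, 0.10))                    # ~$439 running 24/7
print(monthly_cost(6, 0.10, utilization=0.10))  # ~$44 at 10% utilization
```

Pay-per-use is exactly where the cloud wins for bursty academic workloads like ours was.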

It turns out I am not the only one who has had the idea of building a Beowulf cluster in AWS:
After running through Amazon’s EC2 Getting Started Guide, and Peter’s posts I was up and running with a new beowulf cluster in well under an hour. I pushed up and distributed some tests and it seems to work. Now, it’s not fast compared to even a low-end contemporary HPC, but it is cheap and able to scale up to 20 nodes with only a few simple calls. That’s nothing to sneeze at and I don’t have to convince the wife or the office to allocate more space to house 20 nodes.

That last statement is important. Setting aside the costs, imagine the red tape involved in putting something like that together with the help of your on-premises IT department.
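For a sense of what "a few simple calls" looks like, here is a minimal sketch using boto3; the AMI ID, key pair name, and region are hypothetical placeholders:

```python
# Minimal sketch: spin up a small Beowulf-style cluster on EC2 with boto3.
# The AMI ID and key name below are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-12345678",   # hypothetical AMI with your MPI stack baked in
    InstanceType="c4.xlarge",
    KeyName="my-hpc-key",     # hypothetical key pair
    MinCount=20,              # fail unless all 20 nodes can be allocated
    MaxCount=20,
)

node_ips = [i["PrivateIpAddress"] for i in response["Instances"]]
print(node_ips)               # feed these into your MPI hostfile
```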

At an AWS Summit a couple of years ago, Bristol-Myers Squibb gave a talk on running drug trial simulations in AWS:
Bristol-Myers Squibb (BMS) is a global biopharmaceutical company committed to discovering, developing and delivering innovative medicines that help patients prevail over serious diseases. BMS used AWS to build a secure, self-provisioning portal for hosting research so scientists can run clinical trial simulations on-demand while BMS is able to establish rules that keep compute costs low. Compute-intensive clinical trial simulations that previously took 60 hours are finished in only 1.2 hours on the AWS Cloud. Running simulations 98% faster has led to more efficient and less costly clinical trials and better conditions for patients.

If I interpret that case study correctly, BMS didn't even bother with an on-premises supercomputer for this.

AWS, of course, is happy to oblige:
AWS allows you to increase the speed of research by running high performance computing in the cloud and to reduce costs by providing Cluster Compute or Cluster GPU servers on-demand without large capital investments. You have access to a full-bisection, high bandwidth network for tightly-coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications.
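The "full-bisection, high bandwidth network" in that pitch corresponds to EC2 cluster placement groups. A minimal sketch, again with boto3 and hypothetical names:

```python
# Minimal sketch: a cluster placement group keeps tightly-coupled nodes
# on a low-latency, high-bandwidth network segment.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_placement_group(GroupName="hpc-cluster", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-12345678",                  # hypothetical AMI
    InstanceType="c4.8xlarge",
    MinCount=8,
    MaxCount=8,
    Placement={"GroupName": "hpc-cluster"},  # pin all nodes to the group
)
```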

So, what would it cost to set up one of the world's most powerful supercomputers in AWS and run it for one month? I fully realize that this may not be a very accurate exercise, but let's humor ourselves and try to imagine the biggest of the Top 500 supercomputers built in AWS.

As of June 2013, the biggest supercomputer was Tianhe-2 at the National University of Defense Technology in China, with 3,120,000 CPU cores. Let's eyeball this in AWS using Amazon's cost calculator. I put together a couple of different HPC configurations.

Amazon's g2.2xlarge instances have 8 cores and 15 gigabytes of RAM each. To get to 3,120,000 cores, one would need 390,000 instances, which would cost $185,562,000.00 for a month, not including business support.

If you use No-Upfront Reserved Instances for one year, the cost becomes $134,947,800.00 per month. Three-Year All-Upfront Reserved Instances cost $2,889,900,000.00 up front and $100 a month.

Now, here is an important factor. On premises, you have to build out the maximum capacity you will ever use; in the cloud, you can dynamically scale up and down as your workload requires. Whereas supercomputing was once the domain of governments and wealthy corporations, it is now within reach of anyone building out in AWS.

Let's try this with c4.8xlarge, which has 36 cores per instance, so we would need roughly 86,667 of them. On-demand, this costs $119,901,600.00 a month. Three-Year All-Upfront is $1,609,335,000.00.
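For the curious, here is a small Python sketch that approximately reproduces the on-demand figures. The hourly rates are assumptions back-solved from the totals above (roughly 2015 on-demand prices), and the calculator appears to have assumed a 732-hour month:

```python
import math

# Rough reproduction of the cost estimates above. Hourly rates are
# assumptions back-solved from the article's totals (circa-2015 on-demand
# pricing); the AWS calculator of the day assumed a 732-hour month.
TARGET_CORES = 3_120_000
HOURS_PER_MONTH = 732

def monthly_on_demand(cores_per_instance, hourly_rate):
    instances = math.ceil(TARGET_CORES / cores_per_instance)
    return instances, instances * hourly_rate * HOURS_PER_MONTH

# g2.2xlarge: 8 cores, assumed $0.65/hour
print(monthly_on_demand(8, 0.65))   # (390000, 185562000.0)

# c4.8xlarge: 36 cores, assumed ~$1.89/hour
print(monthly_on_demand(36, 1.89))  # (86667, ~$119.9 million)
```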

Of course, we don't even know whether such a thing is possible on AWS: can you quickly spin up a few hundred thousand servers, and how long would that take? It would probably require a conversation with AWS sales, and probably a volume discount. Either way, something tells me that for such large, specialized computational workloads it would be naive to assume that building a supercomputer in the cloud would be cheaper.

This is why renting supercomputing time is still more efficient than either owning one or trying to spin one up in the cloud.