
MIT professor pushes computing limits with the largest cluster ever built in the public cloud

Using Google Compute Engine to manage the L-Functions and Modular Forms Database (LMFDB), Andrew V. Sutherland, a computational number theorist and Principal Research Scientist at MIT, breaks his own high-performance computing record, reaching 580,000 cores.

Walk into a math philosophy class and you’ll likely hear talk about “objects.” Objects are essentially anything on which you can perform mathematics, such as numbers or functions, or the results of mathematical calculations, such as curves. The L-Functions and Modular Forms Database (LMFDB) is a detailed atlas of objects and the connections among them. LMFDB is a collaboration between international researchers and is guided by an international team based at universities in Europe and North America, including MIT.

Your whole outlook on research changes when you can ask a question and get an answer in hours rather than months.

Andrew V. Sutherland, Computational number theorist and Principal Research Scientist, MIT

Sharing data across researchers

LMFDB advances science by making it far easier for researchers to share data about objects with the physics, computer science and mathematics communities around the world. Some of the calculations to create the objects are so complex that only a few people on Earth know how to perform them. Other calculations are so big it’s best to run them only once because of how time-consuming and expensive they are to perform.

The team running LMFDB needed a cloud service that could handle their growing storage requirements. To put things into perspective, it has taken nearly 1,000 years of compute time to create the objects within LMFDB. Beyond the massive problem of storage was the issue of scale. LMFDB is available to anyone at lmfdb.org, meaning the project needed to scale to support the countless searches performed daily. Finally, because LMFDB is a collaborative project, the team needed a system that people in different countries could easily administer.

Focusing on research not infrastructure

The LMFDB team looked at several cloud solutions and chose Google Cloud because of its high performance, ability to scale automatically, ease of use and reliability.

One of the primary researchers involved in LMFDB and the decision-making process was Andrew V. Sutherland, a math professor, computational number theorist and principal research scientist at MIT.

“We are mathematicians who want to focus on our research, and not have to worry about hardware failures or scaling issues with the website,” says Sutherland.

Sutherland and the rest of the LMFDB team opted to use Google Compute Engine (GCE) and Google Persistent Disk to host the web servers, and mirrored MongoDB databases to store a half-terabyte of online data and three terabytes of less frequently accessed data. This set-up allows LMFDB to scale as needed, and deliver computational results and mathematical objects quickly when researchers need them. LMFDB also uses a variety of Google Cloud tools that allow researchers in different parts of the world to more easily and collaboratively manage the database. These tools include Google Stackdriver, Google Cloud Console and Google Cloud Load Balancing.

Sutherland had a particularly complex tabulation he needed to perform and store in LMFDB, one so massive it would require computing power beyond the limits of what had previously been done in the public cloud. To do it, he chose GCE and ran 580,000 cores with preemptible VMs — the largest known high-performance computer cluster ever run in the public cloud.

The calculation resulted in 70,000 different curves, each with its own LMFDB entry. Finding just one of those curves is an exceedingly complex task requiring a high number of computing cycles. “It’s like searching for a needle in a fifteen-dimensional haystack,” Sutherland says.

Before turning to GCE to perform the calculation, Sutherland had run jobs on his own 64-core computer, which took far too long. His only alternative was to obtain compute time on MIT’s clusters, which could be difficult to get and limited the software configurations he could use. With GCE, he can use as many cores as he requires, install the precise operating system, libraries and applications he needs and update the environment whenever he wants.

Thanks to the scalability Google Cloud gives to LMFDB, everyone from students to experienced researchers can easily search and navigate its contents via a web interface. For instance, Sutherland teaches a class on elliptic curves, and students use LMFDB for their homework.

Saving money when performing massive calculations

Many researchers and educational institutions work under tight budget constraints, and Google Cloud lets them perform massive calculations at a reasonable cost. The GCE preemptible VMs Sutherland uses allow him to dramatically reduce costs while performing extremely complex calculations. These fully featured instances cost up to 80 percent less than their regular equivalents because they can be interrupted by GCE. Interrupting computations doesn’t cause a big performance hit: on average, only two to three percent of his instances are interrupted in each hour of computations, and a script automatically restarts them until his entire job is done, so little time is lost. By allowing these minor interruptions, he can run giant calculations at low cost and with practically no delay.
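To make the trade-off concrete, here is a rough Python sketch of the restart-until-done idea and the resulting cost arithmetic. It is an illustration only, not Sutherland's actual script: the function names, the four-hour task length, the one-hour time step and the rule that an interrupted task loses all progress are assumptions chosen for the example. The 3 percent hourly interruption rate and the 80 percent discount are the upper figures quoted above.

```python
import random


def run_until_done(num_tasks, hours_per_task, preempt_rate, seed=0):
    """Simulate restarting interrupted tasks until every task finishes.

    Returns total instance-hours consumed, including hours lost to
    interruptions. Assumption: an interrupted task restarts from scratch.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    total_hours = 0
    for _ in range(num_tasks):
        done = 0
        while done < hours_per_task:
            total_hours += 1
            if rng.random() < preempt_rate:
                done = 0  # interrupted: progress is lost, task restarts
            else:
                done += 1
    return total_hours


def relative_cost(total_hours, useful_hours, discount):
    """Cost on discounted instances relative to uninterrupted regular ones."""
    return (total_hours / useful_hours) * (1 - discount)


if __name__ == "__main__":
    tasks, length = 1000, 4          # hypothetical workload
    total = run_until_done(tasks, length, preempt_rate=0.03)
    useful = tasks * length
    print(f"instance-hours used: {total} (useful: {useful})")
    print(f"relative cost: {relative_cost(total, useful, discount=0.80):.2f}")
```

Even under the pessimistic assumption that every interruption discards a task's progress, the extra instance-hours stay small next to the discount, which is consistent with the article's point that little time is lost.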

We are mapping the mathematics of the 21st century

Andrew V. Sutherland, Computational number theorist and Principal Research Scientist, MIT
