New computing cluster coming to Livermore

Nov. 8, 2018
The Corona high-performance computing cluster, built by Penguin Computing, will be delivered to LLNL in late November and is expected to be available for limited use by December.

Jeremy Thomas, thomas244@llnl.gov, 925-422-5539

Lawrence Livermore National Laboratory, in partnership with Penguin Computing, AMD and Mellanox Technologies, will accept delivery of Corona, a new unclassified high-performance computing (HPC) cluster that will provide unique capabilities for Lab researchers and industry partners to explore data science, machine learning and big data analytics.

The system will be provided by Penguin Computing and will comprise AMD EPYC™ processors and AMD Radeon™ Instinct™ GPU (graphics processing unit) accelerators connected via a Mellanox HDR 200 Gigabit InfiniBand network. The system lends itself to applying machine learning and data analysis techniques to challenging problems in HPC and big data and will be used to support the National Nuclear Security Administration’s (NNSA) Advanced Simulation and Computing (ASC) program. The system will be housed by Livermore Computing (LC) in an unclassified site adjacent to the High Performance Computing Innovation Center (HPCIC), which is dedicated to partnerships with American industry.

Procured through the Commodity Technology Systems (CTS-1) contract, Corona will help NNSA assess future architectures, fill institutional and ASC needs to develop leadership in data science and machine learning capabilities at scale, provide access to HPCIC partners and extend a continuous collaboration vehicle for AMD, Penguin, Mellanox and LLNL.

“Corona will provide an excellent platform for our research into cognitive computing algorithms and developing predictive simulations for both inertial confinement fusion applications as well as molecular dynamics simulations targeting precision medicine for oncology,” said Brian Van Essen, LLNL Informatics group leader and computer scientist. “The unique computational resources and interconnect will allow us to continue to develop leading edge algorithms for scalable distributed deep learning. As deep learning becomes an integral part of many applications at the Laboratory, computational resources like Corona are vital to our ability to develop the next generation of scientific applications.”

Funded by the LLNL Multi-Programmatic and Institutional Computing (M&IC) program and the NNSA’s ASC program, the 383 teraFLOPS (trillion floating-point operations per second) Corona cluster will be delivered in late November and is expected to be available for limited use by December. The cluster consists of 170 two-socket nodes incorporating 24-core AMD EPYC™ 7401 processors and a PCIe 1.6-terabyte (TB) nonvolatile (solid-state) memory device. Every Corona compute node is GPU-ready, and half of the nodes are equipped with four AMD Radeon Instinct™ MI25 GPUs each, together delivering 4.2 petaFLOPS of FP32 peak performance. The remaining compute nodes may be upgraded with future GPUs.
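As a rough sanity check on the GPU figures above, the arithmetic can be sketched in a few lines of Python. The node count, the half-populated split and the four GPUs per node come from this article. The per-GPU FP32 peak of 12.3 teraFLOPS is an assumption taken from AMD's published MI25 specification, not from this article, and the function names are purely illustrative.

```python
# Back-of-the-envelope check of the GPU figures quoted in the article.
# This is a sketch under stated assumptions, not an official performance model.

TOTAL_NODES = 170              # from the article
GPUS_PER_GPU_NODE = 4          # from the article
MI25_FP32_PEAK_TFLOPS = 12.3   # ASSUMPTION: vendor-published per-GPU peak, not stated in the article


def gpu_node_count(total_nodes: int) -> int:
    """Half of the compute nodes carry GPUs at delivery."""
    return total_nodes // 2


def aggregate_fp32_petaflops(total_nodes: int, gpus_per_node: int, per_gpu_tflops: float) -> float:
    """Aggregate peak FP32 throughput of all installed GPUs, in petaFLOPS."""
    gpus = gpu_node_count(total_nodes) * gpus_per_node
    return gpus * per_gpu_tflops / 1000.0  # 1 petaFLOPS = 1,000 teraFLOPS


if __name__ == "__main__":
    nodes_with_gpus = gpu_node_count(TOTAL_NODES)
    total_gpus = nodes_with_gpus * GPUS_PER_GPU_NODE
    peak = aggregate_fp32_petaflops(TOTAL_NODES, GPUS_PER_GPU_NODE, MI25_FP32_PEAK_TFLOPS)
    print(f"GPU nodes: {nodes_with_gpus}, GPUs: {total_gpus}, FP32 peak: {peak:.1f} petaFLOPS")
```

Under that assumption, 85 GPU nodes hold 340 GPUs, and the aggregate rounds to the 4.2 petaFLOPS quoted above. The separate 383 teraFLOPS figure is not derived here, because the article does not say how it was computed.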

Corona is likely to supplant the LLNL Catalyst cluster, a 150-teraFLOPS unclassified HPC cluster. It will run the NNSA-funded Tri-lab Open Source Software (TOSS) that provides a common user environment for Los Alamos, Sandia and Lawrence Livermore national labs. 

“We’re in a unique position working with this heterogeneous architecture,” said Matt Leininger, deputy of Advanced Technology Projects for LLNL. “Corona is the next logical step in applying leading-edge technologies to the scientific discovery mission of the Laboratory. This system will be capable of generating big data from HPC simulations, while also being capable of translating that data into knowledge through the use of machine learning and data analysis.”

The HPC Innovation Center at LLNL will offer access to Corona and the expected machine learning innovations it enables as a new option for its ongoing collaboration with American companies and research institutions. 

“Penguin Computing has been working with America’s national energy and defense labs on projects focused on open systems for almost 20 years,” said Sid Mair, senior vice president, federal systems at Penguin Computing. “During this long collaboration, we’ve been able to help them take advantage of the value, both in terms of return on investment and flexibility, that open systems provide compared to proprietary systems. Helping them deploy AI using open systems in the Corona system is an exciting new chapter in this relationship that we hope will help them execute their mission even more effectively.”

“AMD welcomes the delivery of the Corona system to the HPCIC and the selection of high-performance AMD EPYC processors and AMD Radeon Instinct accelerators for the cluster,” said Mark Papermaster, AMD’s senior vice president and chief technology officer. “The collaboration between AMD, Penguin, Mellanox and Lawrence Livermore National Lab has built a world-class HPC system that will enable researchers to push the boundaries of science and innovation.”

The system is interconnected via the new-generation high-performance Mellanox HDR 200G InfiniBand network, enabling the Lab to accelerate applications and increase scaling and efficiencies. The diverse mixture of computing technologies will allow LLNL and Corona partners to explore new approaches to cognitive simulation – blending machine learning and HPC – and intelligence-based data analytics. 

“HDR 200G InfiniBand brings a new level of performance and scalability needed to build the next generation of high-performance computing and artificial intelligence systems,” said Gilad Shainer, vice president of marketing at Mellanox Technologies. “The collaboration between Penguin, AMD and LLNL results in a technology-leading platform that will progress science and discovery at the Lab.”