Lawrence Livermore National Laboratory (LLNL) is collaborating with Penguin Computing Inc. and graphics card manufacturer AMD to upgrade its unclassified computing cluster Corona, roughly doubling the number of graphics processors (GPUs) in the system. The upgrade will provide significantly greater performance and bring additional capabilities in artificial intelligence and machine learning to the LLNL user community, according to Lab researchers.
Penguin Computing recently announced that Corona, a high performance computing cluster delivered to LLNL in 2018, has been upgraded with the newest AMD Radeon Instinct MI60 accelerators, based on “Vega,” the world’s first 7-nanometer GPU architecture. The upgrade is funded through the Commodity Technology Systems (CTS-1) contract with the National Nuclear Security Administration (NNSA).
Corona is being made available to industry through LLNL’s High Performance Computing Innovation Center (HPCIC). The upgrade will help LLNL researchers and their industry partners improve capabilities in scalable deep learning, big data analytics and data science, while enhancing NNSA’s ability to assess future architectures and meet the needs of NNSA’s Advanced Simulation & Computing program. It also will provide a higher level of performance for researching cognitive computing and developing predictive simulations for applications such as inertial confinement fusion and molecular dynamics simulations for precision medicine.
“This upgrade significantly increases the capability available on Corona,” said Bronis R. de Supinski, chief technical officer for Livermore Computing. “The new Vega GPUs offer substantial double-precision performance, in addition to much more single-precision performance. LLNL scientists will use the combination to understand the potential of mixed-precision algorithms for a variety of domains.”
The Corona cluster consists of 170 two-socket nodes, each with 24-core AMD EPYC 7401 processors and a PCIe 1.6-terabyte solid-state memory device. Each Corona compute node is GPU-ready; before the upgrade, half of the nodes used four AMD Radeon Instinct MI25 accelerators per node, delivering 4.2 petaflops of peak FP32 performance. With the MI60 upgrade, the cluster's peak FP32 performance rises to 9.45 petaflops. The accelerators are connected via a Mellanox HDR 200 Gigabit InfiniBand network.
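The figures in the preceding paragraph can be cross-checked with simple arithmetic. The short Python sketch below uses only the numbers quoted above (170 nodes, four accelerators per node on half of the nodes, and the 4.2 and 9.45 petaflops peaks). The per-accelerator value it prints is inferred from those totals, not a vendor specification, and the variable names are illustrative.

```python
# Figures quoted in the article.
NODES = 170
GPUS_PER_NODE = 4
PEAK_BEFORE_PFLOPS = 4.2    # FP32 peak with the MI25 accelerators
PEAK_AFTER_PFLOPS = 9.45    # FP32 peak after the MI60 upgrade

# Half of the nodes carried four MI25 accelerators each.
mi25_nodes = NODES // 2
mi25_gpus = mi25_nodes * GPUS_PER_NODE

# Per-accelerator peak implied by the quoted total (an inference,
# not a vendor specification). 1 petaflops = 1000 teraflops.
implied_mi25_tflops = PEAK_BEFORE_PFLOPS * 1000 / mi25_gpus

# Peak added by the upgrade, and the overall ratio.
added_pflops = PEAK_AFTER_PFLOPS - PEAK_BEFORE_PFLOPS
peak_ratio = PEAK_AFTER_PFLOPS / PEAK_BEFORE_PFLOPS

print(f"MI25 accelerators before the upgrade: {mi25_gpus}")
print(f"Implied FP32 peak per MI25: {implied_mi25_tflops:.1f} teraflops")
print(f"FP32 peak added by the upgrade: {added_pflops:.2f} petaflops")
print(f"Peak ratio, after vs. before: {peak_ratio:.2f}x")
```

By this arithmetic, the quoted peaks correspond to a ratio of about 2.25, slightly more than a doubling of FP32 peak performance.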
“The Penguin Computing Department of Energy team continues our collaborative venture with our vendor partners AMD and Mellanox to ensure the Livermore Corona GPU enhancements expand the capabilities to continue their mission outreach within various machine learning communities,” said Ken Gudenrath, director of Federal Systems at Penguin Computing.
AMD’s Radeon Instinct MI60 accelerators utilize the company’s Infinity Fabric Link technology, a peer-to-peer GPU communications technology that delivers up to 184 gigabytes per second transfer speeds between GPUs. The new accelerators also utilize the latest ROCm open-source software stack, which is integrated into frameworks like TensorFlow and PyTorch and maps workloads to the heterogeneous compute resources of the underlying hardware.
“AMD is pleased to continue collaboration with LLNL and the NNSA in advancing open accelerator solutions. Access to systems like Corona enables next-generation scientific discovery as we move to the exascale era,” said Ogi Brkic, corporate vice president and general manager of the Data Center GPU Business Unit at AMD.