National labs, industry partners prepare for new era of computing through Centers of Excellence

IBM employees and Lab code and application developers held a “Hackathon” event in June to work on coding challenges for a predecessor system to the Sierra supercomputer. Through the ongoing Centers of Excellence (CoE) program, employees from IBM and NVIDIA have been on-site to help LLNL developers transition applications to the Sierra system, which will have a completely different architecture than the Lab has had before. Photo by Jeremy Thomas/LLNL

The Department of Energy’s drive toward the next generation of supercomputers, "exascale" machines capable of more than a quintillion (10^18) calculations per second, isn’t simply to boast about having the fastest processing machines on the planet. At Lawrence Livermore National Laboratory (LLNL) and other DOE national labs, these systems will play a vital role in the National Nuclear Security Administration’s (NNSA) core mission of ensuring the safety and performance of the nation’s nuclear stockpile in the absence of underground testing.

The driving force behind faster, more robust computing power is the need for simulations and codes that are higher resolution, increasingly predictive and able to incorporate more complex physics. It’s an evolution that is changing the way the national labs’ application and code developers are approaching design. To aid in the transition and prepare researchers for pre-exascale and exascale systems, LLNL has brought experts from IBM and NVIDIA together with Lab computer scientists in a Center of Excellence (CoE), a co-design strategy born out of the need for vendors and government to work together to optimize emerging supercomputing systems.

"There are disruptive machines coming down the pike that are changing things out from under us," said Rob Neely, an LLNL computer scientist and Weapon Simulation & Computing Program coordinator for Computing Environments. "We need a lot of time to prepare; these applications need insight, and who better to help us with that than the companies who will build the machines? The idea is that when a machine gets here, we’re not caught flat-footed. We want to hit the ground running right away."

While LLNL’s exascale system isn’t scheduled for delivery until 2023, Sierra, the Laboratory’s pre-exascale system, is on track to begin installation this fall and will begin running science applications at full machine scale by early next spring. Built by IBM and NVIDIA, Sierra will have about six times more computing power than LLNL’s current behemoth, Sequoia. The Sierra system is unique to the Lab in that it’s made up of two kinds of hardware -- IBM CPUs and NVIDIA GPUs -- that have different memory locations associated with each type of computing device and a programming model more complex than LLNL scientists have programmed to in the past. In the meantime, Lab scientists are receiving guidance from experts from the two companies, utilizing a small predecessor system that is already running some components and has some of the technological features that Sierra will have.

LLNL’s Center of Excellence, which began in 2014, involves about a half dozen IBM and NVIDIA personnel on-site, and a number of remote collaborators who work with Lab developers. The team is on hand to answer any questions Lab computer scientists have, educate LLNL personnel to use best practices in coding hybrid systems, develop strategies for optimizations, debug and advise on global code restructuring that often is needed to obtain performance. The CoE is a symbiotic relationship -- LLNL scientists get a feel for how Sierra will operate, and IBM and NVIDIA gain better insight into what the Lab’s needs are and what the machines they build are capable of.

"We see how the systems we design and develop are being used and how effective they can be," said IBM research staff member Leopold Grinberg, who works on the LLNL site. "You really need to get into the mind of the developer to understand how they use the tools. To sit next to the developers’ seats and let them drive, to observe them, gives us a good idea of what we are doing right and what needs to be improved. Our experts have an intimate knowledge of how the system works, and having them side-by-side with Lab employees is very useful."

Sierra, Grinberg explained, will use a completely different system architecture than what has been used before at LLNL. It’s not only faster than any machine the Lab has had, it also has different tools built into the compilers and programming models. In some cases, the changes developers need to make are substantial, requiring restructuring hundreds or thousands of lines of code. Through the CoE, Grinberg said he’s learning more about how the system will be used for production scientific applications.

"It’s a constant process of learning for everybody," Grinberg said. "It’s fun, it’s challenging. We gather the knowledge and it’s also our job to distribute it. There’s always some knowledge to be shared. We need to bring the experience we have with heterogeneous systems and emerging programming models to the lab, and help people generate updated codes or find out what can be kept as is to optimize the system we build. It’s been very fruitful for both parties."

The CoE strategy is additionally being implemented at Oak Ridge National Laboratory, which is bringing in a heterogeneous system of its own called Summit. Other CoE programs are in place at Los Alamos and Lawrence Berkeley national laboratories. Each CoE has a similar goal of preparing computational scientists with the tools they will need to utilize pre-exascale and exascale systems. Since Livermore is new to using GPUs for the bulk of computing power, the Sierra architecture places a heavy emphasis on figuring out which sections of a multi-physics application are the most performance-critical, and the code restructuring that must take place to most effectively use the system.

"Livermore and Oak Ridge scientists are really pushing the boundaries of the scale of these GPU-based systems," said Max Katz, a solutions architect at NVIDIA who spends four days a week at LLNL as a technical adviser. "Part of our motivation is to understand machine learning and how to make it possible to merge high-performance computing with the applications demanded by industry. The CoE is essential because it’s difficult for any one party to predict how these CPU/GPU systems will behave together. Each one of us brings in expertise and by sharing information, it makes us all more well-rounded. It’s a great opportunity."

In fact, the opportunity was so compelling that in 2016 the CoE was augmented with a three-year institutional component (dubbed the Institutional Center of Excellence, or iCOE) to ensure that other mission-critical efforts at the Laboratory also could participate. This has added nine application development efforts, including one in data science, and expanded the number of IBM and NVIDIA personnel. By working together, the teams can explore many more types of applications, and the performance solutions they develop can be shared among all the code teams in the greater CoE.

"At the end of the iCOE project, the real value will be not only that some important institutional applications run well, but that every directorate at LLNL will have trained staff with expertise in using Sierra, and we’ll have documented lessons learned to help train others," said Bert Still, leader for Application Strategy (Livermore Computing).

Steve Rennich, a senior HPC developer-technology engineer with NVIDIA, visits the Lab once a week to help LLNL scientists port mission-critical applications optimized for CPUs over to NVIDIA GPUs, which have an order of magnitude greater computing power than CPUs. Besides writing bug-free code, Rennich said, the goal is to improve performance enough to meet the Lab’s considerable computing requirements.

"The challenge is they’re fairly complex codes so to do it correctly takes a fair amount of attention to detail," Rennich said. "It’s about making sure the new system can handle as large a model as the Lab needs. These are colossal machines, so when you create applications at this scale, it’s like building a race car. To take advantage of this increase in performance, you need all the pieces to fit and work together."

Current plans are to continue the existing Center of Excellence at LLNL at least into 2019, when Sierra is fully operational. Until then, having experts working shoulder-to-shoulder with Lab developers to write code will be a huge benefit to all parties, said LLNL’s Neely, who wants the collaboration to publish its discoveries and share them with the broader computing community.

"We’re focused on the issue at hand, and moving things toward getting ready for these machines is hugely beneficial," Neely said. "These are very large applications developed over decades, so ultimately it’s the code teams that need to be ready to take this over. We’ve got to make this work because we need to ensure the safety and performance of the U.S. stockpile in the absence of nuclear testing. We’ve got the right teams and people to pull this off."
