LLNL, IBM and Red Hat joining forces to explore standardized HPC resource management interface

Under a new memorandum of understanding, researchers at Lawrence Livermore National Laboratory (LLNL), IBM and Red Hat will aim to enable next-generation workloads by integrating LLNL’s Flux scheduling framework with Red Hat OpenShift to allow more traditional high performance computing jobs to take advantage of cloud and container technologies. Pictured, from left, are LLNL postdoctoral researcher Dan Milroy and computer scientists Stephen Herbein and Dong H. Ahn.

Lawrence Livermore National Laboratory (LLNL), IBM and Red Hat are combining forces to develop best practices for interfacing high performance computing (HPC) schedulers and cloud orchestrators, an effort designed to prepare for emerging supercomputers that take advantage of cloud technologies.

Under a recently signed memorandum of understanding (MOU), researchers aim to enable next-generation workloads by integrating LLNL’s Flux scheduling framework with Red Hat OpenShift — a leading enterprise Kubernetes platform — to allow more traditional HPC jobs to utilize cloud and container technologies. A new standardized interface would help satisfy an increasing demand for compute-intensive jobs that combine HPC with cloud computing across a wide range of industry sectors, researchers said.

“Cloud systems are increasingly setting the directions of the broader computing ecosystem, and economics are a primary driver,” said Bronis de Supinski, chief technology officer of Livermore Computing at LLNL. “With the growing prevalence of cloud-based systems, we must align our HPC strategy with cloud technologies, particularly in terms of their software environments, to ensure the long-term sustainability and affordability of our mission-critical HPC systems.”

LLNL’s open source Flux scheduling framework builds upon the Lab’s extensive experience in HPC and allows new resource types, schedulers and services to be deployed as data centers continue to evolve, including the emergence of exascale computing. Its ability to make smart placement decisions and its rich resource expression make it well suited to facilitate orchestration using tools like Red Hat OpenShift on large-scale HPC clusters, which LLNL researchers anticipate becoming more commonplace in the years to come.

“One of the trends we’ve been seeing at Livermore is the loose coupling of HPC applications and applications like machine learning and data analytics on the orchestrated side, but in the near future we expect to see a closer meshing of those two technologies,” said LLNL postdoctoral researcher Dan Milroy. “We think that unifying Flux with cloud orchestration frameworks like Red Hat OpenShift and Kubernetes is going to allow both HPC and cloud technologies to come together in the future, helping to scale workflows everywhere. I believe co-developing Flux with OpenShift is going to be really advantageous.”

Red Hat OpenShift is an open source container platform based on the Kubernetes container orchestrator for enterprise app development and deployment. Kubernetes is an open source system for automating deployment, scaling and management of containerized applications.

Researchers want to further enhance Red Hat OpenShift and make it a common platform for a wide range of computing infrastructures, including large-scale HPC systems, enterprise systems and public cloud offerings, starting with commercial HPC workloads.

“We would love to see a platform like Red Hat OpenShift be able to run a wide range of workloads on a wide range of platforms, from supercomputers to clusters,” said IBM Research Staff Member Claudia Misale. “We see difficulties in the HPC world from having many different types of HPC software stacks, and container platforms like OpenShift can address these difficulties. We believe OpenShift can be a common denominator, like Red Hat Enterprise Linux has been a common denominator on HPC systems.”

The impetus for enabling Flux as a Kubernetes scheduler plug-in began with a successful prototype that came from a Collaboration of Oak Ridge, Argonne and Livermore (CORAL) and Centers of Excellence project between LLNL and IBM to understand the formation of cancer. The plug-in enabled more sophisticated scheduling of Kubernetes workflows, which convinced the team that it could integrate Flux with Red Hat OpenShift, researchers said.

Because many HPC centers use their own schedulers, a primary goal is to “democratize” the Kubernetes interface for HPC users, pursuing an open interface that any HPC site or center could adopt and use with its existing schedulers.

“We’ve been seeing a steady trend toward data-centric computing, which includes the convergence of artificial intelligence/machine learning and HPC workloads,” said Chris Wright, senior vice president and chief technology officer at Red Hat. “The HPC community has long been on the leading edge of data analysis. Bringing their expertise in complex large-scale scheduling to a common cloud-native platform is a perfect expression of the power of open source collaboration. This brings new scheduling capabilities to Red Hat OpenShift and Kubernetes and brings modern cloud-native AI/ML applications to the large labs.”

The researchers plan to initially integrate Flux to run within the Red Hat OpenShift environment, using Flux as a driver for other commonly used schedulers to interface with OpenShift and Kubernetes, eventually facilitating the platform for use with any HPC workload and on any HPC machine.

“This effort will make it easy for HPC workflows to leverage leading HPC schedulers like Flux to realize the full potential of emerging HPC and cloud environments,” said Dong Ahn, lead for LLNL’s Advanced Technology Development and Mitigation Next Generation Computing Enablement project.

The team has begun working on scheduling topology and anticipates defining an interface within the next six months. Future goals include exploring different integration models, such as co-location, and extending advanced management and configuration beyond the node.