Revolutionary tool speeds up GPU programs for scientific applications


Lawrence Livermore National Laboratory computer scientist Konstantinos Parasyris presents his team’s paper on the Record-Replay technique for speeding up applications on GPU-based systems at the 2023 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC23). Photo courtesy of SC Photography.

A Lawrence Livermore National Laboratory (LLNL)-led team has developed a method for optimizing application performance on large-scale graphics processing unit (GPU) systems, providing a useful tool for developers running on GPU-based massively parallel and distributed machines.

A recent paper, which features four LLNL co-authors, describes a mechanism called Record-Replay (RR), which speeds up applications on GPUs by recording how a program runs on a GPU and then replaying that recording to test different settings and find the fastest way to run the program. The paper was a finalist for the Best Paper award at the 2023 International Conference for High Performance Computing, Networking, Storage and Analysis (SC23).

“We developed a tool that automatically picks up part of the application, moves it as an independent piece so you can start independently and then optimize it,” said lead author and LLNL computer scientist Konstantinos Parasyris. “Once it is optimized, you can plug it into the original application. By doing so you can reduce the execution time of the entire application and do science faster.”

In the paper, the authors describe how RR works and how it can be used to improve the performance of OpenMP GPU applications. Parasyris said the mechanism helps “autotune” large offload applications, thus overcoming a major bottleneck for speed in scientific applications. 
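Conceptually, record-replay captures a piece of the program's state during a normal run so that the same piece can later be re-executed in isolation under different settings. The Python sketch below is only a toy illustration of that idea (the class and function names are hypothetical; the actual RR tool snapshots GPU device memory for OpenMP offload regions rather than Python arguments):

```python
import copy
import time

class RecordReplay:
    """Toy record-replay harness: capture a kernel's inputs once,
    then replay the kernel standalone under different settings."""

    def __init__(self):
        self.recording = None

    def record(self, kernel, *args):
        # Snapshot the inputs so the kernel can later run on its own.
        self.recording = (kernel, copy.deepcopy(args))
        return kernel(*args)

    def replay(self, **settings):
        # Re-run the captured kernel with a candidate configuration
        # and measure how long it takes.
        kernel, args = self.recording
        start = time.perf_counter()
        result = kernel(*copy.deepcopy(args), **settings)
        return result, time.perf_counter() - start

# Illustrative "kernel": an axpy-style loop with a tunable block size.
def saxpy(xs, ys, a=2.0, block=4):
    out = []
    for i in range(0, len(xs), block):  # process the data in blocks
        out.extend(a * x + y for x, y in zip(xs[i:i+block], ys[i:i+block]))
    return out

rr = RecordReplay()
baseline = rr.record(saxpy, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
for block in (1, 2, 8):                 # replay under different settings
    result, elapsed = rr.replay(block=block)
    assert result == baseline           # same answer, new configuration
```

The key property the sketch tries to convey is independence: once the inputs are recorded, each candidate configuration can be timed without re-running the whole application.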

As a case study, the authors demonstrated how RR was used to optimize performance of LULESH, a shockwave hydrodynamics code that simulates the behavior of materials under stress. By using the Record-Replay mechanism, researchers were able to optimize LULESH to run up to 50% faster on GPUs, an improvement that could make it possible to simulate much larger and more complex materials, a key capability for many scientific and engineering applications.

“You can directly translate that as a 50% speed-up on an application,” Parasyris said. “Science is driven by how fast you can do observations, and in that case, we'll be able to increase the number of observations within a day; so, the [calculation] that previously took a day to do, would only take two-thirds of the day.”  

By using the RR mechanism, researchers said they can test many different settings quickly and efficiently, making it possible to use Bayesian optimization — a method for finding the optimal settings for a program by testing different options and using statistics to determine which ones work best — on very large programs that would otherwise be too time-consuming to optimize.
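As a rough illustration of what that search looks like, the sketch below tunes two hypothetical launch parameters against a mock cost model, using exhaustive grid search in place of a true Bayesian optimizer (which would instead fit a statistical surrogate model to past measurements to choose the next candidate):

```python
# Mock cost model standing in for replaying a recorded kernel and timing
# it; a real autotuner would measure actual GPU execution time, and the
# parameter names below (num_teams, threads_per_team) are illustrative.
def replay_time(num_teams, threads_per_team):
    return abs(num_teams - 128) * 0.01 + abs(threads_per_team - 256) * 0.005

def tune():
    """Search the configuration space for the fastest replay."""
    best_cfg, best_time = None, float("inf")
    for num_teams in (32, 64, 128, 256):
        for threads in (64, 128, 256, 512):
            t = replay_time(num_teams, threads)  # "replay" and time it
            if t < best_time:                    # keep the fastest config
                best_cfg, best_time = (num_teams, threads), t
    return best_cfg, best_time

best_cfg, best_time = tune()
```

Because replaying a recorded region is much cheaper than re-running the full application, each trial in a loop like this becomes affordable, which is what makes smarter search strategies such as Bayesian optimization practical at scale.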

“Every program has many parameters that you can use to optimize it, and those parameters can be combined if it's out there,” said LLNL scientist and co-author Giorgis Georgakoudis. “If you find the best way to combine them, you can significantly reduce the execution time of the application, so that's our goal.”

Researchers said they plan to continue the work by investigating more use-cases facilitated by RR, such as tuning the compiler optimization pipeline, automatic benchmark generation and automated testing and debugging.

Other co-authors include Ignacio Laguna and Johannes Doerfert of LLNL and Esteban Rangel of Argonne National Laboratory. The work was supported by the Laboratory Directed Research and Development Program.

In addition to the Best Paper finalist work, Parasyris and Georgakoudis co-authored a paper presented at SC23 by LLNL intern Zane Fink on HPAC-Offload, a programming model that brings portable approximate computing (AC) to HPC applications on GPUs. The technique involves identifying and selectively approximating parts of the application that have low significance, yielding significant performance improvements while minimizing quality loss.
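A toy example of the trade-off approximate computing makes, using sampling, one common AC technique (the specific approximations HPAC-Offload applies are not detailed here): part of the work in a low-significance computation is skipped, trading a small accuracy loss for less work.

```python
def mean_exact(xs):
    # Exact reduction: touch every element.
    return sum(xs) / len(xs)

def mean_approx(xs, skip=2):
    # Approximate reduction: sample every `skip`-th element instead,
    # doing roughly `skip`x less work at a small cost in accuracy.
    sampled = xs[::skip]
    return sum(sampled) / len(sampled)

data = [float(i) for i in range(1000)]
exact = mean_exact(data)                  # 499.5
approx = mean_approx(data)                # 499.0
rel_error = abs(approx - exact) / exact   # roughly 0.1% quality loss
```

In a real AC system the decision of *which* computations tolerate approximation is the hard part; the quality-loss figures reported below come from applying such techniques selectively rather than uniformly.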

The authors demonstrated the effectiveness of HPAC-Offload on several HPC benchmarks, conducting a comprehensive performance analysis of the tool across GPU-accelerated HPC applications. They found that AC techniques can significantly accelerate HPC applications (a 1.64x speedup for LULESH on AMD GPUs and 1.57x on NVIDIA GPUs) with minimal quality loss (0.1%). The team also provided insights into the interplay between approximate computing and GPU-based parallelism, which can guide the future development of AC algorithms and systems for these architectures. That paper also includes LLNL computer scientist Harshitha Menon as a co-author.