Dr. Xingfu Wu and Dr. Valerie Taylor from the Department of Computer Science and Engineering at Texas A&M University have received a grant from the National Science Foundation (NSF) to conduct research that aims to improve the energy efficiency of scientific applications on large-scale high-performance computing (HPC) systems by focusing on three key factors: execution time, power and resilience.
Wu is a TEES Research Associate Professor and Taylor is the senior associate dean for academic affairs and the Royce E. Wisenbaker Professor.
The grant is a collaboration with Dr. Zhiling Lan, a professor of computer science at the Illinois Institute of Technology. The total grant is $500,000 with $300,000 to Texas A&M and $200,000 to Illinois Institute of Technology.
Power, execution time and resilience are paramount factors to consider in regard to the energy efficiency of parallel computing. This research aims to develop effective techniques for co-modeling and quantifying the complicated tradeoffs among parallel application execution time, power, and resilience, and to provide a tuning mechanism for user-defined metrics on HPC systems.
The experimental results and models the researchers obtain will be used to build a framework called MuMMI_R, which stands for Multiple Metrics Modeling Infrastructure with Resilience. This framework will enhance MuMMI, a previous large NSF project in which Taylor and Wu took part and which was completed in collaboration with Dr. Kirk Cameron from Virginia Tech, Dr. Dan Terpstra from the University of Tennessee at Knoxville, and Dr. Shirley Moore from the University of Texas at El Paso.
“While reducing execution time is still a major objective for high performance computing, future parallel systems and applications will have additional power and resilience requirements that represent a multidimensional tuning challenge,” Wu said.
Fault tolerance techniques allow a system to continue operating correctly even when a failure occurs within it. Because of hardware and software faults and silent data corruption, real-world scientific applications often rely on resilience techniques to complete long executions successfully.
While fault tolerance and power capping techniques continue to evolve, the tradeoffs across execution time, power efficiency and resilience strategies are not well understood. Existing fault tolerance studies mainly focus on the tradeoffs between execution time (or overhead) and resilience, whereas most power management studies focus on the tradeoffs between execution time and power.
“Understanding the tradeoffs among these three factors is crucial as future machines will be built with both reliability and power constraints,” said Taylor.
The MuMMI_R project will advance the understanding of the relationships among power, execution time and resilience on large-scale HPC systems. This understanding will aid the design of more energy-efficient and resilient parallel applications, runtime systems and computer architectures.