WSC’17 Paper: HPC Job Scheduling Simulation

Simulation of HPC Job Scheduling and Large-Scale Parallel Workloads, Mohammad Abu Obaida and Jason Liu. In Proceedings of the 2017 Winter Simulation Conference (WSC 2017), W. K. V. Chan, A. D’Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds., December 2017. To appear. [paper]

The paper presents a simulator designed specifically for evaluating job scheduling algorithms on large-scale HPC systems. The simulator is built on the Performance Prediction Toolkit (PPT), a parallel discrete-event simulator written in Python for rapid assessment and performance prediction of large-scale scientific applications on supercomputers. The proposed job scheduler simulator incorporates PPT’s application models and, when coupled with sufficiently detailed architecture models, can represent more realistic job runtime behaviors. Consequently, the simulator can evaluate different job scheduling and task mapping algorithms on specific target HPC platforms more accurately.
Not yet available.

WSC’17 Paper: HPC Simulation History

A Brief History of HPC Simulation and Future Challenges, Kishwar Ahmed, Jason Liu, Abdel-Hameed Badawy, and Stephan Eidenbenz. In Proceedings of the 2017 Winter Simulation Conference (WSC 2017), W. K. V. Chan, A. D’Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds., December 2017. To appear. [paper]

High-performance Computing (HPC) systems have gone through many changes during the past two decades in their architectural design to satisfy the increasingly large-scale scientific computing demand. Accurate, fast, and scalable performance models and simulation tools are essential for evaluating alternative architecture design decisions for the massive-scale computing systems. This paper recounts some of the influential work in modeling and simulation for HPC systems and applications, identifies some of the major challenges, and outlines future research directions which we believe are critical to the HPC modeling and simulation community.
Not yet available.

MASCOTS’17 Paper: Energy Demand Response Scheduling

An Energy Efficient Demand-Response Model for High Performance Computing Systems, Kishwar Ahmed, Jason Liu, and Xingfu Wu. In Proceedings of the 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2017), September 2017. [paper]


Demand response refers to reducing the energy consumption of participating systems in response to a transient surge in power demand or other emergency events. Demand response is particularly important for maintaining power grid transmission stability, as well as achieving overall energy saving. High Performance Computing (HPC) systems can be considered ideal participants for demand-response programs, due to their massive energy demand. However, the potential loss of performance must be weighed against the possible gain in power system stability and energy reduction. In this paper, we explore the opportunity of demand response on HPC systems by proposing a new HPC job scheduling and resource provisioning model. More specifically, the proposed model applies power-bound energy-conservation job scheduling during critical demand-response events, while maintaining traditional performance-optimized job scheduling during normal periods. We expect such a model can attract willing participation of HPC systems in demand-response programs, as it can improve both power stability and energy saving without significantly compromising application performance. We implement the proposed method in a simulator and compare it with the traditional scheduling approach. Using trace-driven simulation, we demonstrate that HPC demand response is a viable approach toward power stability and energy savings with only a marginal increase in the jobs’ execution time.


Not yet available.


HPPAC’17 Paper: Energy-Aware Scheduling

When Good Enough Is Better: Energy-Aware Scheduling for Multicore Servers, Xinning Hui, Zhihui Du, Jason Liu, Hongyang Sun, Yuxiong He, and David A. Bader. In Proceedings of the 13th Workshop on High-Performance, Power-Aware Computing (HPPAC 2017), held in conjunction with the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), May 2017. [paper]

Power is a primary concern for mobile, cloud, and high-performance computing applications. Approximate computing refers to running applications to obtain results with tolerable errors under resource constraints, and it can be applied to balance energy consumption with service quality. In this paper, we propose a “Good Enough (GE)” scheduling algorithm that uses approximate computing to provide satisfactory QoS (Quality of Service) for interactive applications with significant energy savings. Given a user-specified quality level, the GE algorithm works in the AES (Aggressive Energy Saving) mode for the majority of the time, neglecting the low-quality portions of the workload. When the perceived quality falls below the required level, the algorithm switches to the BQ (Best Quality) mode with a compensation policy. To avoid core speed thrashing between the two modes, GE employs a hybrid power distribution scheme that uses the Equal-Sharing (ES) policy to distribute power among the cores when the workload is light (to save energy) and the Water-Filling (WF) policy when the workload is high (to improve quality). We conduct simulations to compare the performance of GE with existing scheduling algorithms. Results show that the proposed algorithm can provide large energy savings with satisfactory user experience.
author={X. Hui and Z. Du and J. Liu and H. Sun and Y. He and D. A. Bader},
booktitle={Proceedings of the 13th Workshop on High-Performance, Power-Aware Computing (HPPAC 2017), held in conjunction with 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017)},
title={When Good Enough Is Better: Energy-Aware Scheduling for Multicore Servers},
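
The hybrid power distribution scheme in the abstract above switches between Equal-Sharing (ES) and Water-Filling (WF). The sketch below illustrates the difference between the two policies using the generic textbook form of water-filling (max-min fair allocation). It is not taken from the paper: the function names, the per-core `demands` input, and the example numbers are assumptions made for illustration, and the paper's WF policy may differ in detail.

```python
def equal_sharing(budget, demands):
    """Split the power budget evenly across cores, ignoring demand."""
    share = budget / len(demands)
    return [share] * len(demands)


def water_filling(budget, demands):
    """Generic water-filling (assumed form, not the paper's exact policy):
    satisfy the smallest demands first, then split what is left evenly
    among the cores that still want more."""
    alloc = [0.0] * len(demands)
    remaining = float(budget)
    # Visit cores from lowest to highest demand.
    order = sorted(range(len(demands)), key=lambda i: demands[i])
    for pos, i in enumerate(order):
        # Even share of the remaining budget among unserved cores.
        level = remaining / (len(order) - pos)
        alloc[i] = min(demands[i], level)
        remaining -= alloc[i]
    return alloc


if __name__ == "__main__":
    demands = [10.0, 40.0, 80.0]  # hypothetical per-core power demands (watts)
    print(equal_sharing(90.0, demands))
    print(water_filling(90.0, demands))
```

With these hypothetical demands, equal sharing gives every core the same share even when a lightly loaded core cannot use it, while water-filling caps each core at its demand and passes the surplus to the busier cores.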

Invited Talk: Symbiotic Modeling and High-Performance Simulation

Symbiotic Modeling and High-Performance Simulation

January 19, 2017

Department of Computer Science, Colorado School of Mines
Host: Professor Tracy Camp

Abstract: Modeling and simulation plays an important role in the design analysis and performance evaluation of complex systems. Many of these systems, such as the Internet and high-performance computing systems, involve a huge number of interrelated components and processes. Complex behaviors emerge as these components and processes inter-operate across multiple scales at various granularities. Modeling and simulation must be able to provide sufficiently accurate results while coping with the scale and the complexity of these systems. My talk will focus on some of our latest advances in high-performance modeling and simulation techniques. I will present two specific case studies, one on network emulation and the other on high-performance computing (HPC) modeling.
In the first case, I will present a novel distributed network emulation mechanism based on modeling symbiosis. Mininet is a container-based emulation environment that can study networks consisting of virtual hosts and OpenFlow-enabled virtual switches on Linux. It is well known, however, that experiments using Mininet may lose fidelity for large-scale networks and heavy traffic loads. We propose a symbiotic approach, where an abstract network model is used to coordinate the distributed emulation instances superimposed to represent the target network. In doing so, we can effectively study the behavior of real implementations of network applications on large-scale networks in a distributed environment.
In the second case, I will present our latest work on performance modeling of HPC architectures and applications. In collaboration with the Los Alamos National Laboratory, we have developed a highly efficient simulator, called Performance Prediction Toolkit (PPT), which can facilitate rapid and accurate performance prediction of large-scale scientific applications on existing and future HPC architectures.

HPCC’16 Paper: HPC Interconnect Model

Scalable Interconnection Network Models for Rapid Performance Prediction of HPC Applications, Kishwar Ahmed, Jason Liu, Stephan Eidenbenz, and Joe Zerr. In Proceedings of the 18th International Conference on High Performance Computing and Communications (HPCC 2016), December 2016. [paper] [slides]

Performance Prediction Toolkit (PPT) is a simulator mainly developed at Los Alamos National Laboratory to facilitate rapid and accurate performance prediction of large-scale scientific applications on existing and future HPC architectures. In this paper, we present three interconnect models for performance prediction of large-scale HPC applications. They are based on interconnect topologies widely used in HPC systems: torus, dragonfly, and fat-tree. We conduct extensive validation tests of our interconnect models, in particular, using configurations of existing HPC systems. Results show that our models provide good accuracy for predicting the network behavior. We also present a performance study of a parallel computational physics application to show that our model can accurately predict the parallel behavior of large-scale applications.
author={K. Ahmed and J. Liu and S. Eidenbenz and J. Zerr},
booktitle={Proceedings of the IEEE 18th International Conference on High Performance Computing and Communications (HPCC)},
title={Scalable Interconnection Network Models for Rapid Performance Prediction of HPC Applications},

REHPC’16 Paper: Program Power Profiling Based on Phases

Fast and Effective Power Profiling of Program Execution Based on Phase Behaviors, Xiaobin Ma, Zhihui Du and Jason Liu. In Proceedings of the 1st International Workshop on Resilience and/or Energy-Aware Techniques for High-Performance Computing (RE-HPC 2016), held in conjunction with the 7th International Green and Sustainable Computing Conference (IGSC 2016), November 2016. [paper]

Power profiling tools based on fast and accurate workload analysis can be useful for job scheduling and resource allocation aiming to optimize the power consumption of large-scale high-performance computer systems. In this paper, we propose a novel method for predicting the power consumption of a complete workload or application by extrapolating the measured power consumption of only a few code segments of the same application. As such, it provides a fast yet effective way of predicting the power consumption of a single-threaded execution of a program on arbitrary architectures without having to profile the entire program’s execution. Such a full profile would be costly to obtain, especially for a long-running program. Our method employs a set of code analysis tools to capture the program’s phase behavior and then adopts a multi-variable linear regression method to estimate the power consumption of the entire program. We use the SPEC 2006 benchmark to evaluate the accuracy and effectiveness of our method. Experimental results show that our power profiling method achieves good accuracy in predicting a program’s energy use with relatively small errors.
author={Xiaobin Ma and Zhihui Du and Jason Liu},
booktitle={Proceedings of the 7th International Green and Sustainable Computing Conference (IGSC)},
title={Fast and effective power profiling of program execution based on phase behaviors},
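
The abstract above names multi-variable linear regression as the estimation step. The sketch below shows what such a step could look like: a least-squares fit on a few measured code segments, then a prediction for an unmeasured one. The feature choice (two per-segment metrics), the sample values, and the function names are hypothetical and not taken from the paper.

```python
def fit_linear(features, targets):
    """Least-squares fit of targets ~ b0 + b1*x1 + ... + bk*xk,
    solved through the normal equations (X^T X) b = X^T y."""
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (a[i][n] - s) / a[i][i]
    return beta


def predict(beta, feature):
    """Estimated power for one segment from its feature vector."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], feature))


if __name__ == "__main__":
    # Hypothetical measured segments: (metric_1, metric_2) -> power in watts.
    segments = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
    power = [2.0 + 3.0 * x1 + 0.5 * x2 for x1, x2 in segments]
    beta = fit_linear(segments, power)
    print(beta)
    print(predict(beta, (6.0, 2.0)))
```

The normal equations are solved directly here to keep the example dependency-free; with many features or nearly collinear ones, a numerically more robust solver would be the safer choice.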

PADS’16 Paper: Integrated Interconnect Model

An Integrated Interconnection Network Model for Large-Scale Performance Prediction, Kishwar Ahmed, Mohammad Obaida, Jason Liu, Stephan Eidenbenz, Nandakishore Santhi, and Guillaume Chapuis. In Proceedings of the 2016 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (SIGSIM-PADS 2016), May 2016. [paper]

The interconnection network is a critical component of high-performance computing architecture and application co-design. For many scientific applications, the increasing communication complexity poses a serious concern as it may hinder the scaling properties of these applications on novel architectures. A scalable, efficient, and accurate interconnect model is therefore essential for performance evaluation studies. In this paper, we present an interconnect model for predicting the performance of large-scale applications on high-performance architectures. In particular, we present a sufficiently detailed interconnect model for Cray’s Gemini 3-D torus network. The model has been integrated with an implementation of the Message-Passing Interface (MPI) that can mimic most of its functions with packet-level accuracy on the target platform. Extensive experiments show that our integrated model provides good accuracy for predicting the network behavior, while at the same time allowing for good parallel scaling performance.
author = {Ahmed, Kishwar and Obaida, Mohammad and Liu, Jason and Eidenbenz, Stephan and Santhi, Nandakishore and Chapuis, Guillaume},
title = {An Integrated Interconnection Network Model for Large-Scale Performance Prediction},
booktitle = {Proceedings of the 2016 Annual ACM Conference on SIGSIM Principles of Advanced Discrete Simulation},
series = {SIGSIM-PADS ’16},
year = {2016},
isbn = {978-1-4503-3742-7},
location = {Banff, Alberta, Canada},
pages = {177–187},
numpages = {11},
url = {},
doi = {10.1145/2901378.2901396},
acmid = {2901396},
publisher = {ACM},
address = {New York, NY, USA},
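
As a small illustration of the 3-D torus topology mentioned in the abstract above, the following sketch computes the minimal hop count between two coordinates in a generic torus, where each dimension wraps around. It assumes shortest-path routing taken dimension by dimension and does not model Gemini-specific details or the paper's packet-level behavior. The function name and coordinates are hypothetical.

```python
def torus_hops(src, dst, dims):
    """Minimal hop count between two nodes of a torus.

    src, dst: coordinate tuples, one entry per dimension.
    dims: size of the torus along each dimension.
    """
    hops = 0
    for s, d, n in zip(src, dst, dims):
        delta = abs(s - d)
        # Either go directly or use the wraparound link, whichever is shorter.
        hops += min(delta, n - delta)
    return hops


if __name__ == "__main__":
    # Hypothetical 8 x 8 x 8 torus.
    print(torus_hops((0, 0, 0), (7, 1, 4), (8, 8, 8)))
```

The wraparound links are what distinguish a torus from a mesh: along a dimension of size n, no two nodes are more than n // 2 hops apart.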