WSC’17 Paper: HPC Job Scheduling Simulation

Simulation of HPC Job Scheduling and Large-Scale Parallel Workloads, Mohammad Abu Obaida and Jason Liu. In Proceedings of the 2017 Winter Simulation Conference (WSC 2017), W. K. V. Chan, A. D’Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds., December 2017. To appear. [paper]

The paper presents a simulator designed specifically for evaluating job scheduling algorithms on large-scale HPC systems. The simulator builds on the Performance Prediction Toolkit (PPT), a parallel discrete-event simulator written in Python for rapid assessment and performance prediction of large-scale scientific applications on supercomputers. The proposed job scheduler simulator incorporates PPT’s application models and, when coupled with sufficiently detailed architecture models, can represent job runtime behaviors more realistically. Consequently, the simulator can evaluate different job scheduling and task mapping algorithms on specific target HPC platforms more accurately.
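
As a rough illustration of what a discrete-event job scheduling simulation involves, the sketch below runs a strict first-come-first-served policy over a list of jobs. It is not PPT code and does not use PPT's API; the job tuple format, the function name `simulate_fcfs`, and the fixed job runtimes are assumptions made for this example (the paper's simulator derives runtimes from application and architecture models instead).

```python
import heapq
from collections import deque

def simulate_fcfs(jobs, total_nodes):
    """Simulate strict first-come-first-served scheduling.

    jobs: list of (arrival_time, nodes_needed, runtime) tuples
          (a hypothetical input format chosen for this sketch).
    Returns a dict mapping job index to (start_time, finish_time).
    A job that needs more than total_nodes never starts and blocks
    every job queued behind it; such jobs are absent from the result.
    """
    # Event entries are (time, kind, job_index); kind 0 = finish, 1 = arrival,
    # so at equal timestamps nodes are released before new jobs are queued.
    events = [(arrival, 1, i) for i, (arrival, _, _) in enumerate(jobs)]
    heapq.heapify(events)
    waiting = deque()
    free = total_nodes
    schedule = {}

    while events:
        now, kind, i = heapq.heappop(events)
        if kind == 1:
            waiting.append(i)          # job arrives and joins the queue
        else:
            free += jobs[i][1]         # job finishes and releases its nodes
        # Strict FCFS: only the head of the queue may start (no backfilling).
        while waiting and jobs[waiting[0]][1] <= free:
            j = waiting.popleft()
            _, need, runtime = jobs[j]
            free -= need
            schedule[j] = (now, now + runtime)
            heapq.heappush(events, (now + runtime, 0, j))
    return schedule
```

For example, `simulate_fcfs([(0, 4, 10), (1, 4, 5), (2, 2, 3)], total_nodes=6)` keeps the third job waiting behind the second even though two nodes are idle, which is the kind of behavior a backfilling policy would change.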

WSC’17 Paper: HPC Simulation History

A Brief History of HPC Simulation and Future Challenges, Kishwar Ahmed, Jason Liu, Abdel-Hameed Badawy, and Stephan Eidenbenz. In Proceedings of the 2017 Winter Simulation Conference (WSC 2017), W. K. V. Chan, A. D’Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds., December 2017. To appear. [paper]

High-performance Computing (HPC) systems have gone through many changes during the past two decades in their architectural design to satisfy the increasingly large-scale scientific computing demand. Accurate, fast, and scalable performance models and simulation tools are essential for evaluating alternative architecture design decisions for the massive-scale computing systems. This paper recounts some of the influential work in modeling and simulation for HPC systems and applications, identifies some of the major challenges, and outlines future research directions which we believe are critical to the HPC modeling and simulation community.

BigData’17 Paper: Light Curve Anomaly Detection

Real-Time Anomaly Detection of Short Time-Scale GWAC Survey Light Curves, Tianzhi Feng, Zhihui Du, Yankui Sun, Jianyan Wei, Jing Bi, and Jason Liu. In Proceedings of the 6th IEEE International Congress on Big Data, June 2017. [paper]

The Ground-based Wide-Angle Camera array (GWAC) is a short time-scale survey telescope that can take images covering a field of view of over 5,000 square degrees every 15 seconds or less. One of the scientific missions of GWAC is to detect anomalous astronomical events accurately and quickly. For that, a huge amount of data must be handled in real time. In this paper, we propose a new time series analysis model, called DARIMA (Dynamic Auto-Regressive Integrated Moving Average), to identify the anomalous events that occur in light curves obtained from GWAC as early as possible and with a high degree of confidence. A major advantage of DARIMA is that it can dynamically adjust its model parameters during the real-time processing of the time series data. We identify the anomaly points based on the weighted prediction results of different time windows to improve accuracy. Experimental results using real survey data show that the DARIMA model can identify the first anomaly point for all light curves. We also evaluate our model with simulated anomaly events of various types embedded in the real time series data. The DARIMA model is able to generate early warning triggers for all of them. The results from the experiments demonstrate that the proposed DARIMA model is a promising method for real-time anomaly detection of short time-scale GWAC light curves.
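
The idea of combining predictions from several time windows can be illustrated with a much simpler stand-in than DARIMA. The sketch below is not the DARIMA model: it forecasts each point as a weighted average of moving-window means (rather than ARIMA forecasts), uses fixed weights (no dynamic parameter adjustment), and flags a point when it deviates from the forecast by more than `k` standard deviations. The function name, window sizes, weights, and threshold are all assumptions for illustration.

```python
from statistics import mean, pstdev

def first_anomaly(series, windows=(3, 5, 10), weights=(0.5, 0.3, 0.2), k=4.0):
    """Return the index of the first point that deviates from the weighted
    multi-window forecast by more than k standard deviations, or None."""
    longest = max(windows)
    for t in range(longest, len(series)):
        # One forecast per window: the mean of the most recent w points.
        forecasts = [mean(series[t - w:t]) for w in windows]
        forecast = sum(wt * f for wt, f in zip(weights, forecasts))
        # Noise scale estimated from the longest window; the floor avoids
        # a zero threshold on perfectly flat data.
        scale = max(pstdev(series[t - longest:t]), 1e-9)
        if abs(series[t] - forecast) > k * scale:
            return t
    return None
```

Because each point is examined as soon as it arrives and only past values are used, the same loop body could be applied to a live stream one sample at a time.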

SIMUTOOLS’17 Paper: Improving Real-Time SDN Simulation

On Improving Parallel Real-Time Network Simulation for Hybrid Experimentation of Software Defined Networks, Mohammad Abu Obaida and Jason Liu. In Proceedings of the 10th EAI International Conference on Simulation Tools and Techniques (SIMUTOOLS 2017), September 2017. To appear. [paper]

Real-time network simulation enables simulation to operate in real time, and in doing so allows experiments with simulated, emulated, and real network components acting in concert to test novel network applications or protocols. Real-time simulation can also run in parallel for large-scale network scenarios, in which case network traffic is represented as simulation events passed as messages to remote simulation instances running on different machines. We note that substantial overhead exists in parallel real-time simulation to support synchronization and communication among distributed instances, which can significantly limit the performance and scalability of the hybrid approach. To overcome these challenges, we propose several techniques for improving the performance of parallel real-time simulation, by eliminating parallel synchronization and reducing communication overhead. Our experiments show that the proposed techniques can indeed improve the overall performance. In a use case, we demonstrate that our hybrid technique can be readily integrated for studies of software-defined networks.

MASCOTS’17 Paper: Energy Demand Response Scheduling

An Energy Efficient Demand-Response Model for High Performance Computing Systems, Kishwar Ahmed, Jason Liu, and Xingfu Wu. In Proceedings of the 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2017), September 2017.  [paper]


Demand response refers to reducing the energy consumption of participating systems in response to a transient surge in power demand or other emergency events. Demand response is particularly important for maintaining power grid transmission stability, as well as for achieving overall energy savings. High Performance Computing (HPC) systems can be considered ideal participants for demand-response programs, due to their massive energy demand. However, the potential loss of performance must be weighed against the possible gain in power system stability and energy reduction. In this paper, we explore the opportunity of demand response on HPC systems by proposing a new HPC job scheduling and resource provisioning model. More specifically, the proposed model applies power-bound, energy-conserving job scheduling during critical demand-response events, while maintaining traditional performance-optimized job scheduling during normal periods. We expect such a model can attract willing participation of HPC systems in demand-response programs, as it can improve both power stability and energy savings without significantly compromising application performance. We implement the proposed method in a simulator and compare it with the traditional scheduling approach. Using trace-driven simulation, we demonstrate that HPC demand response is a viable approach toward power stability and energy savings with only a marginal increase in the jobs’ execution time.




HPPAC’17 Paper: Energy-Aware Scheduling

When Good Enough Is Better: Energy-Aware Scheduling for Multicore Servers, Xinning Hui, Zhihui Du, Jason Liu, Hongyang Sun, Yuxiong He, and David A. Bader. In Proceedings of the 13th Workshop on High-Performance, Power-Aware Computing (HPPAC 2017), held in conjunction with the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), May 2017. [paper]

Power is a primary concern for mobile, cloud, and high-performance computing applications. Approximate computing refers to running applications to obtain results with tolerable errors under resource constraints, and it can be applied to balance energy consumption with service quality. In this paper, we propose a “Good Enough (GE)” scheduling algorithm that uses approximate computing to provide satisfactory QoS (Quality of Service) for interactive applications with significant energy savings. Given a user-specified quality level, the GE algorithm works in the AES (Aggressive Energy Saving) mode for the majority of the time, neglecting the low-quality portions of the workload. When the perceived quality falls below the required level, the algorithm switches to the BQ (Best Quality) mode with a compensation policy. To avoid core speed thrashing between the two modes, GE employs a hybrid power distribution scheme that uses the Equal-Sharing (ES) policy to distribute power among the cores when the workload is light (to save energy) and the Water-Filling (WF) policy when the workload is high (to improve quality). We conduct simulations to compare the performance of GE with existing scheduling algorithms. Results show that the proposed algorithm can provide large energy savings with satisfactory user experience.
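
The two power distribution policies named in the abstract can be sketched in their generic textbook forms. The abstract does not spell out the paper's exact definitions, so the code below is an assumption: `equal_sharing` splits a power budget evenly, and `water_filling` raises a common level until the budget is exhausted while never giving a core more than it asks for. The per-core `demands` list and both function names are hypothetical.

```python
def equal_sharing(budget, demands):
    """Give every core the same share of the power budget."""
    share = budget / len(demands)
    return [share] * len(demands)

def water_filling(budget, demands):
    """Raise a common 'water level' until the budget is used up; a core
    never receives more than its own demand."""
    if sum(demands) <= budget:
        return list(demands)              # enough power for everyone
    alloc = [0.0] * len(demands)
    remaining = budget
    # Visit cores from smallest to largest demand so that power left over
    # by lightly loaded cores flows to the more heavily loaded ones.
    order = sorted(range(len(demands)), key=lambda i: demands[i])
    for pos, i in enumerate(order):
        level = remaining / (len(order) - pos)   # fair share of what is left
        alloc[i] = min(demands[i], level)
        remaining -= alloc[i]
    return alloc
```

With a budget of 12 and demands of 2, 10, and 6, equal sharing gives each core 4, whereas water filling gives 2, 5, and 5: the lightly loaded core keeps only what it needs and the rest is split between the busy cores.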
author={X. Hui and Z. Du and J. Liu and H. Sun and Y. He and D. A. Bader},
booktitle={Proceedings of the 13th Workshop on High-Performance, Power-Aware Computing (HPPAC 2017), held in conjunction with 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017)},
title={When Good Enough Is Better: Energy-Aware Scheduling for Multicore Servers},

ICC’17 Paper: Mininet Symbiosis

Distributed Mininet with Symbiosis, Rong Rong and Jason Liu. In Proceedings of the IEEE International Conference on Communications (ICC 2017), May 2017.  [paper]

Mininet is a container-based emulation environment that can study networks with virtual hosts and OpenFlow-enabled virtual switches on Linux. However, it is well known that experiments using Mininet may lose fidelity for large-scale networks and heavy traffic loads. One solution is to use a distributed setup where an experiment consists of multiple instances of Mininet running on a cluster, each handling a subset of virtual hosts and switches. Such an arrangement, however, is still constrained by bandwidth and latency limitations in the physical connection between the instances. In this paper, we propose a novel method of integrating distributed Mininet instances using a symbiotic approach, which extends an existing method for combining real-time simulation and emulation. We use an abstract network model to coordinate the distributed instances, which are superimposed to represent the target network. In this case, one can more effectively study the behavior of real implementations of network applications on large-scale networks, since the interaction between the Mininet instances only captures the effect of contention among network flows in shared queues, as opposed to having to exchange individual network packets, which can be limited by bandwidth or sensitive to latency. We provide a prototype implementation of the new approach and present validation studies to show it can achieve accurate results. We also present a case study that successfully replicates the behavior of a denial-of-service (DoS) attack protocol.
author={R. Rong and J. Liu},
booktitle={2017 IEEE International Conference on Communications (ICC)},
title={Distributed mininet with symbiosis},

ICBDA’17 Paper: MOOC Learning Zipf Law

Zipf’s Law in MOOC Learning Behavior, Chang Men, Xiu Li, Zhihui Du, Jason Liu, Manli Li, and Xiaolei Zhang. In Proceedings of the 2nd IEEE International Conference on Big Data Analysis (ICBDA 2017), March 2017. [paper]

Learners participating in Massive Open Online Courses (MOOCs) have a wide range of backgrounds and motivations. Many MOOC learners sign up for courses to take a brief look; only a few go through the entire content, and even fewer eventually obtain a certificate. We discovered this phenomenon after examining 76 courses on the xuetangX platform. More specifically, we found that in many courses the learning coverage (one of the metrics used to estimate the learners’ active engagement with the online courses) follows a Zipf distribution. We apply the maximum likelihood estimation method to fit Zipf’s law and test our hypothesis using a chi-square test. The results of our study are expected to bring insight into the unique learning behavior on MOOCs and thus help improve the effectiveness of MOOC learning platforms and the design of courses.
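
A minimal sketch of fitting Zipf's law by maximum likelihood and computing a chi-square statistic follows. It assumes a finite-support Zipf distribution with p(k) proportional to k^(-s) over ranks 1..N and an input of observation counts per rank; the paper's actual data preparation and test procedure are not described in the abstract, and all function names here are hypothetical.

```python
import math

def zipf_pmf(s, n_ranks):
    """Zipf probabilities p(k) proportional to k**(-s) for ranks 1..n_ranks."""
    weights = [k ** -s for k in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def fit_zipf_mle(counts, lo=0.01, hi=10.0, iters=100):
    """Maximum likelihood estimate of the exponent s.

    counts[k-1] is the number of observations with rank k. The
    log-likelihood is concave in s, so a ternary search over [lo, hi]
    converges to its maximum (assuming the maximum lies in that range).
    """
    def loglik(s):
        pmf = zipf_pmf(s, len(counts))
        return sum(c * math.log(p) for c, p in zip(counts, pmf))

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

def chi_square_stat(counts, s):
    """Pearson chi-square statistic of the observed counts against Zipf(s)."""
    n = sum(counts)
    expected = [n * p for p in zipf_pmf(s, len(counts))]
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))
```

The statistic would then be compared against a chi-square critical value with N - 2 degrees of freedom (N categories, minus one, minus one estimated parameter) to decide whether the Zipf hypothesis is rejected.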

HPCC’16 Paper: HPC Interconnect Model

Scalable Interconnection Network Models for Rapid Performance Prediction of HPC Applications, Kishwar Ahmed, Jason Liu, Stephan Eidenbenz, and Joe Zerr. In Proceedings of the 18th International Conference on High Performance Computing and Communications (HPCC 2016), December 2016. [paper] [slides]

Performance Prediction Toolkit (PPT) is a simulator mainly developed at Los Alamos National Laboratory to facilitate rapid and accurate performance prediction of large-scale scientific applications on existing and future HPC architectures. In this paper, we present three interconnect models for performance prediction of large-scale HPC applications. They are based on interconnect topologies widely used in HPC systems: torus, dragonfly, and fat-tree. We conduct extensive validation tests of our interconnect models, in particular, using configurations of existing HPC systems. Results show that our models provide good accuracy for predicting the network behavior. We also present a performance study of a parallel computational physics application to show that our model can accurately predict the parallel behavior of large-scale applications.
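
To give a sense of the kind of quantity an interconnect model must compute, the sketch below returns the minimal hop count between two nodes of a torus, where every dimension wraps around into a ring. It assumes shortest-path routing and is not taken from PPT; the function name and the coordinate representation are assumptions for this example.

```python
def torus_hops(src, dst, dims):
    """Minimal hop count between two nodes of a torus.

    src, dst: coordinate tuples; dims: size of each dimension.
    Each dimension is a ring, so a packet may travel in either direction.
    """
    hops = 0
    for a, b, size in zip(src, dst, dims):
        d = abs(a - b)
        hops += min(d, size - d)   # take the shorter way around the ring
    return hops
```

For instance, in an 8x8x8 torus the nodes (0, 0, 0) and (7, 1, 4) are 6 hops apart, because the first dimension needs only one hop across the wraparound link.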
author={K. Ahmed and J. Liu and S. Eidenbenz and J. Zerr},
booktitle={Proceedings of the IEEE 18th International Conference on High Performance Computing and Communications (HPCC)},
title={Scalable Interconnection Network Models for Rapid Performance Prediction of HPC Applications},

WSC’16 Paper: Simulation Reproducibility

Panel – Reproducible Research in Discrete-Event Simulation – A Must or Rather a Maybe? Adelinde M. Uhrmacher, Sally Brailsford, Jason Liu, Markus Rabe, and Andreas Tolk. In Proceedings of the 2016 Winter Simulation Conference (WSC 2016), T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds., December 2016. [paper]

Scientific research should be reproducible, and simulation research is no exception. The question, however, is whether this is really the case. In some application areas of simulation, e.g., cell biology, simulation studies cannot be published without the data, models, and methods, including computer code, being made available for evaluation. How the problem of reproducibility is assessed and addressed differs across the application and methodological areas of modeling and simulation. The diversity of answers to this question will be illuminated by looking into the areas of network simulation and of simulation in logistics, in the military, and in health. Making different scientific cultures, different challenges, and different solutions in discrete-event simulation explicit is central to improving the reproducibility, and thus the quality, of discrete-event simulation research.
author={A. M. Uhrmacher and S. Brailsford and J. Liu and M. Rabe and A. Tolk},
booktitle={2016 Winter Simulation Conference (WSC)},
title={Panel–Reproducible research in discrete event simulation–A must or rather a maybe?},