GRI proposal funded

Dec 2, 2018

Monitoring Health Status of High Performance Computing Systems

Monitoring data centers is challenging due to their size, complexity, and dynamic nature. This project proposes a visual approach for situational awareness and health monitoring of high-performance computing systems. The visualization requirements span three dimensions: 1) the spatial layout of the high-performance computing system, 2) the temporal domain (historical vs. real-time tracking), and 3) system health services such as temperature, CPU load, memory usage, fan speed, and power consumption. We demonstrate the developed prototype on a medium-scale data center of 10 racks and 467 hosts.

The work was developed in collaboration with both industrial and academic domain experts:

  • Dr. Yong Chen, Department of Computer Science, Texas Tech University.
  • Dr. Alan Sill, Managing Director of HPCC; Co-Director, NSF CAC.
  • Jon Hass, SW Architect at Dell Inc.; Chairman of the board, DMTF.

Students:

  • Ngan Nguyen, PhD student, Department of Computer Science, Texas Tech University.
  • Ghazanfar Ali, PhD student, Department of Computer Science, Texas Tech University.