The architectural characteristics of high-performance computing platforms are constantly evolving. As a consequence, new algorithmic problems arise and existing algorithmic solutions need to be revisited. Among the new characteristics of modern and emerging platforms are their deep hierarchical structure, the increasing importance of memory constraints (the memory wall) and of energy issues, the decrease in the platform's mean time between failures as the number of computing elements increases, and the many sources of uncertainty and dynamicity.
In this lecture series, we will consider some of these problems and present techniques that can be used to solve them. We will start with the overall problem of resource allocation and load balancing, stressing the many sources of uncertainty and dynamicity that can hinder algorithm design. We will then present classical techniques for overcoming some of these problems (such as work stealing and online algorithms) as well as newer ones (such as the supervision of virtual machines). In a second set of lectures, we will consider the problems linked to memory and energy issues. Finally, the series will address the problem of fault tolerance.
Each student will be given two research articles to synthesize, compare, and critique. This work will be evaluated through a written report and an oral presentation.
Note: only the second of the two planned lectures on fault tolerance was (partially) covered during the 2012 Research school on fault-tolerance.
Last modified: Fri Jun 14 15:00:57 CEST 2013