Thursday June 28th
8.30 am : Henri Casanova Random Topologies for HPC platforms.
As the scales of parallel applications and platforms increase the negative impact of communication latencies on per formance becomes large. Fortunately, modern High Performance Computing (HPC) systems can exploit lowlatency topologies of highradix switches. In this talk I'll describe the use of random shortcut topologies, which are generated by augmenting classical topologies with random links. Graph analysis shows that these topologies, when compared to nonrandom topologies of the same degree, lead to drastically reduced diameter and average shortest path length. The best results are obtained when adding random links to a ring topology, meaning that good random shortcut topologies can easily be generated for arbitrary numbers of switches. Flitlevel discrete event simulation shows that random shortcut topologies achieve throughput comparable to and latency lower than that of existing nonrandom topologies such as hypercubes and tori. Random topologies give rise to several practical challenges, including routing scalability and larger physical cable lengths, which will be discussed.
9.15 am : Yves Robert  Unified Model for Assessing Checkpointing Protocols at ExtremeScale
Authors: G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello,J. Dongarra, A. Guermouche, Th. Herault, Y.Robert, F. Vivien, and D. Zaidouni.
In this talk, we present a unified model for several wellknown checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space: on one side the coordinated checkpoint, and on the other extreme, a variety of uncoordinated checkpoint strategies (with message logging).
We identify a set of parameters that are crucial to instantiate and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. This comparison outlines the comparative behaviors of checkpoint strategies at scale, thereby providing insight that is hardly accessible to direct experimentation.
10.30 am : Nicolas Maillard  Online Scheduling of Parallel Programs on Hybrid Architectures.
This presentation will present an effort to improve the scheduling techniques on hybrid architectures (MultiCPUs/MultiGPUs), for irregular and/or small grained parallel programs. For such programs, online strategies are used, usually based on Work Stealing, which has been devised for homogeneous architectures. Two series of results will be presented on hybrid platforms, the first one with a task deque handled at the GPU level, and the second one based on assynchronous communication between GPU and CPU.
11.15 am : Thomas Herault  The DAGuE framework
In this talk, I will present the DAGuE framework. DAGuE aims at enabling scientific computing on large scale distributed environments featuring many cores, accelerators and high speed networks. The framework includes libraries, a runtime system, and development tools to help application developers tackle the difficult task of porting their applications to highly heterogeneous and diverse environment. Within the DAGuE runtime, tasks are dynamically scheduled between the computing units of nodes, based on a parameterized synthetic representation of the data flow. The talk will focus on this representation, its expressivity, and the scheduling strategies of the runtime. I will also briefly introduce the productive tools available for the users to tune and evaluate their data flow efficiency.
1.30 pm : Olivier Beaumont  Bin Packing and Server Consolidation
2.15 pm : Stephen Scott
3.30 pm : Thilo Kielmann  Task Farming in Cloud Environments under Budget Constraints.
4.15 pm : Frédéric Vivien  Checkpointing strategies for parallel jobs.
We provide an analysis of checkpointing strategies for minimizing expected job execution times in an environment that is subject to processor failures. In the case of both sequential and parallel jobs, we give the optimal solution for exponentially distributed failure interarrival times, which, to the best of our knowledge, is the first rigorous proof that periodic checkpointing is optimal. For nonexponentially distributed failures, we develop a dynamic programming algorithm to maximize the amount of work completed before the next failure, which provides a good heuristic for minimizing the expected execution time. Our work considers various models of job parallelism and of parallel checkpointing overhead. We first perform extensive simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of realworld systems. The obtained results not only corroborate our theoretical findings, but also show that our dynamic programming algorithm significantly outperforms previously proposed solutions in the case of Weibull failures. We then discuss results from simulation experiments that use failure logs from production clusters. These results confirm that our dynamic programming algorithm significantly outperforms existing solutions for realworld clusters.
Friday June 29th
8.15 am : Julien Langou  Hierarchical QR factorization algorithms for multicore cluster systems.
Authors: Jack Dongarra, Mathieu Faverge, Thomas Herault, Julien Langou, and Yves Robert.
This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multicore nodes. These platforms make the present and the foreseeable future of highperformance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism).
9.00 am : Mathieu Faverge
PHD SESSION  10.15 am12.00 pm
10.15 am : Ahmed Abousamra  Dejavu Switching for multiplane Network on a chip.
In chipmultiprocessors (CMPs) the networkonchip (NoC) carries cache coherence and data messages. These messages may be classified into critical and noncritical messages. Hence, instead of having one interconnect plane to serve all traffic, power can be saved if the NoC is split into two planes: a fast plane dedicated to the critical messages and a slower, more powerefficient plane dedicated only to the noncritical messages. This split, however, can be beneficial to save energy only if system performance is not significantly degraded by the slower plane. In this work we first motivate the need for a timely delivery of the “noncritical” messages. Second, we propose D´ej`a Vu switching, a simplealgorithm that enables reducing the voltage and frequency of one plane while reducing communication latency through circuit switching and support
of advance, possibly conflicting, circuit reservations. Finally, we study the constraints that govern how slow the powerefficient plane can operate without negatively impacting system performance.
10.30 am : Daniel Cole  Speed Scaling for Stretch Plus Energy.
We consider speed scaling problems where the objective is to minimize a linear combination of arbitrary scheduling objective S, and energy E. A natural conjecture is that there is an O(1)competitive algorithm for S on a fixed speed processor if and only if there is an O(1)competitive algorithm for S + E on a processor with an arbitrary power function. We give evidence to support this conjecture by providing an O(1)competitive algorithm for the objective of integer stretch plus energy.
10.45 am : Kim Klein  Robust algorithms for online optimization problems.
We consider online optimization problems (i.e. online scheduling), where jobs arrive by time. At each time t, the algorithm has to provide a good solution for the current instance. In contrast to the classical online scheduling problem, a robust algorithm is allowed to reassign a certain amount of jobs whenever a new job arrives. We discuss some rounding and LP techniques to derive efficient robust algorithms.
11.00 am : Bradley Lowery  A New Algorithm for the Collective Communication AlltoOne Operation Reduce.
We study the pipelined reduction on a uniform unidirectional network of distributed processes under the linear (Hockney) model. We present a new algorithm based on a greedy tree which we theoretically compare with three standard algorithms: binary, binomial, and pipeline. The segmentation of the messages is initially constrained to be uniform and each algorithm is studied with its own optimal block size. We recall wellknown results: (1) for short size messages, the binomial algorithm is better than the binary and the pipeline ones; (2) for medium size messages, the binary algorithm is better than the binomial and the pipeline ones; (3) for long size messages, the pipeline algorithm is better than the binomial and the binary ones. Our new algorithm (greedy) outperforms all these algorithms for any message sizes (given appropriate tuning of the message segmentation). Indeed we can prove it is optimal for a given segmentation. For example, for 64 processes, neglecting computation, and with a latency ten times larger than the inverse of the bandwidth, (bandwidth in bytes per unit of time,) the greedy algorithm theoretically performs as binary for small messages, 50% better than binary for medium messages (which here is defined by 1 KB to 4 KB), and slightly better than pipeline for large messages. Secondly, we present experimental results where we compare MPI implementation of the four algorithms (greedy, binary, binomial and pipeline) based on MPI_Send and MPI_Recv protocols together with MPI_Reduce on 64 nodes for message of size 4 Bytes to 16 MB. We recover the trends explained previously. Our implementation of the greedy algorithm is in practice more efficient than MPI_Reduce except for very large messages (more than 4 MB). Finally we study theoretically nonuniform segmentations of the messages. We find that, for the greedy algorithm, for some configuration,nonuniform segmentations do perform better than any uniform segmentation. For example, for 12 processes, neglecting latency, and a message of 10 units, we find that the best uniform segmentation for greedy is (1,1,...,1,1) while the best nonuniform segmentation is (2,2,1,1,1,1,1,1). The nonuniform segmentation is 7% better than the best uniform segmentation.
11.15 am : Adrian Muresan  Finding allocations for Budgetconstrained PTG workflow applications.
Many scientific applications can be described through workflows due to their inherent structure or to the fact that they are built by aggregating other smaller applications and adding flow connections between them.
Modern computational architectures are evolving in the direction of parallel computing through multicore, the generalization of GPUs and compute clusters. This leads to applications being composed of not only sequential tasks, but also parallel ones which leads to more efficient ways of exploiting modern hardware. In the case of workflow applications, a task is called moldable if it can be run on a variable number of resources chosen at scheduling time and the workflow itself is called a Parallel Task Graph (PTG).
Although most workflow applications can be fully described by a PTG,there are some that can only be described by a PTG that is augmented with special semantics for exclusive diverging control flows or repetitive flows. The resulting PTG is one that can have nondeterministic transitions and repetitive constructs.
The proliferation of Cloud Computing has led to a new way of managing resources: ondemand as opposed to static allocations. PTG applications can take advantage of this, but the classic approached to resource
allocation and scheduling need to be adapted for ondemand resource provisioning. As a starting point, the execution of an application now corresponds to a budget, that translates to a virtual platform size.
The current work presents an approach for determining budgetconstrained resource allocations for nondeterministic PTG workflows, on IaaS platforms. We solve this problem by decomposing the initial workflow
into a set of deterministic subworkflows, for which we find allocations by adapting a wellknown existing allocation algorithm.
11.30 am : Paul RenaudGoud  Poweraware Manhattan routing on chip multiprocessors.
We investigate the routing of communications in chip multiprocessors (CMPs). The goal is to find a valid routing in the sense that the amount of data routed between two neighboring cores does not exceed the maximum link bandwidth while the power dissipated by communications is minimized. Therefore, we consider a set of communications that have to be routed between the cores of the CMP. The frequency of the links is scalable and it is proportional to the throughput of the link. The most natural and widely used algorithm to handle all these communications is XY routing. However, if it is allowed to use all Manhattan paths between the source and the destination, the consumed power can be reduced dramatically. We compare XY routing and Manhattan routing, both from a theoretical and from a practical point of view.
We establish the NPcompleteness of the problem of finding a Manhattan routing that minimizes the dissipated power, we exhibit the minimum upper bound of the ratio power consumed by an XY routing over power consumed by a Manhattan routing, and finally we perform simulations to assess the performance of Manhattan routing heuristics that we designed.
11.45 am : Dounia Zaidouni On the complexity of scheduling checkpoints for computational workflows.
This paper deals with the complexity of scheduling computational workflows in the presence of Exponential failures. When such a failure occurs, rollback and recovery is used so that the execution can resume from the last checkpointed state. The goal is to minimize the expected execution time, and we have to decide in which order to execute the tasks, and whether to checkpoint or not after the completion of each given task. We show that this scheduling problem is strongly NPcomplete, and propose a (polynomialtime) dynamic programming algorithm for the case where the application graph is a linear chain. These results lay the theoretical foundations of the problem, and constitute a prerequisite before discussing scheduling strategies for arbitrary DAGS of moldable tasks subject to general failure distributions.
Saturday June 30th
8.30 am : Kunal Agrawal  Scheduling streaming applications in order to minimize the number of cachemisses.
We consider the problem of scheduling streaming applications in order to minimize the number of cachemisses. Streaming applications are represented as a directed graph (or multigraph), where nodes are computation modules and edges are channels. When a module fires, it consumes some dataitems from its input channels and produces some items on its output channels. In addition, each module may have some state (either code or
data) which represents the memory locations that must beloaded into cache in order to execute the module.
Our main contribution is to show that for a large and important class of streaming computations,cacheefficient scheduling is essentially equivalent to solving a constrained graph partitioning problem. A streaming computation from this class has a cacheefficient schedule if and only if its graph has a lowbandwidth partition of the modules into components (subgraphs) whose total state fits within the
cache, where the bandwidth of the partition is the number of data items that cross intercomponent channels per data item that enters the graph.
9.15 am : Anthony Danalis  Automatic generation of tasks for the DAG scheduling engine DAGuE.
10.30 am : Aurélien Bouteiller  Fault Tolerance Techniques for MPI programs.
Mean time to interrupt on HPC leadership system is expected to be very low in the next generation, which would provoke the current manageable failure issues to worsen significantly. In this presentation, an outlook of the available fault tolerant techniques, from system approaches to application driven recovery, is given and the challenges hindering their wide adoption are discussed.
11.15 am : Emmanuel Jeannot
