Research

Documents

a list of my publications (journals, conference's articles and freely available research reports)
some material from the talks I have given

Pipelining Collective Communications

Motivations and Principles

Recently, the advances in high performance networks and in communication library dedicated to distributed computing make it possible to use heterogeneous network of workstations instead of dedicated expensive supercomputers such as massively parallel calculators. However, even if it is now possible to obtain tremendous performances on a given network, it is difficult to keep these performances as soon as we gather several networks (in the case of Grid computing). This is mainly due to the heterogeneity of the obtained platform, and the effect of bottlenecks. Moreover, discovering the topology of the computing platform is often challenging, although it is crucial for the efficient deployment of grid applications. Considering the communications involved in such distributed applications, we can gather them into some macro-communication schemes, such as broadcasts, scatters, all-to-all or reduce operations. These macro-communication schemes have often been studied with the goal of minimizing their makespan, i.e. the time elapsed between the emission of the first message by the source, and the last reception. But in many cases, the application has to perform a large number of instances of the same operation (for example if data parallelism is used). When dealing with such a series of macro-communications, pipelining is mandatory to achieve good performance. The relevant objective becomes the optimization of the throughput, that is the average number of macro-communications executed per time-unit.

Results

We have studied these collective communications in the context of throughput maximization, and thus designed scheduling strategies with optimal throughput (leading to asymptotically optimal schedules with respect to the classical make-span metric) for most of the classical collective communication primitives. Here is a summary of our work in this field:

We first studied the Scatter and Reduce operations, and developed a new framework for studying this problem, using a relaxation based on linear programming. This results in an article in APDCM'04. An extended journal version (in JPDC) will be soon available.
Then, we moved to the study of the most famous collective communication schemes: Broadcast and Multicast. As replication of the messages is allowed for these communications (and necessary to reach good performances), these problems are more challenging than the previous ones. However, using graph techniques, we are able to design the same kind of schedules (optimal with the respect of the throughput metric) for the broadcast problem (see the article in IPDPS'04, and the extended version in the IEEE TPDS journal). The complexity of Multicast operations turns out to be more difficult: we have shown that this is in fact an NP-complete problem (see the article in ICPP'04).
All previous studies assume a bidirectional one-port model: at each time-step, each machine is available to perform at most one sending and one receiving communications. Although this model is widely accepted, we tried to adapt our results to the unidirectional one-port model, when a machine cannot perform simultaneously a sending and a receiving operations. Surprisingly, even the Scatter problem turns out to be much more complicated than in the bidirectional case, but we design an approach based on the ellipsoid method for solving big (non polynomial) linear problems (this technical method is available as a research report and in our IJHPCA paper ).
As previous approach for solving problems such as Broadcasting a series of messages may lead to a big control overhead (as we are using several concurrent broadcast trees), we have designed some heuristics to get good performances while keeping the control complexity low (that is using a single broadcast tree). We present this work using a large range of platform models in the article in IPDPS'05.

Other scheduling issues

Steady-State Relaxation

The idea of relaxation makespan minimization problems using linear programming techniques was also applied to some classical scheduling problems.
We studied the problem of scheduling series (or stream) of identical DAGs with different input data on heterogeneous platforms, and found a way to compute a periodic throughput reaching optimal throughput for simple DAGs (that is with bounded depth of dependencies). This work is available as a journal article in PPL. This limitation to bounded dependencies is not articifial: we have shown that the general problem is NP-complete: see the corresponding research report for a complete (and rather technical) view, of the article in HeteroPar'04 which summarizes the impact and limits of this steady-state approach.

From steady state theory to concrete scheduling problems

The previous study of steady-state scheduling has been extended to several problems, which are briefly stated here:

Independant tasks scheduling In an article presented at PDP'05, we have studied the impact of introducing memory constraints to the classical problems of scheduling independent tasks on a master-slave platform.

Scheduling divisible load applications on a large-scale network Using a new model for the communications between sites in a Grid computing environment, we have investigated a new framework for scheduling divisible load on such a platform. This work is available as a research report and an IPDPS'05 article.

Scheduling multiple bag-of-tasks applications We have studied the problem of scheduling multiple applications consisting in a large number of same-size independent tasks. The steady-state approach gives an optimal schedule, but need all the information on the platform to be gathered at the master. Thus we have investigated heuristics to give a distributed answer to this problem. This has been presented IPDPS 2006 (slides available in the talks section)

Simulation

When dealing with scheduling issues on distributed platforms, the question of the justifications of the algorithms and heuristics proposed is a key problem.
One approach is to perform experiments with real applications on real resources. However, modern computing platforms are increasingly distributed and often span multiple administrative domains. Therefore, resource availability fluctuations make it impossible to conduct repeatable experiments for relatively long running applications. Another problem is that the number of platform configurations that can be explored is limited. As a results of these difficulties with real experimentations, most researchers have resorted to discrete-event simulation.
The main advantage of using simulation is the ability to compare several algorithms on a wide variety of heterogeneous platforms (thanks to fast execution of simulations) ensuring the reproducibility of measured data, which is not possible on a distributed platform which is not dedicated to the experiments.
These are the motivations for the development of the SimGrid toolkit, which allows users to simulate distributed algorithms in a heterogeneous distributed environment. I have participated in the design of a new network model for this simulator. The technical results of this work can be found in the corresponding research report, while an article in CCGrid'03 presents the second version of this simulator (which takes new model into account).