Emmanuel Agullo (speaker), Luc Giraud, Abdou Guermouche, and Jean Roman

Complexity analysis of sparse hybrid linear solvers

Abstract:   Sparse hybrid solvers are a trade-off between direct methods and iterative methods. Part of the computation is first performed with a direct method in order to ensure numerical robustness; the algorithm then switches to an iterative method to reduce the computational complexity and memory usage. The convergence and number of iterations in the second step depend on the amount of computation performed in the direct part. In this talk, we propose a model to compute the subsequent computational and memory costs as a function of the fraction of the computation performed with the direct method. We apply this model to several classes of academic problems and illustrate the possible trade-offs in terms of computational and memory complexity.


slides


Lionel Eyraud-Dubois and Olivier Beaumont

Bounded Multi-Port Model: Motivations, Feasibility & Application to Broadcast

Abstract:   The bounded multi-port model is a model for communications in distributed computing platforms, in which each node may be involved in several simultaneous communications, provided bandwidth constraints are respected. In highly heterogeneous environments, this model makes better use of the available bandwidth of large server nodes. In this talk, we discuss its feasibility by showing that simple optimal algorithms exist for basic problems, and we show that implementing them requires QoS mechanisms to enforce the prescribed bandwidth sharing. We also study steady-state application-level broadcast, with an additional degree constraint to ensure the realism of the obtained solution. For this particular problem, we provide a polynomial algorithm based on a minimal resource augmentation.
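
For concreteness, the feasibility constraint of the bounded multi-port model can be written as follows (a sketch in our own notation, not necessarily that of the talk): a node $i$ with bandwidth bound $B_i$ may take part in any set of simultaneous transfers whose allocated rates sum to at most its bound,

\[
  \forall i:\quad \sum_{c \,\ni\, i} b_c \;\le\; B_i ,
\]

where $b_c$ denotes the rate allotted to communication $c$. Enforcing a prescribed choice of the $b_c$ in practice is exactly what the QoS mechanisms mentioned above are needed for.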

slides


Rob H. Bisseling (speaker), Bas O. Fagginger Auer, and Albert-Jan N. Yzelman

Sparse matrix partitioning, ordering, and visualisation by Mondriaan 3.0

Abstract:   This talk presents two combinatorial problems encountered in scientific computations on today's high-performance architectures, such as parallel computers with many processors and several cores on each processor, and with sophisticated memory hierarchies and multiple levels of cache.

For parallelism, the most important problem is to partition sparse matrices, graphs, or hypergraphs into nearly equal-sized parts while trying to reduce inter-processor communication. For better cache performance, the problem is to reorder sparse matrices by suitable row and column permutations. Common approaches to such problems involve multilevel methods based on coarsening and uncoarsening (hyper)graphs, matching of similar vertices, searching for good separator sets and good splittings, and two-dimensional matrix splitting methods such as those incorporated in the software package Mondriaan.

We will discuss new algorithms and features included in version 3.0 of the Mondriaan package, to be released soon. By using this package and its subprograms MondriaanPlot and MondriaanMovie, we can visualise the partitioning process of a sparse matrix by various algorithms; we can also do this in Matlab. Mondriaan has now been made into a library that can be called from other programs such as Matlab or Mathematica, or used as a standalone program. New reordering methods have been included, such as Separated Block Diagonal (SBD), along with well-known methods such as Bordered Block Diagonal. Doubly separated and doubly bordered versions are also included.

slides. In case the pdf file does not display the movies, you can download the files here as a tar archive.


Alfredo Buttari

Sparse QR factorization for Multicore architectures

Abstract:   The advent of multicore processors represents a disruptive event in the history of computer science, as conventional parallel programming paradigms have proved incapable of fully exploiting their potential for concurrent computations. The need for different or new programming models clearly arises from recent studies which identify fine granularity and dynamic execution as the keys to achieving high efficiency on multicore systems. This work presents an implementation of the sparse, multifrontal QR factorization capable of achieving high efficiency on multicore systems thanks to a fine-grained, data-flow parallel programming model.

slides


Louis-Claude Canon

A Dynamic Approach for Characterizing Collusion in Desktop Grids

Abstract:   By exploiting idle time on volunteer machines, desktop grids provide a way to execute large sets of tasks with negligible maintenance and low cost. Although desktop grids are attractive for cost-conscious projects, relying on external resources may compromise the correctness of application execution due to the well-known unreliability of nodes. In this paper, we consider the most challenging threat model: organized groups of cheaters that may collude to produce incorrect results. We propose two on-line algorithms for detecting collusion and characterizing the participant behaviors. Using several real-life traces, we show that our approach is accurate in detecting collusion and in characterizing participant behavior.

slides


Pietro Cicotti

Hiding Communication in Scientific Applications with Graph-based Execution

Abstract:   We have developed Tarragon, a library to facilitate the development of latency-tolerant algorithms in order to hide ever-increasing communication delays. Tarragon supports a data-flow execution model, and its runtime system automatically schedules tasks, obviating the need for complicated split-phase coding.

In this talk we briefly demonstrate the benefits of Tarragon on two model applications: finite-difference stencil computations and DNA sequence alignment. Then we discuss in depth the implementation of parallel sparse LU factorization with Tarragon. To this end, we propose a novel task decomposition scheme and discuss the challenges that we face and the trade-offs that arise: specifically, how to expose a high degree of parallelism, reduce the space used for representing the graph, hide the communication latency, and preserve locality. Finally, we present a performance analysis of our preliminary results.
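
To make the execution model concrete, here is a minimal sequential sketch of data-flow task scheduling: tasks fire as soon as all of their predecessors have completed. This is our own illustration, not Tarragon's API; a real runtime would additionally run ready tasks concurrently and overlap them with communication.

```python
from collections import deque

def run_dataflow(tasks, deps):
    """Execute a task DAG in data-flow order.
    tasks: dict mapping task name -> callable (the task body).
    deps:  dict mapping task name -> list of predecessor names."""
    pending = {t: len(deps.get(t, [])) for t in tasks}   # unmet dependencies
    successors = {t: [] for t in tasks}
    for t in tasks:
        for p in deps.get(t, []):
            successors[p].append(t)
    ready = deque(t for t, n in pending.items() if n == 0)
    while ready:
        t = ready.popleft()
        tasks[t]()                 # a parallel runtime would dispatch this to a worker
        for s in successors[t]:    # completing t may release its successors
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)

# Example: run_dataflow({"a": lambda: print("a"), "b": lambda: print("b")},
#                       {"b": ["a"]}) prints "a" then "b".
```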

slides


Anne Benoit, Fanny Dufossé, Alain Girault, and Yves Robert

Reliability and performance optimization of pipelined real-time systems

Abstract:   We consider pipelined real-time systems, commonly found in assembly lines, consisting of a chain of tasks executing on a distributed platform. Their processing is pipelined: each processor executes only one interval of consecutive tasks. We are interested in minimizing both the input-output latency and the period. For dependability reasons, we are also interested in maximizing the reliability of the system. We therefore assign several processors to each interval of tasks, so as to increase the reliability of the system. We assume that both processors and communication links are unreliable and subject to transient failures, whose arrival follows a constant-parameter Poisson law. We also assume that the failures are statistically independent events. We study several variants of this multiprocessor mapping problem, with several hypotheses on the target platform (homogeneous/heterogeneous speeds and/or failure rates). We provide NP-hardness complexity results, as well as optimal mapping algorithms for polynomial problem instances.
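
As a reminder of the computation underlying this reliability model (standard for constant-rate transient failures; the notation is ours): a processor $p$ with failure rate $\lambda_p$ executes an interval of duration $t_p$ successfully with probability $e^{-\lambda_p t_p}$, so replicating the interval on a set $S$ of processors gives, under the independence assumption stated above,

\[
  R(S) \;=\; 1 - \prod_{p \in S}\left(1 - e^{-\lambda_p t_p}\right),
\]

which is why enrolling more processors per interval increases reliability, at the price of extra resource usage that must be traded off against latency and period.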

slides (pdf)


Matthieu Gallet

Non-clairvoyant Scheduling of Multiple Bag-of-tasks Applications

Abstract:   The bag-of-tasks application model, albeit simple, arises in many application domains and has received a lot of attention in the scheduling literature. Previous works propose either theoretically sound solutions that rely on unrealistic assumptions, or ad-hoc heuristics with no guarantees on performance. This work attempts to bridge this gap through the design of non-clairvoyant heuristics based on solid theoretical foundations. The performance achieved by these heuristics is studied via simulations with a view to comparing them both to previously proposed solutions and to theoretical upper bounds on achievable performance. An interesting theoretical result of this work is that a straightforward on-demand heuristic delivers asymptotically optimal performance when either the communications or the computations can be neglected.

slides


Nicolas Gast and Bruno Gaujal

Work Stealing in Large Heterogeneous Systems

Abstract:   Today's large-scale computers are made of many heterogeneous machines. A key issue when designing efficient job allocation algorithms is to balance the load over the different resources.

We will present a Markovian model for performance evaluation of a well-known load balancing technique, namely work stealing, in a large-scale heterogeneous system. Using mean field theory, we show that when the size of the system grows, this model converges to a system of deterministic ordinary differential equations that allows us to efficiently compute average performance indicators (such as average response times) as well as the distributions of these indicators.

We make a full study of this mean field model in the case where all resources are homogeneous, showing in particular that work stealing is very efficient, even when the latency of steals is large.

We also consider the case where geometry plays a role: the system is made of several clusters, and stealing within one cluster is faster than stealing between clusters. In such cases, we compare different work stealing policies based on stealing probabilities, and we show that the main factor in deciding where to steal from is the load rather than the stealing latency.
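
For readers who want to experiment, here is a toy discrete-time simulation of work stealing with an explicit steal latency. It is our own illustrative sketch, not the Markovian mean-field model of the talk; playing with `steal_latency` and the initial load distribution gives a feel for the phenomena discussed above.

```python
import random

def steal_toy(num_procs, initial_load, steal_latency, horizon, seed=0):
    """Toy work-stealing model: each busy processor completes one work unit
    per step; an idle processor picks a random victim and, after
    `steal_latency` steps, takes half of the victim's remaining queue."""
    rng = random.Random(seed)
    queue = list(initial_load)            # queue[p] = work units left on p
    pending = {}                          # thief -> (victim, completion step)
    for step in range(horizon):
        for p in range(num_procs):
            if queue[p] > 0:
                queue[p] -= 1             # work for one step
            elif p in pending:
                victim, ready_at = pending[p]
                if step >= ready_at:      # steal completes
                    grab = queue[victim] // 2
                    queue[victim] -= grab
                    queue[p] += grab
                    del pending[p]
            else:                         # idle: launch a steal attempt
                victim = rng.randrange(num_procs)
                if victim != p:
                    pending[p] = (victim, step + steal_latency)
    return queue

# Example: all work starts on processor 0 and spreads out by stealing.
# print(steal_toy(8, [200] + [0] * 7, steal_latency=4, horizon=100))
```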

slides


Mathias Jacquelin, Loris Marchal, and Yves Robert

Scheduling Under Storage Constraints

Abstract:  

slides


Emmanuel Jeannot

Scheduling applications on GPU and CPU

Abstract:   We consider the problem of scheduling a DAG where each task can be executed on two types of resources (a CPU or a GPU). We first provide a solution for the case of an unbounded number of resources of each type. The solution is based on computations on the adjacency matrix in the max/+ algebra. We then show how to transform this solution when the number of resources of each type is bounded. In collaboration with Denis Barthou, Julien Jeager, and Minhaj Khan.
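
As a generic illustration of the kind of computation involved (our own sketch, not the authors' exact construction): in the max-plus algebra, matrix "multiplication" replaces sum by max and product by addition, so powers of a DAG's weighted adjacency matrix yield longest-path lengths, which is what makes it a natural tool for schedule-length computations.

```python
import numpy as np

NEG_INF = float("-inf")   # the max-plus "zero": no path

def maxplus_product(A, B):
    """Max-plus matrix product: C[i][j] = max_k (A[i][k] + B[k][j])."""
    n, m = A.shape[0], B.shape[1]
    C = np.full((n, m), NEG_INF)
    for i in range(n):
        for j in range(m):
            C[i, j] = max(A[i, k] + B[k, j] for k in range(A.shape[1]))
    return C

# For a weighted adjacency matrix A of a DAG, maxplus_product(A, A)[i][j]
# is the longest two-edge path from task i to task j; iterating until a
# fixed point gives all longest (critical) path lengths.
```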

slides


Domingo Jimenez

Modeling and optimization of scientific software in multicore

Abstract:   The optimization of scientific codes on multicore systems has become important with the universal adoption of these systems as the basic components of computational environments. Scientists have access to multicore systems and, in some cases, to parallel codes for these systems, but the efficient use of these codes requires deep knowledge of parallelism. As an alternative, the systematic modeling of multicore programs could help provide scientists with codes that run efficiently without user intervention. To do so, basic parallel codes are analyzed, and higher-level codes are modeled with information obtained from the execution of the low-level codes. Examples of the application of the technique in different fields (linear algebra, climatology, telecommunications, hydrology...) and experiments on different multicore and shared-memory systems are shown.

slides


Julien Langou

Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures or How to change the underlying DAG of tile algorithms to increase performance

Abstract:   The algorithms in the current sequential numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multicore architectures. A new family of algorithms, the tile algorithms, has recently been introduced. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR factorization. In this talk, we attack the problem of computing the inverse of a symmetric positive definite matrix. We observe that, using a dynamic task scheduler, it is relatively painless to translate existing LAPACK code to obtain a ready-to-be-executed tile algorithm. However, we demonstrate that non-trivial compiler techniques (array renaming, loop reversal, and pipelining) must then be applied to further increase the parallelism of our application. We present preliminary experimental results.
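
For readers unfamiliar with tile algorithms, here is a runnable sketch of the tile Cholesky factorization, typically the first of the sweeps involved in computing the inverse (the kernel names in the comments follow the LAPACK convention; this is our own illustration, not the authors' code). The decomposition into per-tile kernel calls is what defines the task DAG that the talk proposes to restructure.

```python
import numpy as np

def tile_cholesky(A, b):
    """In-place tile Cholesky of an SPD matrix A (size divisible by b);
    the factor L ends up in the lower triangle, one kernel call per tile."""
    nt = A.shape[0] // b
    T = lambda i, j: A[i*b:(i+1)*b, j*b:(j+1)*b]   # view of tile (i, j)
    for k in range(nt):
        T(k, k)[:] = np.linalg.cholesky(T(k, k))             # POTRF task
        for i in range(k + 1, nt):                           # TRSM tasks
            T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T
        for i in range(k + 1, nt):
            T(i, i)[:] -= T(i, k) @ T(i, k).T                # SYRK task
            for j in range(k + 1, i):                        # GEMM tasks
                T(i, j)[:] -= T(i, k) @ T(j, k).T
    return A
```

Each kernel call reads and writes specific tiles, and those data dependencies form the DAG; techniques such as loop reversal and pipelining change the order in which independent tasks become available without changing the numerical result.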

slides


Arnaud Legrand

Scheduling Issues in Volunteer Computing Systems

Abstract:  

slides (pdf)


Joachim Lepping and Uwe Schwiegelshohn

Decentralized Grid-Scheduling by Means of Co-evolutionary Learned Fuzzy-Systems

Abstract:   In this talk, we utilize a competitive co-evolutionary algorithm to optimize the parameter set of a Fuzzy System for job exchange in Computational Grids. In this domain, the providers of High Performance Computing (HPC) centers strive to minimize the response time for their own customers by trying to distribute workload to other sites in the Grid environment. The Fuzzy System is used to steer each site's decisions on whether to distribute or accept workload in a beneficial, yet egoistic direction. This scenario is particularly suited for the application of a competitive co-evolutionary algorithm: the Grid sites' Fuzzy Systems are modeled as species which evolve in different populations. While each species tries to minimize the response time for locally submitted jobs, its individuals' fitness is determined within the commonly shared ecosystem. Using real workload traces and Grid setups, we show that this opportunistic cooperation leads to significant improvements both for each Grid site and for the overall system.

slides


Fredrik Manne

Parallel Greedy Matching Algorithms

Abstract:   We consider parallel implementations of various greedy matching algorithms, such as Minimum Degree and the Karp-Sipser algorithm, on a distributed memory computer. We show that one can obtain quite good speedups, as well as good matchings, using a hypergraph-based partitioning of the matrix. Our objective is to use these matchings as initializations for a more complicated and slower algorithm that computes a perfect matching.
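
As background, here is a serial sketch of the Karp-Sipser heuristic mentioned above (our own illustration; the talk concerns distributed-memory parallel variants). The rule is to match a degree-1 vertex whenever one exists, since such a match is always contained in some maximum matching, and to fall back to a random edge otherwise.

```python
import random

def karp_sipser(adj, seed=0):
    """Greedy matching on an undirected graph given as vertex -> set of
    neighbours; returns a list of matched edges (u, v)."""
    rng = random.Random(seed)
    adj = {u: set(vs) for u, vs in adj.items()}      # work on a copy
    matching = []
    while True:
        alive = [u for u in adj if adj[u]]
        if not alive:
            return matching
        deg1 = [u for u in alive if len(adj[u]) == 1]
        u = deg1[0] if deg1 else rng.choice(alive)   # prefer degree-1 vertices
        v = next(iter(adj[u])) if deg1 else rng.choice(list(adj[u]))
        matching.append((u, v))
        for w in (u, v):                             # remove both endpoints
            for x in adj[w]:
                adj[x].discard(w)
            adj[w] = set()

# Example: karp_sipser({1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}})
# returns [(1, 2), (3, 4)], a maximum matching of this path graph.
```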

slides


Rami Melhem

Power aware scheduling on Multicores

Abstract:   Power is becoming a major resource that should be managed in computer systems. Reducing energy consumption is as important in portable devices as in large data centers. Specifically, reduced energy consumption leads to increased battery life in portable systems and lower operational cost in data centers. Moreover, more power means more generated heat, which is particularly costly in data centers due to the expense of cooling systems.

Frequency and voltage scheduling is a common technique for managing power in multiprocessor systems in general, and chip multiprocessors in particular. It minimizes energy consumption by dynamically matching the computation capacity of the system to the needs of the executing applications. It is crucial, however, to design power-aware scheduling algorithms that do not degrade response time or throughput, since users will not accept any degradation in system performance. In this talk, I will discuss frequency/voltage scheduling in a range of application models, including the Amdahl parallel model, the streaming model, and the unstructured computation model.
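
The intuition behind frequency/voltage scaling can be summarized with the standard power model (a sketch in our notation; the talk's exact model may differ): dynamic power grows roughly cubically with frequency, so executing $W$ cycles of work at frequency $f$ costs about

\[
  E(f) \;\approx\; \left(P_{\text{static}} + c\,f^{3}\right)\frac{W}{f}
       \;=\; \frac{P_{\text{static}}\,W}{f} + c\,W f^{2},
\]

so slowing down saves energy until static power dominates; hence the goal of running just fast enough to meet the performance constraints.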

slides


Paul Renaud-Goud

Sharing resources for performance and energy optimization of concurrent streaming applications

Abstract:   We aim at finding optimal mappings for concurrent streaming applications. Each application consists of a linear chain with several stages, and processes successive data sets in pipeline mode. The objective is to minimize the energy consumption of the whole platform, while satisfying given performance-related bounds on the period and latency of each application. The problem is to decide which processors to enroll, at which speed (or mode) to use them, and which stages they should execute. We distinguish two mapping categories: interval mappings without reuse, and fully arbitrary general mappings.

On the theoretical side, we establish complexity results for this tri-criteria mapping problem (energy, period, latency), classifying polynomial versus NP-complete instances. Furthermore, we derive an integer linear program that provides the optimal solution in the most general case. On the experimental side, we design polynomial-time heuristics, and assess their absolute performance thanks to the linear program. One main goal is to assess the impact of processor sharing on the quality of the solution.

slides


Clément Rezvoy

Scalability of large-scale protein domain inference

Abstract:   The exponential growth of sequence databases challenges the automatic inference of protein domain families. This growth prohibits the use of traditionally sequential algorithms like MkDom2, the algorithm that was used to construct the ProDom database. We present a distributed algorithm for protein domain inference, MPI_MkDom2. This algorithm greatly speeds up the processing of large databases, thereby enabling the construction of new versions of ProDom, while preserving the structure of protein domain families built by MkDom2.

slides


Erik Saule, Doruk Bozdag, and Ümit V. Çatalyürek

Optimizing the maximum stretch of online tasks on a parallel system without preemption

Abstract:   Most task schedulers for production clusters optimize the average or maximum time that tasks spend in the system. In the schedules generated by such schedulers, small tasks may need to wait as long as large tasks. This is unfair to small tasks, as the delay they experience relative to their size is much larger. In this talk, we focus on the optimization of a stretch-based objective function to address this fairness problem. The existing literature on this problem assumes single-machine systems and preemptive tasks; neither assumption applies to production clusters, which consist of hundreds of processors and run applications with large memory footprints. In this talk, our goal is to optimize the maximum stretch of sequential tasks on an arbitrary number of processors without preemption.


We first analyze the First-Come, First-Served (FCFS) algorithm and show that it is a (2D + 1)-approximation algorithm, where D is the ratio of the largest task size to the smallest. We propose a dual-approximation based algorithm and show that it is locally a 2-approximation. Using the adversary technique, we also show that the factor D can hardly be improved upon. Furthermore, using resource augmentation, we demonstrate that having rho additional machines can improve the approximation ratio to the rho-th root of D. Based on this observation, we propose a heuristic that reserves a small portion of the cluster for tasks with large stretch values. Experimental results indicate that our dual-approximation based algorithm is an order of magnitude better than FCFS, and that the machine reservation heuristic can improve performance significantly.
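
For reference, the stretch of a task $j$ with release time $r_j$, completion time $C_j$, and processing time $p_j$ is its flow time normalized by its size,

\[
  s_j \;=\; \frac{C_j - r_j}{p_j},
\]

and the objective discussed above is to minimize $\max_j s_j$, which prevents small tasks from being delayed disproportionately to their length.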

slides


HJ Siegel

Dynamic Robust Resource Allocation in a Heterogeneous Distributed Computing System

Abstract:   Distributed computing systems are often heterogeneous mixtures of machines, used to execute applications with diverse computational requirements. In this study, we consider how to allocate resources to requests that involve the execution of utility applications on archived satellite data. These requests have soft deadlines and arrive dynamically. Our goal is to maximize the percentage of requests that complete by their deadlines. However, the exact execution time of each request on each machine depends on the data set being processed, and is not known in advance. This uncertainty can lead to poor allocation decisions, so it is important for system performance to be robust against it. To accomplish this, we define a mathematical model of stochastic robustness appropriate for this deadline-based dynamic environment that can be used during resource allocation to aid heuristic decision making. This model assumes that stochastic (experiential) information is available about application execution times. Our simulation results show that a heuristic based on this robustness model outperforms several well-known resource allocation techniques.
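
In this setting, a natural way to formalize stochastic robustness (our paraphrase, not necessarily the talk's exact definition) is as the probability that a request meets its deadline, computed from the empirical execution-time distributions: for a request $j$ with random completion time $C_j$ and deadline $d_j$,

\[
  \rho_j \;=\; \Pr\left(C_j \le d_j\right),
\]

and an allocation heuristic can then prefer the machine assignment that maximizes this probability, or its minimum over all pending requests.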

abstract and speaker info (pdf) and slides (pdf)


Oliver Sinnen

Making contention scheduling aware of identical data

Abstract:   It has been widely recognised that task scheduling on parallel systems cannot be accurate and efficient without a realistic view of the communication costs. Considering the limited amount of communication resources and their bandwidth is crucial. This has been addressed by the contention scheduling model, where the communications are scheduled on the network resources, for example the ports in the one-port model. Such contention-aware scheduling is an extension of the classical task scheduling model based on communication delays. A problem, however, is that identical data items are not distinguishable in the task graph. Under the classic model this was not an issue, but it is with contention scheduling. This talk investigates the problem and presents some preliminary results.

slides


Leonel Sousa

Cooperative Execution on Heterogeneous Multi-core Systems

Abstract:   The heterogeneity of parallel systems has increased at various levels. For example, in off-the-shelf desktop systems, not only can different processing elements be found, such as multi-core general-purpose processors (CPUs), Graphics Processing Units (GPUs), or even other accelerators based on Field Programmable Gate Arrays (FPGAs), but buses with different characteristics are also used to interconnect those processing elements (e.g. PCI-Express, FSB). In this talk we mainly address the challenges that arise in accurately modeling such complex heterogeneous parallel systems, in order to take advantage of all available resources to cooperatively compute a single application. Beyond the classical problems of modeling and load balancing, we tackle more practical issues, such as how to accommodate the different programming models and tools adopted on heterogeneous systems. A collaborative execution environment (CHPS) is presented to efficiently perform parallel execution on heterogeneous desktop systems. Moreover, the possibility of exploiting task-level parallelism on this type of system is also discussed, where task scheduling not only needs to be communication-aware but also has to rely on accurate performance modeling.

slides


Frédéric Suter

On Cluster Resource Allocation for Multiple Parallel Task Graphs

Abstract:   Many scientific applications can be structured as Parallel Task Graphs (PTGs), that is, graphs of data-parallel tasks. Adding data-parallelism to a task-parallel application provides opportunities for higher performance and scalability, but poses additional scheduling challenges. In this talk, I will present a study of the off-line scheduling of multiple PTGs on a single, homogeneous cluster. The objective is to optimize performance without compromising fairness among the PTGs. I will detail the range of previously proposed scheduling algorithms applicable to this problem, both from the applied and the theoretical literature, and propose minor improvements where possible. I will also present an extensive evaluation of these algorithms in simulation, using both synthetic and real-world application configurations, with two different metrics for performance and one metric for fairness.

slides


Lamiel Toch and Jean-Marc Nicod

Online scheduling for moldable tasks in clusters

Abstract:   A moldable task is a parallel task whose runtime and surface cannot increase when the number of allocated processors increases. In the literature, moldable task scheduling is studied from an "off-line" point of view. In this talk we tackle moldable tasks for cluster batch-scheduling. We present several algorithms and assess them with different metrics.

Our approach is based on job deadlines. When a job arrives in the waiting queue, the scheduler forecasts its placement using simple first-come-first-served scheduling and assigns it a deadline. The main idea is to use fewer resources than the job requested, so as to schedule it sooner while still respecting its deadline. In practice, cluster node virtualization is used to create "virtual nodes". When a resource is released, jobs in the waiting queue are rescheduled, taking into account the previously calculated deadlines. We use heuristics to choose the best job to schedule with the available resources and to allocate an appropriate number of processors to it.

slides


Denis Trystram

Distributed control scheduling

Abstract:   The new computing platforms are characterized by a large number of computational units, and this distributed nature strongly influences the performance of scheduling algorithms. In this talk, we will first investigate how the classical scheduling algorithms behave in a context of distributed control; we will then propose alternative solutions that guarantee better performance.

slides


Jon Weissman

Living on the Edge: Scheduling Edge Resources Across the Cloud

Abstract:   The standard model of Cloud Computing posits a collection of a few largely centralized data-centers where data and computing are consolidated exploiting locality and economies of scale. This localization, however, introduces two potential problems: (1) it presumes the cost of moving data into the Cloud can be amortized, and (2) it has limited ability to exploit end-user locality. For applications that manipulate large geographically-dispersed and dynamically changing datasets, or that demand frequent response-sensitive end-user interaction, the localized Cloud may be ineffective.

In this talk, we present a new vision of Cloud Computing that addresses these shortcomings. We propose the use of edge computers in two novel ways. For problem (1), we propose Nebulas, a Cloud comprised of edge computers located across the Internet. We show how several classes of applications may be better suited to this new Cloud model. For problem (2), we describe how a proxy network comprised of edge computers can accelerate applications that access one or more Clouds. In both cases, we present the resource management challenges associated with each model.

slides