Emmanuel Agullo (speaker), Luc Giraud, Abdou Guermouche, and Jean Roman
Complexity analysis of sparse hybrid linear solvers
Abstract:
Sparse hybrid solvers are a trade-off between direct methods and iterative
methods. Part of the computation is first performed with a direct method
in order to ensure numerical robustness; the algorithm then switches to an
iterative method to alleviate the computational complexity and memory usage.
The convergence and number of iterations in the second step depend on the
amount of computation performed in the direct part. In this talk, we propose
a model for the resulting computational and memory costs as a function of
the fraction of the computation performed in the direct part. We apply this model to several classes of academic problems and illustrate
the possible trade-offs in terms of computational and memory complexity.
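To give a feel for the trade-off being modeled, here is a deliberately crude cost model for a 2D problem (all exponents, constants, and the iteration-count law are generic textbook values, not the model proposed in the talk):

```python
import math

def hybrid_costs(N, k, c_it=10.0):
    """Schematic flop/memory model for a hybrid solve of a 2D problem with
    N unknowns split into k*k subdomains.  Exponents and constants are
    textbook orders of magnitude, NOT the model presented in the talk."""
    n_sub = N / k**2                               # unknowns per subdomain interior
    flops_direct = k**2 * n_sub**1.5               # nested-dissection factorizations
    mem = k**2 * n_sub * math.log(max(n_sub, 2))   # stored factors
    iters = c_it * math.sqrt(k)                    # assumed iteration growth with k
    flops_iter = iters * mem                       # each iteration ~ one solve per factor
    return flops_direct + flops_iter, mem

# Larger k shrinks the direct factorization and its memory footprint, but the
# interface iteration count grows: the kind of trade-off the talk quantifies.
for k in (2, 4, 8, 16):
    print(k, hybrid_costs(1_000_000, k))
```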
slides
Lionel Eyraud-Dubois and Olivier Beaumont
Bounded Multi-Port Model: Motivations, Feasibility & Application to Broadcast
Abstract:
The bounded multi-port model is a model for communications in distributed computing platforms in which each node may be involved in several simultaneous communications, provided bandwidth constraints are respected. In highly heterogeneous environments, this model makes it possible to better use the available bandwidth of large server nodes. In this talk, we discuss its feasibility by showing that simple optimal algorithms exist for basic problems, and we show that implementing them requires QoS mechanisms to enforce the prescribed bandwidth sharing. We also study steady-state application-level broadcast, with an additional degree constraint to ensure the realism of the obtained solution. For this particular problem, we provide a polynomial algorithm based on a minimal resource augmentation.
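In concrete terms, the feasibility constraint of the bounded multi-port model can be stated as follows (standard formulation; the notation is ours):

```latex
% Node i allocates a rate b_{ij} to each simultaneous transfer towards j;
% the only requirement is that its bandwidth capacities are respected:
\sum_{j} b_{ij} \le B_i^{\mathrm{out}},
\qquad\text{and symmetrically}\qquad
\sum_{i} b_{ij} \le B_j^{\mathrm{in}}.
```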
slides
Rob H. Bisseling (speaker), Bas O. Fagginger Auer, and Albert-Jan N. Yzelman
Sparse matrix partitioning, ordering, and visualisation by Mondriaan 3.0
Abstract:
This talk presents two combinatorial problems
encountered in scientific computations on today's
high-performance architectures, such as parallel computers
with many processors and several cores on each processor,
and with sophisticated memory hierarchies and multiple levels of cache.
For parallelism, the most important problem is to partition
sparse matrices, graphs, or hypergraphs into nearly equal-sized
parts while trying to reduce inter-processor communication.
For better cache performance, the problem is to
reorder sparse matrices by suitable row and column permutations.
Common approaches to such problems involve multilevel
methods based on coarsening and uncoarsening (hyper)graphs,
matching of similar vertices, searching for good separator sets
and good splittings, and two-dimensional matrix splitting methods
such as those incorporated in the software package Mondriaan.
We will discuss new algorithms and features included in version 3.0
of the Mondriaan package, to be released soon. By using this package,
and its subprograms MondriaanPlot and MondriaanMovie,
we can visualise the partitioning process of a sparse matrix
by various algorithms. We can also do this in Matlab.
Mondriaan has now been turned into a library that can be called from
other programs such as Matlab or Mathematica, or used as a standalone program.
New reordering methods have been included such as
Separated Block Diagonal (SBD), along with well-known methods such as
Bordered Block Diagonal. Doubly separated and doubly bordered versions
are also included.
slides. In case the pdf file does not display the movies, you can download the files
here as a tar archive.
Alfredo Buttari
Sparse QR factorization for Multicore architectures
Abstract:
The advent of multicore processors represents a disruptive event in
the history of computer science, as conventional parallel programming
paradigms have proven incapable of fully exploiting their potential for
concurrent computations. The need for different or new programming
models clearly arises from recent studies, which identify
fine granularity and dynamic execution as the keys to achieving high
efficiency on multicore systems. This work presents an implementation
of the sparse, multifrontal QR factorization capable of achieving high
efficiency on multicore systems thanks to a fine-grained, data-flow
parallel programming model.
slides
Louis-Claude Canon
A Dynamic Approach for Characterizing Collusion in Desktop Grids
Abstract:
By exploiting idle time on volunteer machines, desktop grids
provide a way to execute large sets of tasks with negligible maintenance
and low cost. Although desktop grids are attractive for cost-conscious
projects, relying on external resources may compromise the correctness
of application execution due to the well-known unreliability of nodes.
In this paper, we consider the most challenging threat model: organized
groups of cheaters that may collude to produce incorrect results.
We propose two on-line algorithms for detecting collusion and
characterizing the participant behaviors. Using several real-life
traces, we show that our approach accurately characterizes participant behavior.
slides
Pietro Cicotti
Hiding Communication in Scientific Applications with Graph-based
Execution
Abstract:
We have developed Tarragon, a library to facilitate the development of
latency tolerant algorithms in order to hide ever increasing
communication delays. Tarragon supports a data-flow execution model,
and its runtime system automatically schedules tasks, obviating the
need for complicated split-phase coding.
In this talk we briefly demonstrate the benefits of Tarragon on two model
applications: finite-difference stencil computations and DNA sequence
alignment. Then we discuss in-depth the implementation of parallel
sparse LU factorization with Tarragon. To this end, we propose a novel
task decomposition scheme and discuss the challenges that we face and
the tradeoffs that arise: specifically, how to expose a high degree of
parallelism, to reduce the space used for representing the graph, to
hide the communication latency, and to preserve locality.
Finally, we present a performance analysis of our preliminary results.
slides
Anne Benoit, Fanny Dufossé, Alain Girault, and Yves Robert
Reliability and performance optimization of pipelined real-time systems
Abstract:
We consider pipelined real-time systems, commonly
found in assembly lines, consisting of a chain of tasks executing on
a distributed platform. Their processing is pipelined: each
processor executes only one interval of consecutive tasks. We are
therefore interested in minimizing both the input-output latency and
the period. For dependability reasons, we are also interested in
maximizing the reliability of the system. We therefore assign
several processors to each interval of tasks, so as to increase the
reliability of the system. We assume that both processors and
communication links are unreliable and subject to transient
failures, whose arrival follows a Poisson law with constant parameter.
We also assume that the failures are statistically independent
events. We study several variants of this multiprocessor mapping
problem with several hypotheses on the target platform
(homogeneous/heterogeneous speeds and/or failure rates). We provide
NP-hardness complexity results, and optimal mapping algorithms for
polynomial problem instances.
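For concreteness, under the stated failure model (transient failures arriving as a Poisson process, statistical independence across components), the reliability expressions take the standard form below; the exact model in the talk may differ in details such as link failures:

```latex
% Processor p with failure rate \lambda_p runs its interval for time t_p:
r_p = e^{-\lambda_p t_p}
% An interval replicated on a set S of processors fails only if every
% replica fails, so by independence its reliability is
R_S = 1 - \prod_{p \in S} \left(1 - e^{-\lambda_p t_p}\right)
```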
slides (pdf)
Matthieu Gallet
Non-clairvoyant Scheduling of Multiple Bag-of-tasks Applications
Abstract:
The bag-of-tasks application model, albeit simple, arises in many
application domains and has received a lot of attention in the scheduling
literature. Previous works propose either theoretically sound solutions that
rely on unrealistic assumptions, or ad-hoc heuristics with no guarantees on
performance. This work attempts to bridge this gap through the design of
non-clairvoyant heuristics based on solid theoretical foundations. The
performance achieved by these heuristics is studied via simulations with a view
to comparing them both to previously proposed solutions and to theoretical
upper bounds on achievable performance. Also, an interesting theoretical result
in this work is that a straightforward on-demand heuristic delivers
asymptotically optimal performance when the communications or the computations
can be neglected.
slides
Nicolas Gast and Bruno Gaujal
Work Stealing in Large Heterogeneous Systems
Abstract:
Today's large-scale computers are made of many heterogeneous machines. A
key issue when designing efficient job allocation algorithms is to
balance the load over the different resources.
We will present a Markovian model for performance evaluation of a well-known
load balancing technique, namely work stealing, in a large-scale
heterogeneous system. Using mean field theory, we show that when the size
of the system grows, this model converges to a system of deterministic
ordinary differential equations that allows us to efficiently compute
average performance indicators (such as average response times) as well
as the distributions of these indicators.
We make a full study of this mean field model in the case where all
resources are homogeneous, showing in particular that work stealing is
very efficient, even when the latency of steals is large.
We also consider the case where the geometry plays a
role: the system is made of several clusters, and stealing within one cluster is faster than
stealing between clusters.
In such cases, we compare different work-stealing policies based on
stealing probabilities, and we show that the main factor in deciding
where to steal from is the load rather than the stealing latency.
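As a toy illustration of the setting (not the Markovian mean-field model of the talk), the following simulation lets idle processors steal half of the most loaded queue, paying a configurable steal latency; all parameters are illustrative:

```python
import random

def simulate(n_fast=4, n_slow=16, steal_delay=5, seed=0):
    """Toy discrete-time work-stealing run on a heterogeneous pool: 'fast'
    processors complete 2 work units per step, 'slow' ones 1.  An idle
    processor steals half of the most loaded queue and waits `steal_delay`
    steps before it can start on the stolen work."""
    rng = random.Random(seed)
    speeds = [2] * n_fast + [1] * n_slow
    load = [0] * len(speeds)          # remaining work units per processor
    avail = [0] * len(speeds)         # time at which stolen work becomes processable
    for _ in range(2000):             # random initial placement of jobs
        load[rng.randrange(len(speeds))] += rng.randint(1, 10)
    t = 0
    while any(load):
        t += 1
        for i, s in enumerate(speeds):
            if load[i] > 0 and t >= avail[i]:
                load[i] = max(0, load[i] - s)       # work for one step
            elif load[i] == 0:
                victim = max(range(len(load)), key=load.__getitem__)
                stolen = load[victim] // 2          # steal half the victim's queue
                load[victim] -= stolen
                load[i] = stolen
                avail[i] = t + steal_delay          # pay the steal latency
    return t                                        # makespan in steps

# Compare makespans for cheap vs. expensive steals:
print(simulate(steal_delay=1), simulate(steal_delay=50))
```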
slides
Mathias Jacquelin, Loris Marchal, and Yves Robert
Scheduling Under Storage Constraint
Abstract:
slides
Emmanuel Jeannot
Scheduling applications on GPU and CPU
Abstract:
We consider the problem of scheduling a DAG where each task can be executed on two types of resources (a CPU or a GPU).
We first provide a solution for the case of an unbounded number of each resource. The solution is based on computations on the
adjacency matrix in the (max,+) algebra. Then we show how to transform this solution for the case of a bounded number of each resource.
In collaboration with Denis Barthou, Julien Jeager and Minhaj Khan.
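A minimal illustration of the (max,+) ingredient (our reconstruction of the standard technique, not necessarily the talk's exact formulation): with edge weights encoding task and communication delays, (max,+) powers of the adjacency matrix yield longest-path lengths, i.e. completion times when resources are unbounded.

```python
NEG = float("-inf")

def maxplus_mul(A, B):
    """(max,+) matrix product: (A*B)[i][j] = max_k (A[i][k] + B[k][j])."""
    n = len(A)
    return [[max(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical 4-task DAG; A[i][j] is the delay along edge i->j, -inf if absent.
A = [[NEG, 3,   2,   NEG],
     [NEG, NEG, NEG, 5],
     [NEG, NEG, NEG, 4],
     [NEG, NEG, NEG, NEG]]

# Longest paths use at most n-1 edges: take the entrywise max of A, A^2, ..., A^(n-1).
n = len(A)
P, best = A, [row[:] for row in A]
for _ in range(n - 2):
    P = maxplus_mul(P, A)
    best = [[max(best[i][j], P[i][j]) for j in range(n)] for i in range(n)]
print(best[0][3])   # critical path 0 -> 1 -> 3: prints 8
```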
slides
Domingo Jimenez
Modeling and optimization of scientific software in multicore
Abstract:
The optimization of scientific codes on multicore systems has become
important with the universal adoption of these systems as
the basic components of computational environments. Scientists
have access to multicore systems and, in some cases, to parallel
codes for these systems, but the efficient use of these codes
requires a deep knowledge of parallelism. As an alternative,
the systematic modeling of multicore programs could help
provide scientists with codes that run efficiently without
user intervention. To do so, basic parallel codes are analyzed,
and higher-level codes are modeled with information obtained
from the execution of the low-level codes. Examples of the
application of the technique in different fields (linear algebra,
climatology, telecommunications, hydrology...) and experiments
on different multicore and shared-memory systems are shown.
slides
Julien Langou
Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures or How to change the underlying DAG of tile algorithms to increase performance
Abstract:
The algorithms in the current sequential numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multicore architectures. A new family of algorithms, the tile algorithms, has recently been introduced. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR factorization. In this talk, we attack the problem of computing the inverse of a symmetric positive definite matrix. We observe that, using a dynamic task scheduler, it is relatively painless to translate existing LAPACK code into a ready-to-be-executed tile algorithm. However, we demonstrate that non-trivial compiler techniques (array renaming, loop reversal, and pipelining) then need to be applied to further increase the parallelism of our application. We present preliminary experimental results.
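For context, the LAPACK route to a symmetric positive definite inverse that the tile algorithm re-expresses consists of a Cholesky factorization followed by inversion from the factor (POTRI, which internally performs TRTRI and LAUUM). A small sketch using SciPy's LAPACK bindings; the tile algorithm performs the same sweeps on small tiles, scheduled as a DAG of tasks:

```python
import numpy as np
from scipy.linalg import lapack

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])                 # small SPD example
c, info = lapack.dpotrf(A, lower=1)        # Cholesky: A = L L^T
assert info == 0
inv, info = lapack.dpotri(c, lower=1)      # inverse from the factor (TRTRI + LAUUM)
assert info == 0
inv = np.tril(inv) + np.tril(inv, -1).T    # POTRI fills one triangle; symmetrize
print(np.allclose(inv @ A, np.eye(2)))     # True
```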
slides
Arnaud Legrand
Scheduling Issues in Volunteer Computing Systems
Abstract:
slides (pdf)
Joachim Lepping and Uwe Schwiegelshohn
Decentralized Grid-Scheduling by Means of Co-evolutionary Learned Fuzzy-Systems
Abstract:
In this talk, we utilize a competitive co-evolutionary
algorithm to optimize the parameter set of a Fuzzy System for job exchange in Computational Grids. In this domain, the providers of High Performance Computing (HPC) centers strive to minimize the response
time for their own customers by trying to distribute workload to
other sites in the Grid environment. The Fuzzy System is used
to steer each site's decisions on whether to distribute or accept
workload in a beneficial, yet egoistic, direction. This scenario is
particularly suited to the application of a competitive co-evolutionary algorithm: the Grid sites' Fuzzy Systems are modeled as species, which evolve in different populations. While each species tries to minimize
the response time for locally submitted jobs, its individuals'
fitness is determined within the commonly shared ecosystem.
Using real workload traces and Grid setups, we show that the
opportunistic cooperation leads to significant improvements both for
each Grid site and for the overall system.
slides
Fredrik Manne
Parallel Greedy Matching Algorithms
Abstract:
We consider parallel implementations of various greedy matching algorithms, such as Minimum Degree and the Karp-Sipser algorithm, on a distributed-memory computer. We show that one can obtain quite good speedups, as well as good matchings, using a hypergraph-based partitioning of the matrix. Our objective is to use these matchings as initializations for a more complicated and slower algorithm that computes a perfect matching.
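For reference, a compact sequential version of the Karp-Sipser heuristic mentioned above (a sketch of the baseline algorithm, not the parallel distributed-memory implementation discussed in the talk):

```python
import random

def karp_sipser(adj, seed=0):
    """Greedy Karp-Sipser matching on an undirected graph given as
    {vertex: set(neighbours)}.  Degree-1 vertices are matched first (always
    safe: some maximum matching contains that edge); otherwise a random
    edge is matched."""
    rng = random.Random(seed)
    adj = {u: set(vs) for u, vs in adj.items()}   # local mutable copy
    matching = []

    def drop(u):                                  # remove u from the graph
        for w in adj.pop(u, ()):
            adj[w].discard(u)

    while any(adj.values()):
        deg1 = [u for u, vs in adj.items() if len(vs) == 1]
        if deg1:
            u = deg1[0]
            v = next(iter(adj[u]))
        else:                                     # no degree-1 vertex: random edge
            u = rng.choice([x for x, vs in adj.items() if vs])
            v = rng.choice(sorted(adj[u]))
        matching.append((u, v))
        drop(u)
        drop(v)
    return matching

# Path a-b-c-d: Karp-Sipser finds the maximum matching {(a,b), (c,d)}.
print(karp_sipser({"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}))
```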
slides
Rami Melhem
Power aware scheduling on Multicores
Abstract:
Power is becoming a major resource that must be managed in computer systems. Reducing energy consumption is as important in portable devices as in large data centers. Specifically, reduced energy consumption leads to increased battery life in portable systems and lower operational cost in data centers. Moreover, more power means more generated heat, which is particularly costly in data centers due to the expense of cooling systems.
Frequency and voltage scheduling is a common technique for managing power in multiprocessor systems in general, and chip multiprocessors in particular. It minimizes energy consumption by dynamically matching the computation capacity of the system to the needs of the executing applications. It is crucial, however, to design power-aware scheduling algorithms that do not degrade response time or throughput, since users will not accept any degradation in the system's performance. In this talk, I will discuss frequency/voltage scheduling for a range of application models, including the Amdahl parallel model, the streaming model, and the unstructured computation model.
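As a concrete instance of the "matching capacity to needs" idea, here is the textbook energy/frequency trade-off (a standard model, not necessarily the one used in the talk):

```latex
% W cycles of work executed at frequency f, with dynamic power P(f) = C f^\alpha
% (\alpha \approx 3 for CMOS):
t(f) = \frac{W}{f}, \qquad E(f) = P(f)\,t(f) = C\,W\,f^{\alpha-1}.
% E is increasing in f for \alpha > 1, so under a deadline d the
% energy-optimal speed is the slowest feasible one: f^{*} = W/d.
```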
slides
Paul Renaud-Goud
Sharing resources for performance and energy optimization of concurrent streaming applications
Abstract:
We aim at finding optimal mappings for concurrent streaming applications. Each application consists of a linear chain with several stages, and processes successive data sets in pipeline mode. The objective is to minimize the energy consumption of the whole platform, while satisfying given performance-related bounds on the period and latency of each application. The problem is to decide which processors to enroll, at which speed (or mode) to use them, and which stages they should execute. We distinguish two mapping categories, interval mappings without reuse, and fully arbitrary general mappings.
On the theoretical side, we establish complexity results for this tri-criteria mapping problem (energy, period, latency), classifying polynomial versus NP-complete instances. Furthermore, we derive an integer linear program that provides the optimal solution in the most general case.
On the experimental side, we design polynomial-time heuristics, and assess their absolute performance thanks to the linear program. One main goal is to assess the impact of processor sharing on the quality of the solution.
slides
Clément Rezvoy
Scalability of large-scale protein domain inference
Abstract:
The exponential growth of sequence databases challenges the automatic
inference of protein domain families. This growth prohibits the use of
traditional sequential algorithms like MkDom2, the algorithm that was
used to construct the ProDom database. We present a distributed
algorithm for protein domain inference, MPI_MkDom2. This algorithm
greatly speeds up the processing of large databases, thereby enabling the
construction of new versions of ProDom, while preserving the structure
of protein domain families built by MkDom2.
slides
Erik Saule, Doruk Bozdag, and Ümit V. Çatalyürek
Optimizing the maximum stretch of online tasks on a parallel system without
preemption
Abstract:
Most task schedulers for production clusters optimize the average or
maximum time that tasks spend in the system. In the schedules
generated by such schedulers, small tasks may need to wait as long as
large tasks. This leads to unfairness against small tasks, as the
delay they experience relative to their size is much larger. In
this talk, we focus on the optimization of a stretch-based
objective function to address this fairness problem. Existing
literature on this problem is based on single machine systems and
preemptive tasks, which do not apply to production clusters
consisting of hundreds of processors and applications with large
memory footprints, respectively. In this talk, our goal is to
optimize the maximum stretch of sequential tasks on an arbitrary number
of processors without preemption.
We first analyze the First-Come, First-Served (FCFS) algorithm
and show that it is a (2D + 1)-approximation algorithm,
where D is the ratio of the largest task's size to the smallest
task's size. We propose a dual-approximation-based algorithm and
show that it is locally a 2-approximation. Using the adversary
technique, we also show that the D factor can hardly be
improved upon. Furthermore, using resource augmentation, we
demonstrate that rho additional machines can improve the
approximation ratio to the rho-th root of D. Based on this
observation, we propose a heuristic that reserves a small portion of
the cluster for tasks with large stretch values. Experimental
results indicate that our dual-approximation based algorithm is an
order of magnitude better than FCFS and the machine reservation
heuristic can improve the performance significantly.
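For reference, the quantities manipulated above can be written compactly (standard definitions, with the bounds as quoted in the abstract):

```latex
% Task j arrives at r_j, runs for p_j and completes at C_j; its stretch is
s_j = \frac{C_j - r_j}{p_j}, \qquad \text{objective: } \min \max_j s_j .
% With D = p_{\max}/p_{\min}, FCFS is a (2D+1)-approximation, and \rho extra
% machines improve the ratio to roughly D^{1/\rho}.
```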
slides
HJ Siegel
Dynamic Robust Resource Allocation
in a Heterogeneous Distributed Computing System
Abstract:
Distributed computing systems are often heterogeneous mixtures of machines, used to execute applications with diverse computational requirements. In this study, we consider how to allocate resources to requests that involve the execution of utility applications on archived satellite data. These requests have soft deadlines and arrive dynamically. Our goal is to maximize the percentage of requests that complete by their deadlines. However, the exact execution time of each request on each machine depends on the data set being processed, and is not known in advance. This uncertainty can lead to poor allocation decisions. It is important for system performance to be robust against uncertainty. To accomplish this, we define a mathematical model of stochastic robustness appropriate for this deadline-based dynamic environment that can be used during resource allocation to aid heuristic decision making. This model assumes that stochastic (experiential) information is available about application execution times. Our simulation results show that a heuristic based on this robustness model outperforms several well-known resource allocation techniques.
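A minimal sketch of the kind of stochastic computation described above (the discretization into time slots and all names are our assumptions, not the talk's actual model): given empirical execution-time distributions, the probability of meeting a deadline is obtained by convolving the per-task distributions.

```python
import numpy as np

def completion_distribution(pmfs):
    """Convolve per-task execution-time PMFs (each a 1-D array indexed by
    discrete time slots) into the PMF of the total completion time of the
    tasks queued on one machine."""
    total = np.array([1.0])
    for pmf in pmfs:
        total = np.convolve(total, pmf)
    return total

def prob_meets_deadline(pmfs, deadline_slot):
    """P(sum of execution times <= deadline): a robustness measure for
    placing a request behind the already-queued ones."""
    dist = completion_distribution(pmfs)
    return dist[: deadline_slot + 1].sum()

# Two queued tasks with uncertain times (1-3 slots each), deadline at slot 4:
q = [np.array([0.0, 0.2, 0.5, 0.3]), np.array([0.0, 0.5, 0.4, 0.1])]
print(prob_meets_deadline(q, 4))
```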
abstract and speaker info (pdf) and slides (pdf)
Oliver Sinnen
Making contention scheduling aware of identical data
Abstract:
It has been widely recognised that task scheduling on parallel systems cannot
be accurate and efficient without a realistic view of the communication costs.
Considering the limited amount of communication resources and their bandwidth
is crucial. This has been addressed by the contention scheduling model where
the communications are scheduled on the network resources, for example the
ports in the one-port model. Such contention-aware scheduling is an extension
of the classical task scheduling model based on communication delays.
The problem, however, is that identical data items are not distinguishable in
the task graph. Under the classical model this was not an issue, but it is with
contention scheduling. This talk investigates the problem and presents
some preliminary results.
slides
Leonel Sousa
Cooperative Execution on Heterogeneous Multi-core Systems
Abstract:
The heterogeneity of parallel systems has increased at various levels. For example, in off-the-shelf desktop systems, not only can different processing elements be found, such as multi-core general-purpose processors (CPUs), Graphics Processing Units (GPUs), or even other accelerators based on Field Programmable Gate Arrays (FPGAs), but buses with different characteristics (e.g. PCI-Express, FSB) are also used to interconnect those processing elements. In this talk we mainly address the challenges that arise in accurately modeling such complex heterogeneous parallel systems, in order to take advantage of all available resources to cooperatively compute a single application. Beyond the classical problems of modeling and load balancing, we tackle more practical issues, such as how to accommodate the different programming models and tools adopted on heterogeneous systems. A collaborative execution environment (CHPS) is presented to efficiently perform parallel execution on heterogeneous desktop systems. Moreover, the possibility of exploiting task-level parallelism on this type of system is also discussed, where task scheduling needs not only to be communication-aware but also to rely on accurate performance modeling.
slides
Frédéric Suter
On Cluster Resource Allocation for Multiple Parallel Task Graphs
Abstract:
Many scientific applications can be structured as Parallel Task Graphs (PTGs), that is, graphs of data-parallel tasks. Adding data-parallelism to a task-parallel application provides opportunities for higher performance and scalability, but poses additional scheduling challenges. In this talk, I will present a study of the off-line scheduling of multiple PTGs on a single, homogeneous cluster. The objective is to optimize performance without compromising fairness among the PTGs.
I will detail the range of previously proposed scheduling algorithms applicable to this problem, from both the applied and the theoretical literature, and propose minor improvements where possible. I will also present an extensive evaluation of these algorithms in simulation, using both synthetic and real-world application configurations, with two different metrics for performance and one metric for fairness.
slides
Lamiel Toch and Jean-Marc Nicod
Online scheduling for moldable tasks in clusters
Abstract:
A moldable task is a parallel task whose runtime and surface cannot
increase when the number of allocated processors increases. In the
literature, moldable task scheduling is studied from an off-line
point of view. In this talk we tackle moldable tasks for cluster
batch scheduling. We present several algorithms and assess them with
different metrics.
Our approach is based on job deadlines. When a job arrives in the
waiting queue, the scheduler forecasts its placement using simple
first-come-first-served scheduling and assigns it a deadline. The main
idea is to use fewer resources than a job has requested in order to
schedule it sooner while respecting its deadline. In practice, cluster
node virtualization will be used to create "virtual nodes". When a
resource is released, jobs in the waiting queue are rescheduled,
taking into account the previously calculated deadlines. We use
heuristics to choose the best job to schedule with the available
resources and to allocate an appropriate number of processors to it.
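A minimal sketch of the shrinking step described above (the interface and the runtime law are hypothetical, for illustration only):

```python
def shrink_allocation(runtime, p_req, deadline, now):
    """Given a moldable job's runtime function runtime(p), its requested
    processor count, and the deadline obtained from the FCFS forecast,
    return the smallest allocation that still meets the deadline
    (falling back to the requested allocation if none does)."""
    for p in range(1, p_req + 1):        # fewer processors: job may start sooner
        if now + runtime(p) <= deadline:
            return p
    return p_req

# Toy runtime law (perfect speedup, illustrative): 60 time units of work.
rt = lambda p: 60 / p
print(shrink_allocation(rt, 8, deadline=30, now=0))   # -> 2 processors suffice
```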
slides
Denis Trystram
Distributed control scheduling
Abstract:
The new computing platforms are characterized by a large number of computational units.
Thus, their distributed nature highly influences the performance of scheduling algorithms.
In this talk, we will first investigate how classical scheduling algorithms behave in
the context of distributed control;
then, we will propose alternative solutions that guarantee better performance.
slides
Jon Weissman
Living on the Edge: Scheduling Edge Resources Across the Cloud
Abstract:
The standard model of Cloud Computing posits a collection of a few
largely centralized data-centers
where data and computing are consolidated, exploiting locality and
economies of scale. This localization,
however, introduces two potential problems: (1) it presumes the cost
of moving data into the Cloud can be amortized, and (2) it has limited
ability to exploit end-user locality. For applications that manipulate
large geographically-dispersed and dynamically changing datasets, or
that demand frequent response-sensitive end-user interaction, the
localized Cloud may be ineffective.
In this talk, we present a new vision of Cloud Computing that addresses
these shortcomings. We propose the use of edge computers in two novel ways.
For problem (1), we propose Nebulas, a Cloud comprised of edge
computers located across the Internet. We show how several classes of
applications may be better suited to this new Cloud model. For problem
(2), we describe how a proxy network comprised of edge computers can
accelerate applications that access one or more Clouds. In both cases,
we present
the resource management challenges associated with each model.
slides