Data-Aware Algorithms

Loris Marchal (and Olivier Beaumont), Fall 2019


In many today's computing systems, the cost, both in time and energy, of moving the data has become predominant compared to the cost of computing. Part of the solution for this problem can be found in algorithms: over the years, several computation models and algorithmic strategies have been proposed to optimize data movement instead of (or in addition to) the classical objective of computational complexity. The goal of this course is to present a variety of these solutions.

Part 1: Modeling register allocations

In this first part, we will review studies which dates back to the 1960s, the pioneering work of Hong and King on I/O complexity, and their extensions.

  • Register allocation on DAGs
  • Pebble games
  • I/O complexity, lower bounds
  • Space-time tradeoffs

Other resources:

More information on the pebble games in the book "Models Of Computation" by John Savage (pdf available online): (black) pebble game in Chapter 10, red-blue pebble game and hierarchical memory pebble game in Chapter 11.

Some articles where results are taken from:

Part 2: Cache-oblivious algorithms and to external memory models

In some cases, it is possible to design algorithms that minimize the data accesses whatever the size of the memory, or the cache, even when this size in unkown to the algorithm.

  • External memory algorithms
  • Cache oblivious algorithms
  • Parallel external memory models
  • Schedulers for multithreaded computations (parallel cache oblivious)

Other courses/resources:

Some articles where results are taken from:

Part 3: Memory aware DAGs scheduling

Several scientific computations can be modeled as DAGs of tasks. The way the DAG is scheduled (traversed) influences the amount of memory needed for the computation.

  • Scheduling algorithms for trees
  • Extension to more general DAGs
  • Extension to parallel scheduling

Some articles where results are taken from:

Part 4: Communication-avoiding algorithms

In numerical linear algebra, some algorithms have been proposed who reach the lower bounds on data movement. We will study some of these algorithms, both for the sequential case (I/O minimization) and the parallel case (communication minimization).

  • Principle of communication-avoiding algorithms
  • Communication-avoiding matrix multiplication
  • Extension to other linear algebra operations

Available material:

Other courses/resources:

Some articles where results are taken from:

Part 5: Algorithms for Big Data (by Olivier Beaumont)

Some data are so large that we have to drastically limit the data access when analising them, possibly with an impact on the accuracy of the result: it is sometimes better to have a quick good estimate than a precise result that takes forever to compute. Models have been proposed that either allow a simple linear scan of the data, and/or limit the temporary storage of the analysis.

  • Data strem model
  • Approximate counting
  • Similarity search

Other resources:

Evaluation

The evaluation will consist in two parts:

  • Written evaluations on the models and algorithms seen during classes (two evaluations of ~1.5h each).
  • Research project per group of 1/2 students:
    • Read one research paper (given)
    • Explain research question and algorithmic solution(s)
    • Contribute to the experimental validation!
      • Code some algorithm (100 lines of python is OK)
      • Find or generate relevant dataset
      • Plot/print relevant results
      • Draw (partial) conclusions
    • Outcome:
      • Report (5/10 pages) due for January 5
      • Presentation in class (January 8 or 15)

Research papers for the project

Pebbles games and extensions:

External memory, cache oblivious algorithms:

Work stealing and other schedulers:

Task graph scheduling:

Randomized linear algebra:

Recommendation system:

Author: Loris Marchal

Created: 2019-12-09 Mon 11:31

Emacs 25.3.50.1 (Org mode 8.2.10)

Validate