# Data-Aware Algorithms

Loris Marchal (and Olivier Beaumont), Fall 2020

In many today's computing systems, the cost, both in time and energy, of moving the data has become predominant compared to the cost of computing. Part of the solution for this problem can be found in algorithms: over the years, several computation models and algorithmic strategies have been proposed to optimize data movement instead of (or in addition to) the classical objective of computational complexity. The goal of this course is to present a variety of these solutions.

NB: you may check the webpage of this course for Fall 2019.

## Part 1: Modeling register allocations

In this first part, we will review studies which dates back to the 1960s, the pioneering work of Hong and King on I/O complexity, and their extensions.

• Register allocation on DAGs
• Pebble games
• I/O complexity, lower bounds

#### Other resources:

More information on the pebble games in the book "Models Of Computation" by John Savage (pdf available online): (black) pebble game in Chapter 10, red-blue pebble game and hierarchical memory pebble game in Chapter 11.

## Part 2: Cache-oblivious algorithms and to external memory models

In some cases, it is possible to design algorithms that minimize the data accesses whatever the size of the memory, or the cache, even when this size in unkown to the algorithm.

• External memory algorithms
• Cache oblivious algorithms
• Parallel external memory models
• Schedulers for multithreaded computations (parallel cache oblivious)

#### Other courses/resources:

• Lecture notes by Erik Demaine on Cache-Oblivious Algorithms and Data Structures at the 2002 EEF Summer School on Massive Data Sets
• Series of lectures at MIT by Erik Demaine, available on YouTube:
• Cormen et al., Introduction on Algorithms, Chap. 18: B-Trees

## Part 3: Algorithms for Big Data (by Olivier Beaumont)

Some data are so large that we have to drastically limit the data access when analising them, possibly with an impact on the accuracy of the result: it is sometimes better to have a quick good estimate than a precise result that takes forever to compute. Models have been proposed that either allow a simple linear scan of the data, and/or limit the temporary storage of the analysis.

• Data strem model
• Approximate counting
• Similarity search

## Part 4: Memory aware DAGs scheduling

Several scientific computations can be modeled as DAGs of tasks. The way the DAG is scheduled (traversed) influences the amount of memory needed for the computation.

• Scheduling algorithms for trees
• Extension to more general DAGs
• Extension to parallel scheduling

## Part 5: Communication-avoiding algorithms

In numerical linear algebra, some algorithms have been proposed who reach the lower bounds on data movement. We will study some of these algorithms, both for the sequential case (I/O minimization) and the parallel case (communication minimization).

• Principle of communication-avoiding algorithms
• Communication-avoiding matrix multiplication
• Extension to other linear algebra operations

## Evaluation

The evaluation will consist in two parts:

• 2 Homeworks: exercices given in class
• Research project per group of 1/2 students:
• Paper+Topic to choose in the following list
• Outcome:
• Report (5/10 pages)
• Presentation in class
• If topic includes programming:
• Implement the algorithm
• Perform interesting experiments
• Report the important points/difficulties when coding it
• Report the results and draw some conclusions
• If topic includes theoretical work (proof)
• Read and understand the proof
• Present the proof, possibly skipping some lemmas and intermediate points
• In both cases:
• Present the context and motivation of the paper
• Summarize the whole paper and its contributions

The presentations will take place on November 5 and 12, and the reports are due for November 2. Presentations can be held remotely (throught BBB) if needed. If the research project includes coding, the code must be sent with the report in a archive (zip, tgz or available online).

### Schedule of the presentations

Each group will do a presentation of 15-20 minutes, followed by 5-10 minutes of questions/comments. Here is the schedule, send me a mail to register.

November 5:

• 10:00: (free)
• 10:30: (free)
• 11:00: V. Pasquale (project 9)
• 11:30: (free)
• 13:30: (free)
• 14:00: (free)
• 14:30: (free)
• 15:00: (free)
• 15:30: (free)

November 12:

• 10:00: H. Thievenaz & V. Siguret (project 5)
• 10:30: Q. Corradi (project 8)
• 11:00: J. Bertrand & N. Champseix (project 1)
• 11:30: B. Allorant & P.-M. Marcille (project 10)
• 13:00: (free)
• 13:30: O. Cedron (project 4)
• 14:00: (free)
• 14:30: T. Marette & D. El Zein (project 2)
• 15:00: (free)
• 15:30: (free)

### Research papers for the project

Pebbles games and extensions:

• Project 2: (assigned to D. El Zein and T. Marette) Move Rules and Trade-Offs in the Pebble Game by P. van Emde Boas, J. van Leeuwen (Theoretical Computer Science 1979)
• Topic: proof of Theorem D (removing one pebble leads to exponential time)

External memory, cache oblivious algorithms:

Work stealing and other schedulers:

• Project 9: (assigned to V. Pasquale) Hierarchical Work-Stealing by J.-N. Quintin and F. Wagner (Euro-Par 2010)
• Topic: code a (simple, centralized) work-stealing simulator and test the data locality of several work-stealing variants