Cours de Master 2012

Cours de Master 2012 -- 2012 Master Course

Compilation avancée et optimisation de programmes
(Advanced compilation and program optimizations)

Where, when, who?

Courses take place in room B2, on some (not all) Monday and Wednesday, from 4pm to 6pm.
Current students are:

Gabriel Radanne 10

Thomas Lambert 11

~~Pierre-Alain Scribot 1~~

Silviu-Ioan Filip 11

~~Vincent Despré 1~~

Jean-Florent Raymond 10

Florent Robic 10

~~Théo Dzierbicki 1~~

Content of the course

Master course presentation
Friday, Sept. 7: slides.

Short (5') presentations, for all students, among all courses.

1st course: Introduction
Wednesday, Sept. 12: slides.

General statements/views on compilation and code optimizations

Why loops are important

Examples of architectural peculiarities

2nd course: Software pipelining
Monday, Sept. 17: slides.

Model: graph, delays, dependence distances, average cycle time

Algorithm for optimal schedule without resource constraints

Example and sketch of modulo scheduling heuristic

DAG, list scheduling, and Graham's upper bound.

Retiming and loop compaction: introduction

For more details, you can take a look at:
Gondran and Minoux. "Graphs and Algorithms". Wiley. 1984.
Rau. "Iterated Modulo Scheduling". International Journal of Parallel Programming, 24(1):3-64, 1996.
Darte, Vivien, and Robert. "Scheduling and Automatic Parallelization (Chap. 1 & 3)". Birkhäuser. 2000.

3rd course: Retiming, loop compaction, loop fusion
Wednesday, Sept. 20: slides. Note: slides 28-40 skipped.

Retiming and Leiserson-Saxe algorithm for clock period minimization

Graham's upper bound adapted for loop compaction and software pipelining

Reminders on linear programming

Out-of-kilter algorithm for minimization, by retiming, of the number of intra-loop dependences

Impact of retiming on memory usage (introduced but skipped)

Loop fusion problems without retiming (next time, with retiming)

For more details, you can take a look at:
Gondran and Minoux. "Graphs and Algorithms". Wiley. 1984.
Gasperoni and Schwiegelshohn. "Generating Close to Optimum Loop Schedules on Parallel Processors". Parallel Processing Letters, 4(4):391-403, 1994.
Darte and Huard. "Loop Shifting for Loop Compaction". International Journal of Parallel Programming. 28(5):499-534, 2000.
Kennedy and McKinley. "Typed Fusion with Applications to Parallel and Sequential Code Generation". Tech. Report 94646, Rice University, 1994.
Darte. "On the Complexity of Loop Fusion". Parallel Computing, 26(9):1175-1193, 2000.

4th course: End of previous course + exercices
Monday, Oct. 1: slides and exercises (to finish as home work).

End of previous course: Fusion with retiming.

Exercise 1, question 5.

Exercise 2, question 1 and question 2 (beginning).

5th course: Back-end code representations + exercices
Wednesday, Oct. 3: slides.

Control-flow graphs and dominance.

Loop nesting forest: algorithms of Tarjan and Havlak (modified by Ramalingam).

Reducible and irreducible CFGs.

Exercise 4, questions 1, 2, 3 (beginning).

For more details, you can take a look at:
Ramalingam. "On loops, dominators, and dominance frontiers". ACM Transactions on Programming Languages and Systems, 24(5):455-490, 2002.
Ramalingam. "Identifying Loops in Almost Linear Time". ACM Transactions on Programming Languages and Systems, 21(2):175-188, 1999.

6th course: Static single assignment form
Wednesday, Oct. 10: slides.

Static single assignment: definition.

Construction with dominance frontier. Example from Nelson Amaral's slides.

Translation out of SSA. Correctness, coalescing, interferences with values.

For more details, you can take a look at:
Cytron, Ferrante, Rosen, Wegman, Zadeck. "Efficiently computing static single assignment form and the control dependence graph". ACM Transactions on Programming Languages and Systems, 13(4):451-490, 1991.
Briggs, Cooper, Harvey, Simpson. "Practical improvements to the construction and destruction of static single assignment form". Software -- Practice and Experience, 28(8):859-881, 1998.
Sreedhar, Ju, Gillies, Santhanam. "Translating out of static single assignment form". In Static Analysis Symposium (SAS'99), Italy, 1999, pp. 194-204.
Boissinot, Darte, Dupont de Dinechin, Guillon, Rastello. "Revisiting out-of-SSA translation for correctness, code quality, and efficiency". International Symposium on Code Generation and Optimization (CGO'09), pp. 114-125, IEEE, 2009.

7th course: SSA (end) and register allocation (beginning)
Wednesday, Oct. 17: slides SSA et slides register allocation.

Going out of SSA: fast version.

Properties of SSA, in particular liveness.

For more details, you can take a look at:
Chaitin, Auslander, Chandra, Cocke, Hopkins, Markstein. "Register allocation via coloring". Computer Languages, 6:47-57, 1981.
Chaitin. "Register allocation & spilling via graph coloring". ACM SIGPLAN Symposium on Compiler Construction, SIGPLAN Notices, 17(6):98-105, 1982.
George and Appel. "Iterated register coalescing". ACM Transactions on Programming Languages and Systems, 18(3):300-324, 1996.

8th course: register allocation (end) and polyhedral optimizations (beginning)
Monday, Oct. 22: slides register allocation et slides polyhedral optimizations.

Tracing the NP-completeness of register allocation.

Introduction to polyhedral optimizations.

Decomposition of Karp, Miller, and Winograd.

For more details, you can take a look at:
Bouchez, Darte, Guillon, Rastello. "Register allocation: What does the NP-completeness proof of Chaitin et al. really prove? Or revisiting register allocation: why and how". International Workshop on Languages and Compilers for Parallel Computing (LCPC'06), LNCS 4382, pp. 283-298, 2007.
Karp, Miller, Winograd. "The organization of computations for uniform recurrence equations". Journal of the ACM, 14(3):563--590, 1967.
Darte, Vivien. "Revisiting the Decomposition of Karp, Miller, and Winograd". Parallel Processing Letters 5:551-562, 1995.
Darte, Vivien, and Robert. "Scheduling and Automatic Parallelization (Chap. 4)". Birkhäuser. 2000.

9th course: applications of KMW decomposition
Tuesday, Oct. 23: slides.

Catalog of loop transformations.

Detection of parallelism in nested DO loops.

Link with termination of while loops.

For more details, you can take a look at:
Allen, Kennedy. "Optimizing Compilers for Modern Architectures: A Dependence-based Approach". Morgan Kaufman, 2002.
Darte, Vivien, and Robert. "Scheduling and Automatic Parallelization (Chap. 5)". Birkhäuser. 2000.
Allen, Callahan, Kennedy. "Automatic decomposition of scientific programs for parallel execution", Conference on Principles of Parallel Languages, pp. 63-76, 1987.
Feautrier. "Dataflow analysis of array and scalar references". International Journal of Parallel programming, 20(1):23-53, 1991.
Feautrier. "Some efficient solutions to the affine scheduling problem. Part II, multidimensional time". International Journal of Parallel Programming 21(6): 389:420, 1992.
Alias, Darte, Feautrier, Gonnord. "Multi-dimensional Rankings, Program Termination, and Complexity Bounds of Flowchart Programs". In Static Analysis Symposium (SAS'10), pp. 117-133, 2010.
Darte. "Optimal Parallelism Detection in Nested Loops". Encyclopedia of Parallel Programming. Springer, 2011.
and many more...

10th course: static analysis for code optimizations (by Laure Gonnord)
Monday, Nov. 12: slides.

Data-flow analysis.

Abstract interpretation.

Linear relation analysis.

For more details, you can take a look at:
Nielson, Nielson, Hankin, "Principles of Program Analysis", Springer, 2004.
Cousot, Halbwachs. "Automatic discovery of linear restraints among variables of a program", POPL 1978.
Monniaux, Gonnord. "Using bounded model checking to focus fixpoint iterations". SAS 2011.
Halbwachs, Proy, Raymond. "Verification of linear hybrid systems by means of convex approximations", SAS 1994.
Henry, Monniaux, Moy. "PAGAI: a path sensitive static analyzer". In Bertrand Jeannet, editor, Tools for Automatic Program Analysis (TAPAS), 2012.

11th course: kernel offloading and optimized communications
Tuesday, Nov. 20: slides.

Kernel offloading. Application to high-level synthesis.

Double-buffering execution style with loop tiling.

Optimized communication coalescing with inter-tile reuse.

For more details, you can take a look at:
Darte, Schreiber, Villard. "Lattice-Based Memory Allocation". IEEE Transactions on Computers, 54(10):1242-1257, 2005.
Alias, Darte, Plesco. "Optimizing DDR-SDRAM Communications at C-Level for Automatically-Generated Hardware Accelerators. An Experience with the Altera C2H HLS Tool". International Conference on Application-specific Systems, Architectures and Processors (ASAP'10), 2010.
Alias, Darte, Plesco. "Optimizing Remote Accesses for Offloaded Kernels: Application to High-Level Synthesis for FPGA". 2nd International Workshop on Polyhedral Compilation Techniques (IMPACT'12), 2012.
Chavarría-Miranda, Mellor-Crummey. "Effective communication coalescing for data-parallel applications". Principles and Practice of Parallel Programming (PPoPP'05), 2005.
Issenin, Borckmeyer, Miranda, Dutt. "DRDU: A data reuse analysis technique for efficient scratchpad memory management". ACM TODAES 2007.
Baskaran, Bondhugula, Krishnam, Ramanujam, Rountev, Sadayappan. "Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories". Principles and Practice of Parallel Programming (PPoPP'08), 2008.
Leung, Vasilach, Meister, Baskaran, Wohlford, Bastoul, Lethin. "A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction". Workshop on General Purpose Processing on Graphics Processing Units (GPGPU'10), 2010.
Guelton, Amini, Creusillet. "Beyond do loops: data transfer generation with convex array regions". International Workshop on Languages and Compilers for Parallel Computing (LCPC'12), 2012.
and many more...

12th session: readings. Give me your preferences!

Di, Ye, Su, Sui, Xue. "Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Parallelism on GPUs" . International Conference on Parallel Processing (ICPP'12), 2012. Attributed to Silviu
Zhang, Wu, Xue. "An Efficient Heuristic for Instruction Scheduling on Clustered VLIW Processors". International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'11), 2011. Attributed to Thomas
Bondhugula, Baskaran, Krishnamoorthy, Ramanujam, Rountev, Sadayappan. "Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model". International Conference on Compiler Construction (CC'08), 2008.
Wimmer, Franz. "Linear Scan Register Allocation on SSA Form". International Symposium on Code Generation and Optimization (CGO'10), 2010. Attributed to Florent
Gulwani, Mehra, Chilimbi. "SPEED: precise and efficient static estimation of program computational complexity". Principles of Programming Languages (POPL'09), 2009. Attributed to Gabriel
Hormati, Choi, Kudlur, Rabbah, Mudge, Mahlke. "Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures". Parallel Architectures and Compilation Techniques (PACT'09), 2009. Attributed to Jean-Florent