Cours de Master 2012 -- 2012 Master Course
Compilation avancée et optimisation de programmes
(Advanced compilation and program optimizations)
Where, when, who?
Courses take place in room B2, on some (not
all) Monday and Wednesday, from 4pm to 6pm.
Current students are:
Gabriel Radanne 10
Thomas Lambert 11
Pierre-Alain Scribot 1
Silviu-Ioan Filip 11
Vincent Despré 1
Jean-Florent Raymond 10
Florent Robic 10
Théo Dzierbicki 1
Content of the course
Master course presentation
Friday, Sept. 7: slides.
Short (5') presentations, for all students, among all courses.
1st course: Introduction
Wednesday, Sept. 12: slides.
General statements/views on compilation and code optimizations
Why loops are important
Examples of architectural peculiarities
2nd course: Software pipelining
Monday, Sept. 17: slides.
Model: graph, delays, dependence distances, average cycle time
Algorithm for optimal schedule without resource constraints
Example and sketch of modulo scheduling heuristic
DAG, list scheduling, and Graham's upper bound.
Retiming and loop compaction: introduction
For more details, you can take a look at:
Gondran and Minoux. "Graphs and Algorithms". Wiley. 1984.
Rau. "Iterated Modulo Scheduling". International Journal of Parallel Programming, 24(1):3-64, 1996.
Darte, Vivien, and Robert. "Scheduling and Automatic Parallelization (Chap. 1 & 3)". Birkhäuser. 2000.
3rd course: Retiming, loop compaction, loop fusion
Wednesday,
Sept. 20: slides. Note: slides
28-40 skipped.
Retiming and Leiserson-Saxe algorithm for clock period minimization
Graham's upper bound adapted for loop compaction and software pipelining
Reminders on linear programming
Out-of-kilter algorithm for minimization, by retiming, of the number of intra-loop dependences
Impact of retiming on memory usage (introduced but skipped)
Loop fusion problems without retiming (next time, with retiming)
For more details, you can take a look at:
Gondran and Minoux. "Graphs and Algorithms". Wiley. 1984.
Gasperoni and Schwiegelshohn. "Generating Close to Optimum Loop Schedules on Parallel Processors". Parallel Processing Letters, 4(4):391-403, 1994.
Darte and Huard. "Loop Shifting for Loop Compaction". International Journal of Parallel Programming. 28(5):499-534, 2000.
Kennedy and McKinley. "Typed Fusion with Applications to Parallel and Sequential Code Generation". Tech. Report 94646, Rice University, 1994.
Darte. "On the Complexity of Loop Fusion". Parallel Computing, 26(9):1175-1193, 2000.
4th course: End of previous course + exercices
Monday, Oct. 1: slides and
exercises (to finish as home work).
End of previous course: Fusion with retiming.
Exercise 1, question 5.
Exercise 2, question 1 and question 2 (beginning).
5th course: Back-end code representations + exercices
Wednesday, Oct. 3: slides.
Control-flow graphs and dominance.
Loop nesting forest: algorithms of Tarjan and Havlak (modified by Ramalingam).
Reducible and irreducible CFGs.
Exercise 4, questions 1, 2, 3 (beginning).
For more details, you can take a look at:
Ramalingam. "On loops, dominators, and dominance frontiers". ACM Transactions on Programming Languages and Systems, 24(5):455-490, 2002.
Ramalingam. "Identifying Loops in Almost Linear Time". ACM Transactions on Programming Languages and Systems, 21(2):175-188, 1999.
6th course: Static single assignment form
Wednesday, Oct. 10: slides.
Static single assignment: definition.
Construction with dominance frontier. Example from Nelson
Amaral's slides.
Translation out of SSA. Correctness, coalescing, interferences with values.
For more details, you can take a look at:
Cytron, Ferrante, Rosen, Wegman, Zadeck. "Efficiently computing static single assignment form and the control dependence graph". ACM Transactions on Programming Languages and Systems, 13(4):451-490, 1991.
Briggs, Cooper, Harvey, Simpson. "Practical improvements to the construction and destruction of static single assignment form". Software -- Practice and Experience, 28(8):859-881, 1998.
Sreedhar, Ju, Gillies, Santhanam. "Translating out of static single assignment form". In Static Analysis Symposium (SAS'99), Italy, 1999, pp. 194-204.
Boissinot, Darte, Dupont de Dinechin, Guillon, Rastello. "Revisiting out-of-SSA translation for correctness, code quality, and efficiency". International Symposium on Code Generation and Optimization (CGO'09), pp. 114-125, IEEE, 2009.
7th course: SSA (end) and register allocation (beginning)
Wednesday, Oct. 17: slides SSA et slides register allocation.
Going out of SSA: fast version.
Properties of SSA, in particular liveness.
Register allocation: problem statement.
Register allocation: iterated register coalescing (with some slides from Nelson Amaral).
For more details, you can take a look at:
Chaitin, Auslander, Chandra, Cocke, Hopkins, Markstein. "Register
allocation via coloring". Computer Languages, 6:47-57, 1981.
Chaitin. "Register allocation & spilling via graph coloring". ACM
SIGPLAN Symposium on Compiler Construction, SIGPLAN Notices,
17(6):98-105, 1982.
George and Appel. "Iterated register coalescing". ACM
Transactions on Programming Languages and Systems, 18(3):300-324,
1996.
8th course: register allocation (end) and polyhedral optimizations (beginning)
Monday, Oct. 22: slides register allocation et slides polyhedral optimizations.
Tracing the NP-completeness of register allocation.
Introduction to polyhedral optimizations.
Decomposition of Karp, Miller, and Winograd.
For more details, you can take a look at:
Bouchez, Darte, Guillon, Rastello. "Register allocation: What
does the NP-completeness proof of Chaitin et al. really prove? Or
revisiting register allocation: why and how". International Workshop
on Languages and Compilers for Parallel Computing (LCPC'06), LNCS
4382, pp. 283-298, 2007.
Karp, Miller, Winograd. "The organization of computations for uniform recurrence equations". Journal of the ACM, 14(3):563--590, 1967.
Darte, Vivien. "Revisiting the Decomposition of Karp, Miller, and
Winograd". Parallel Processing Letters 5:551-562, 1995.
Darte, Vivien, and Robert. "Scheduling and Automatic Parallelization (Chap. 4)". Birkhäuser. 2000.
9th course: applications of KMW decomposition
Tuesday, Oct. 23: slides.
Catalog of loop transformations.
Detection of parallelism in nested DO loops.
Link with termination of while loops.
For more details, you can take a look at:
Allen, Kennedy. "Optimizing Compilers for Modern Architectures: A Dependence-based Approach". Morgan Kaufman, 2002.
Darte, Vivien, and Robert. "Scheduling and Automatic Parallelization (Chap. 5)". Birkhäuser. 2000.
Allen, Callahan, Kennedy. "Automatic decomposition of scientific
programs for parallel execution", Conference on Principles of Parallel
Languages, pp. 63-76, 1987.
Feautrier. "Dataflow analysis of array and scalar references". International Journal of Parallel programming, 20(1):23-53, 1991.
Feautrier. "Some efficient solutions to the affine scheduling problem. Part II, multidimensional time". International Journal of Parallel Programming 21(6): 389:420, 1992.
Alias, Darte, Feautrier, Gonnord. "Multi-dimensional Rankings,
Program Termination, and Complexity Bounds of Flowchart
Programs". In Static Analysis Symposium (SAS'10), pp. 117-133, 2010.
Darte. "Optimal Parallelism Detection in Nested Loops". Encyclopedia of Parallel Programming. Springer, 2011.
and many more...
10th course: static analysis for code optimizations (by Laure Gonnord)
Monday, Nov. 12: slides.
Data-flow analysis.
Abstract interpretation.
Linear relation analysis.
For more details, you can take a look at:
Nielson, Nielson, Hankin, "Principles of Program Analysis", Springer, 2004.
Cousot, Halbwachs. "Automatic discovery of linear restraints among
variables of a program", POPL 1978.
Monniaux, Gonnord. "Using bounded model checking to focus fixpoint
iterations". SAS 2011.
Halbwachs, Proy, Raymond. "Verification of linear hybrid systems by means
of convex approximations", SAS 1994.
Henry, Monniaux, Moy. "PAGAI: a path sensitive static analyzer". In
Bertrand Jeannet, editor, Tools for Automatic Program Analysis (TAPAS), 2012.
11th course: kernel offloading and optimized communications
Tuesday, Nov. 20: slides.
Kernel offloading. Application to high-level synthesis.
Double-buffering execution style with loop tiling.
Optimized communication coalescing with inter-tile reuse.
For more details, you can take a look at:
Darte, Schreiber, Villard. "Lattice-Based Memory Allocation". IEEE Transactions on Computers, 54(10):1242-1257, 2005.
Alias, Darte, Plesco. "Optimizing DDR-SDRAM Communications at
C-Level for Automatically-Generated Hardware Accelerators. An
Experience with the Altera C2H HLS Tool". International Conference on
Application-specific Systems, Architectures and Processors (ASAP'10),
2010.
Alias, Darte, Plesco. "Optimizing Remote Accesses for Offloaded
Kernels: Application to High-Level Synthesis for FPGA". 2nd
International Workshop on Polyhedral Compilation Techniques
(IMPACT'12), 2012.
Chavarría-Miranda, Mellor-Crummey. "Effective communication
coalescing for data-parallel applications". Principles and Practice of
Parallel Programming (PPoPP'05), 2005.
Issenin, Borckmeyer, Miranda, Dutt. "DRDU: A data reuse analysis
technique for efficient scratchpad memory management". ACM TODAES
2007.
Baskaran, Bondhugula, Krishnam, Ramanujam, Rountev,
Sadayappan. "Automatic data movement and computation mapping for
multi-level parallel architectures with explicitly managed
memories". Principles and Practice of Parallel Programming (PPoPP'08),
2008.
Leung, Vasilach, Meister, Baskaran, Wohlford, Bastoul, Lethin. "A
mapping path for multi-GPGPU accelerated computers from a portable
high level programming abstraction". Workshop on General Purpose
Processing on Graphics Processing Units (GPGPU'10), 2010.
Guelton, Amini, Creusillet. "Beyond do loops: data transfer
generation with convex array regions". International Workshop on
Languages and Compilers for Parallel Computing (LCPC'12), 2012.
and many more...
12th session: readings. Give me your preferences!
- Di, Ye, Su, Sui, Xue. "Automatic Parallelization of Tiled Loop
Nests with Enhanced Fine-Grained Parallelism on GPUs" . International
Conference on Parallel Processing (ICPP'12),
2012. Attributed to Silviu
- Zhang, Wu, Xue. "An Efficient Heuristic for Instruction
Scheduling on Clustered VLIW Processors". International Conference on
Compilers, Architecture, and Synthesis for Embedded Systems
(CASES'11),
2011. Attributed to Thomas
- Bondhugula, Baskaran, Krishnamoorthy, Ramanujam, Rountev,
Sadayappan.
"Automatic Transformations for Communication-Minimized Parallelization
and Locality Optimization in the Polyhedral Model". International
Conference on Compiler Construction (CC'08), 2008.
- Wimmer,
Franz.
"Linear Scan Register Allocation on SSA Form". International
Symposium on Code Generation and Optimization (CGO'10), 2010. Attributed to Florent
- Gulwani, Mehra,
Chilimbi. "SPEED:
precise and efficient static estimation of program computational
complexity". Principles of Programming Languages (POPL'09), 2009. Attributed to Gabriel
- Hormati, Choi, Kudlur, Rabbah, Mudge,
Mahlke.
"Flextream: Adaptive Compilation of Streaming Applications for
Heterogeneous Architectures". Parallel Architectures and
Compilation Techniques (PACT'09), 2009. Attributed to Jean-Florent