This paper focuses on efficient parallel implementations of nested loops. Besides the well-studied uniform dependences, many scientific problems are expressed using loops with affine dependences (for instance, a computation at index point (i, j) that reads a value produced at (j, j)). In a parallel implementation, a data item is computed by a given processor (called the emission processor) and has to be sent to all the processors that use it (called the utilization processors). This can be done using two communication schemes:
The ideal broadcast communication scheme is the following: a given data item is computed and then sent to all its utilization processors, which use it right away. We call it true broadcast. For a given data item, a true broadcast can only be implemented if all the computations that use it are scheduled at the same time step. Therefore true broadcast imposes strong conditions on the schedule functions, as sketched after the two schemes below. For a problem characterized by several dependences, these conditions cannot always be satisfied simultaneously.
When these conditions are not satisfied for a given dependence, the utilization processors use the data at successive time steps. Even though pipelining could be used in this case, we still choose broadcast in order to reduce the communication cost. We call it anticipated broadcast. There is a memory cost related to it: each utilization processor must store the data until it finally uses it.
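As a minimal sketch of the true-broadcast condition (our notation, not the paper's), assume the value Y[z_0] read by iteration z is given through an affine dependence function z_0 = D z + d, and that the schedule is affine, t(z) = τ z + α, over an iteration domain 𝒟:

```latex
% Hedged sketch; D, d, \tau, \alpha and the iteration domain \mathcal{D}
% are our notation and are not taken from the paper.
\forall z, z' \in \mathcal{D}:\quad
  D z + d \;=\; D z' + d \;\Longrightarrow\; \tau z + \alpha \;=\; \tau z' + \alpha,
\qquad\text{i.e.}\qquad \tau(z - z') = 0 \ \text{whenever}\ D(z - z') = 0 .
```

Under these assumptions, one such condition per dependence must hold for a single schedule, which is why all of them cannot always be satisfied simultaneously.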
The objective of this work is to compute, for given schedule and allocation functions, the memory cost related to each dependence in the case of an anticipated broadcast.
In the case of an anticipated broadcast, the computed data must therefore be stored on each utilization processor from the time it is received until the time that processor uses it.
In order to evaluate this memory cost, we first have to compute the lifetime of a data item Y[z_0] on a given processor a: this set is a convex polyhedron parameterized by z_0, by a, and by the problem parameters p. The memory cost on processor a is then the maximal number of lifetime intervals that are simultaneously live, i.e., that overlap at some time step. The main result of the paper is the computation of this memory cost.
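To make the "maximal number of overlapping lifetime intervals" concrete, here is a small brute-force sketch on a toy, hypothetical example; it is not the parameterized, Ehrhart-polynomial-based computation developed in the paper, and the interval endpoints are invented for illustration:

```python
# Toy sketch: memory cost on one processor as the peak number of
# simultaneously live data items, given their lifetime intervals.

def memory_cost(lifetimes):
    """lifetimes: list of (receive_step, last_use_step) pairs, inclusive."""
    events = []
    for start, end in lifetimes:
        events.append((start, +1))      # data item arrives: memory grows
        events.append((end + 1, -1))    # after its last use: memory shrinks
    live = peak = 0
    for _, delta in sorted(events):     # sweep over time steps
        live += delta
        peak = max(peak, live)
    return peak

# Hypothetical anticipated broadcast: item k is received at step k and is
# last used at step n, so every earlier item is still stored at step n.
n = 6
print(memory_cost([(k, n) for k in range(1, n + 1)]))  # -> 6
```

The sweep simply counts how many items are stored at each time step; the paper obtains the same maximum symbolically, as a function of the problem parameters, instead of by enumeration.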
The paper first summarizes the classical framework for the parallelization of systems of affine recurrence equations. In particular, it recalls how to determine different space × time transformations. It then shows how to compute, for a given transformation, the memory cost using appropriate Ehrhart polynomials. Finally, it illustrates the approach on the Gaussian elimination problem using two different schedules.
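For readers unfamiliar with Ehrhart polynomials, a standard textbook example (not taken from the paper) is the count of integer points of a polytope parameterized by n:

```latex
% Standard example, not specific to this paper: counting the integer points
% of the triangle 1 <= i <= j <= n as a function of the parameter n.
\#\{(i,j)\in\mathbb{Z}^2 \mid 1 \le i \le j \le n\}
   \;=\; \sum_{j=1}^{n} j \;=\; \frac{n(n+1)}{2}.
% In general such counts are quasi-polynomials (polynomials with periodic
% coefficients); this is the kind of count the memory-cost evaluation relies on.
```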