This paper focuses on efficient parallel implementations of nested loops. Besides the well-studied uniform dependences, many scientific problems are expressed using loops with affine dependences. In a parallel implementation, a data item is computed by a given processor (called the emission processor) and has to be sent to all the processors that use it (called the utilization processors). This can be done using two communication schemes:

- pipeline: the data is propagated from neighbor to neighbor among the utilization processors
- broadcast: the data is simultaneously sent to all the utilization processors

The ideal broadcast communication scheme is the following: a given data item is computed, then sent to all its utilization processors, which use it right away. We call this scheme true broadcast. For a given data item, a true broadcast can only be implemented if all the computations that use it are scheduled at the same time step. True broadcast therefore imposes strong conditions on the schedule functions. For a problem characterized by several dependences, these conditions cannot always be satisfied simultaneously.
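To make this condition concrete, here is one way to state it (a sketch in notation of our own choosing; the paper's exact formulation may differ). Assume an affine schedule t(z) = τz + α and a dependence whose affine index function I maps a computation point z to the data item Y[I(z)] it reads, so that the utilization set of Y[z_0] is U(z_0) = {z : I(z) = z_0}:

```latex
% True broadcast for this dependence requires all uses of Y[z_0]
% to be scheduled at the same time step:
\forall z, z' \in U(z_0): \quad t(z) = t(z')
\;\Longleftrightarrow\; \tau\,(z - z') = 0 .
% Since z - z' ranges over the kernel of the linear part A of I,
% this amounts to requiring that \tau vanish on \ker A.
```

With several dependences, each contributes such a constraint on τ, which is why they cannot always be satisfied simultaneously.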

When they are not satisfied for a given dependence, the utilization processors use the data at successive time steps. Even though a pipeline could be used in this case, we choose to keep a broadcast in order to reduce the communication cost. We call this scheme anticipated broadcast. It entails a memory cost: each utilization processor must store the data until it finally uses it.

The objective of this work is to compute, for given schedule and allocation functions, the memory cost related to each dependence in case of an anticipated broadcast.

In case of an anticipated broadcast, the computed data can be stored:

- on the utilization processors: it is sent immediately after its computation to all the utilization processors.
- on the emission processor, between the computation and the first use of the data: the data is then sent to all the utilization processors immediately before its first use.
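The two storage schemes can be contrasted on a single data item. The following sketch is our own illustration (the names `occupancy`, `use`, `t_c` are hypothetical, not from the paper): it returns, for each scheme, the time interval during which each processor must hold the item.

```python
# Sketch (our illustration, not the paper's algorithm): storage intervals
# for one data item under an anticipated broadcast.
# The item is computed at time t_c; processor p uses it at time use[p].

def occupancy(t_c, use):
    """Return (scheme_a, scheme_b): per-processor storage intervals.

    Scheme A: broadcast right after computation -> each utilization
              processor stores the item from t_c to its own use time.
    Scheme B: the emission processor stores the item from t_c to the
              first use, then broadcasts -> each utilization processor
              stores it only from the first use to its own use time.
    """
    first_use = min(use.values())
    scheme_a = {p: (t_c, t) for p, t in use.items()}
    scheme_b = {"emission": (t_c, first_use)}
    scheme_b.update({p: (first_use, t) for p, t in use.items()})
    return scheme_a, scheme_b

# Toy example: item computed at t=0, used by three processors at t=3, 4, 5.
a, b = occupancy(0, {"P1": 3, "P2": 4, "P3": 5})
print(a["P3"])          # scheme A: P3 holds the item over (0, 5)
print(b["emission"])    # scheme B: emission processor holds it over (0, 3)
```

Scheme B shifts part of the storage burden onto the emission processor, at the price of a broadcast that must be triggered later, at the first use.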

In order to evaluate the memory cost, we first have to compute the lifetime of a data item Y[z_0] on a given processor a: this set is a convex polyhedron parameterized by z_0, a, and the problem parameters p. The memory cost on processor a is the maximal number of lifetime intervals that overlap at the same time step. The main result of the paper is the computation of this memory cost.
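For intuition, the quantity being computed is the maximal number of simultaneously live intervals. On explicit interval data this can be obtained by a standard sweep over endpoints; the sketch below is our own illustration (the paper instead evaluates this count symbolically, on parameterized polyhedra).

```python
# Sketch (our illustration): memory cost on one processor as the maximal
# number of lifetime intervals alive at the same time step, computed by
# a standard endpoint sweep.

def max_overlap(intervals):
    """intervals: list of (start, end) lifetimes, end inclusive."""
    events = []
    for s, e in intervals:
        events.append((s, 1))       # interval becomes alive
        events.append((e + 1, -1))  # interval dies after its last use
    events.sort()
    alive = best = 0
    for _, delta in events:
        alive += delta
        best = max(best, alive)
    return best

# Three lifetimes: the first two overlap during time steps 2..3,
# the third is disjoint, so two memory cells suffice.
print(max_overlap([(0, 3), (2, 5), (7, 9)]))  # -> 2
```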

The paper first summarizes the classical framework for the parallelization of systems of affine recurrence equations. In particular, it recalls how to determine different space × time transformations. It then shows how to compute, for a given transformation, the memory cost using appropriate Ehrhart polynomials. The approach is illustrated on the Gaussian elimination problem using two different schedules.
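The key property behind Ehrhart polynomials is that the number of integer points in a polytope dilated by an integer parameter n is a polynomial (or quasi-polynomial) in n. The following sketch is our own toy illustration of that property, not the paper's computation: brute-force counting over the standard triangle matches its Ehrhart polynomial (n + 1)(n + 2)/2.

```python
# Sketch (our illustration of the Ehrhart-polynomial idea): count the
# integer points in the dilated triangle {(x, y) : x >= 0, y >= 0,
# x + y <= n} and check the count against the closed-form polynomial.

def count_points(n):
    """Brute-force count of lattice points in the n-dilated triangle."""
    return sum(1 for x in range(n + 1)
                 for y in range(n + 1)
                 if x + y <= n)

for n in range(6):
    # Ehrhart polynomial of the standard triangle: (n + 1)(n + 2) / 2
    assert count_points(n) == (n + 1) * (n + 2) // 2
print(count_points(4))  # -> 15
```

In the paper's setting, the counted polytope is the parameterized set of lifetime intervals alive at a given time step, so the memory cost comes out as a polynomial in the problem parameters.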

- Ph. Clauss, Counting Solutions to Linear and Nonlinear Constraints through Ehrhart polynomials: Applications to Analyze and Transform Scientific Programs, 10th ACM Int. Conf. on Supercomputing, Philadelphia, 1996.
- C. Lengauer, Loop parallelization in the polytope model, CONCUR'93.
- V. Loechner, C. Mongenet, Solutions to the Communication Minimization Problem for Affine Recurrence Equations, Europar'97.
- C. Mongenet, Ph. Clauss, G.-R. Perrin, Geometrical tools to map systems of affine recurrence equations on regular arrays, Acta Informatica, 1994.