Memory Cost due to Anticipated Broadcast

Vincent Loechner, Catherine Mongenet

Context and objectives

This paper focuses on efficient parallel implementations of nested loops. Besides the well-studied uniform dependences, many scientific problems are expressed using loops with affine dependences. In a parallel implementation, a data item is computed by a given processor (called the emission processor) and has to be sent to all the processors that use it (called the utilization processors). This can be done using two communication schemes:

- send the data item separately to each utilization processor, through point-to-point communications;
- broadcast the data item once to all the utilization processors.
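
As a small, hypothetical illustration (the loop nest and names below are not taken from the paper), the value X[i][i] computed once per row i is read by every later iteration (i, j) of that row, an affine dependence (i, j) -> (i, i). If iterations are distributed over the processors by their j coordinate, the processor computing X[i][i] is its emission processor and the processors owning the columns j > i are its utilization processors.

    /* Hypothetical sketch of a loop nest with an affine dependence
       (i, j) -> (i, i): X[i][i] is read by all iterations (i, j), j > i. */
    #define N 8
    double X[N][N];
    double f(int i)              { return (double)i; }
    double g(double a, double b) { return a + b; }

    void kernel(void) {
      for (int i = 0; i < N; i++) {
        X[i][i] = f(i);                      /* computed once: emission processor       */
        for (int j = i + 1; j < N; j++)
          X[i][j] = g(X[i][i], X[i][j-1]);   /* read by every column j: utilization
                                                processors; one broadcast suffices      */
      }
    }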

On a parallel architecture where broadcast is efficiently implemented, the second communication scheme should be chosen. This is particularly relevant when the network is a bus, as in a cluster of workstations connected by Ethernet for example. This approach is of course irrelevant on any parallel architecture where broadcast is simulated by a pipeline, such as systolic-like architectures.

Anticipated broadcast communication schemes

The ideal broadcast communication scheme is the following: a given data item is computed, then sent to all its utilization processors, which use it right away. We call it true broadcast. For a given data item, a true broadcast can only be implemented if all the computations that use it are scheduled at the same time step. True broadcast therefore imposes strong conditions on the schedule functions. For a problem characterized by several dependences, these conditions cannot always be satisfied simultaneously.
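
In notation not taken from the paper: if the dependence maps each utilization point z to the emission point d(z) of the data item it reads, and t is the schedule function, the true broadcast condition can be written as

    \forall z_1, z_2 : \; d(z_1) = d(z_2) \;\Rightarrow\; t(z_1) = t(z_2),

i.e. the schedule must be constant on the set of utilization points of any given data item.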

When they are not satisfied for a given dependence, the utilization processors use the data item at successive time steps. Even though pipelining could be used in this case, we still choose broadcast in order to reduce the communication cost. We call it anticipated broadcast. It comes with a memory cost: each utilization processor must store the data item until it finally uses it.

The objective of this work is to compute, for given schedule and allocation functions, the memory cost related to each dependence in the case of an anticipated broadcast.

In the case of an anticipated broadcast, the computed data item has to be stored on each of its utilization processors. Any processor must keep the data item from the instant it receives it until the instant it uses it for the last time. During this lifetime interval, it may receive other data items in the same way. All of them must be stored, and the total amount of required memory is the maximum number of such data items simultaneously stored on the processor.
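
As a minimal sketch (not the paper's symbolic evaluation, which is carried out analytically on parameterized polyhedra), the quantity can be made concrete as follows: given the lifetime intervals [receive, last use] of the data items received by one processor, the required memory is the maximum number of intervals that are alive at the same instant.

    /* Minimal sketch: memory needed by one processor, given the lifetime
       intervals of the data items it receives through anticipated broadcasts.
       The maximum overlap is always reached at some reception instant. */
    #include <stdio.h>

    typedef struct { int recv; int last_use; } Lifetime;

    static int max_simultaneous(const Lifetime *v, int n) {
      int best = 0;
      for (int i = 0; i < n; i++) {              /* candidate instant: v[i].recv     */
        int alive = 0;
        for (int j = 0; j < n; j++)
          if (v[j].recv <= v[i].recv && v[i].recv <= v[j].last_use)
            alive++;                             /* item j is stored at that instant */
        if (alive > best) best = alive;
      }
      return best;
    }

    int main(void) {
      Lifetime v[] = { {0, 4}, {1, 2}, {3, 6}, {4, 5} };
      printf("memory cost: %d\n", max_simultaneous(v, 4));   /* prints 3 */
      return 0;
    }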

Memory cost evaluation

The evaluation is conducted in the classical polytope model. The information we manipulate can be represented by parameterized convex polyhedra. One of the techniques used to extract information from such polyhedra is to count the number of integer points they contain. This counting is done symbolically using Ehrhart polynomials, which express this number as a function of the parameters.
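
A standard example, not specific to this paper: the number of integer points in the parameterized triangle \{ (i, j) \mid 0 \le i \le j \le n \} is given by the Ehrhart polynomial

    E(n) = \frac{(n+1)(n+2)}{2},

a polynomial in the size parameter n.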

In order to evaluate the memory cost, we first have to compute the lifetime of a data item Y[z_0] on a given processor a: this set is a convex polyhedron parameterized by z_0, a, and the problem parameters p. The memory cost on processor a is the maximum number of lifetime intervals that intersect at a given time step. The main result of the paper is the computation of this memory cost.
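
In notation not taken from the paper: writing L(z_0, a, p) for the lifetime interval of Y[z_0] on processor a, the memory cost of processor a can be expressed as

    M(a, p) = \max_{t} \; \#\{\, z_0 \mid t \in L(z_0, a, p) \,\},

the maximum, over the time steps t, of the number of data items whose lifetime on a contains t; each such cardinality is evaluated symbolically with Ehrhart polynomials.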

The paper first summarizes the classical framework for the parallelization of systems of affine recurrence equations. In particular, it recalls how to determine different space × time transformations. It then shows how to compute, for a given transformation, the memory cost using appropriate Ehrhart polynomials. Finally, it illustrates the approach on the Gaussian elimination problem using two different schedules.
