This paper presents a new parallelization method for reductions of arrays with subscripted subscripts on scalable shared memory multiprocessors.
The mapping of computations is based on grouping reduction loop iterations into sets that are further distributed across processors. Iterations belonging to the same set are chosen in such a way that update different entries in the reduction array. That is, the loop distribution implies a conflict-free write distribution of the reduction array. The iteration sets are set up by building a loop-index prefetching array that allows to reorder properly the loop iterations. The prefetching array depends on the contents of the subscript array and, hence, it is determined only once in all static codes, before entering the reduction loop. Such codes include those dealing with static irregular meshes using, for instance, finite element methods. In case of dynamic codes, as in n-body simulations, however, the prefetching array must be recomputed, at least partially, every certain number of iterations of the reduction loop.
The proposed method is general, scalable, and easy to implement on a compiler. In addition it deals in a uniform way with one and multiple subscript arrays. In case of multiple indirection arrays, writes on the reduction vector affecting different sets are solved by defining conflict-free supersets.
A performance evaluation and comparison with other existing techniques is presented. From the experimental results and performance analysis, the proposed method appears as a clear alternative to the array expansion and privatized buffer techniques, usual on state-of-the-art parallelizing compilers, like Polaris or SUIF. The scalability problem that those techniques exhibit is missing in our method, as the memory overhead presented does not depend on the number of processors.