Shared-memory multiprocessor (SMP) machines have become widely available. As the user community grows, so does the importance of compilers that can translate standard, sequential programs for this machine class. Substantial research has gone into sophisticated parallelization techniques, but little attention has been paid to the backend compiler, whose task is to translate the parallel program produced by the preprocessor into machine code. In this paper we focus on three issues related to the backend compiler and its interface to the preprocessor: (1) whether it is appropriate for the preprocessor to express the detected parallelism in the common loop-oriented form, (2) the sources of inefficiency in fully parallel SMP programs that exhibit good cache locality, and (3) the portability of these programs across SMP machines.
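To make issue (1) concrete, the sketch below contrasts the two program forms. It is our illustration, not code from the paper, and is written in C with OpenMP for brevity (the benchmark programs themselves are Fortran); the thread-based part mimics explicit per-thread iteration bounds, while the actual Polaris-generated thread code and its runtime interface differ in detail.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000
    static double a[N], b[N];

    int main(void) {
        for (int i = 0; i < N; i++) b[i] = (double)i;

        /* Loop-oriented form: one directive per parallel loop; the
         * backend compiler and runtime map iterations onto threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        /* Thread-based form: explicit per-thread bounds, analogous to
         * what a preprocessor emits when generating thread code
         * directly instead of loop directives. */
        #pragma omp parallel
        {
            int tid   = omp_get_thread_num();
            int nth   = omp_get_num_threads();
            int chunk = (N + nth - 1) / nth;
            int lo    = tid * chunk;
            int hi    = (lo + chunk < N) ? lo + chunk : N;
            for (int i = lo; i < hi; i++)
                a[i] = 2.0 * b[i];
        }

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }

In the loop-oriented form the mapping of iterations to threads is left to the backend compiler and runtime; in the thread-based form the preprocessor fixes that mapping itself, which is precisely the trade-off issue (1) examines.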
In our experiments we extended the Polaris compiler so that it can generate thread-based code directly. We compare the performance of this code with Polaris' loop-parallel OpenMP output and with the architecture-specific directive languages available on the Sun Enterprise and SGI Origin systems, analyzing in detail the performance of several parallel Perfect Benchmarks codes. Our main findings are that (1) overall, the loop-parallel representation incurs no significant performance disadvantage; (2) substantial performance differences are nevertheless attributable to the instruction efficiency of the backend compilers, which is influenced by the data-sharing semantics of the parallel constructs, and using a read-only data attribute we were able to improve execution time by up to 48%; and (3) both the OpenMP and the thread-based program forms are functionally portable, but can result in substantially different performance on the two machines.
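As a hedged illustration of finding (2): the read-only data attribute referred to above is a directive-level hint, and the closest standard OpenMP analogue for a scalar is firstprivate, which gives each thread an unaliased private copy that the backend compiler can keep in a register instead of conservatively reloading a shared location on every access. The function below is our own sketch, not the paper's code; the names scale, a, b, n, and s are hypothetical.

    /* Sketch under assumed names: 's' is only read inside the loop.
     * Declaring it firstprivate tells the backend compiler that each
     * thread owns a private, initialized copy, so the value can live
     * in a register rather than being reloaded from shared memory. */
    void scale(double *restrict a, const double *restrict b,
               int n, double s) {
        #pragma omp parallel for firstprivate(s)
        for (int i = 0; i < n; i++)
            a[i] = s * b[i];
    }

Without such an attribute, a conservative backend compiler must assume the shared scalar could be modified by another thread, and the resulting redundant loads are one source of the instruction-efficiency differences the paper measures.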