Processing and analyzing large volumes of data play an increasingly important role in many domains of scientific research. Typical examples of very large scientific datasets include long-running simulations of time-dependent phenomena that periodically generate snapshots of their state, archives of raw and processed remote sensing data, and archives of medical images. A number of systems have been designed to query such large-scale multidimensional datasets or to visualize them. However, support for developing applications that analyze and process such datasets has been lacking.
We are developing language extensions and a compilation framework for expressing applications that process large multidimensional datasets in a high-level data-parallel fashion. We have chosen a dialect of Java for expressing these applications because the applications we target can be conveniently expressed in an object-oriented language and because a number of projects are currently in progress for expressing parallel computations in Java and obtaining good performance on scientific applications. Our dialect of Java includes data-parallel extensions for specifying collections of objects, a parallel for loop, distribution functions, and reduction variables.
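To give a rough sense of the programming pattern these extensions target, the following sketch uses only standard Java (parallel streams), not the dialect itself, whose syntax is not shown here. The class name ReductionSketch, the Sample record, and the accumulate method are hypothetical names chosen for illustration. A list of samples stands in for a collection of objects, the parallel stream stands in for the parallel for loop, and the output grid plays the role of a set of reduction variables. Distribution functions are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReductionSketch {
    // Hypothetical element type: one sample of a 2-D dataset.
    public record Sample(int x, int y, double value) {}

    // Sums sample values into a coarse grid of square cells.
    // Each grid cell behaves like a reduction variable: it is only
    // updated with an associative, commutative operation (+=).
    public static double[] accumulate(List<Sample> samples,
                                      int width, int height, int cell) {
        int cols = (width + cell - 1) / cell;
        int rows = (height + cell - 1) / cell;
        return samples.parallelStream().collect(
            () -> new double[rows * cols],   // one partial grid per worker
            (acc, s) -> acc[(s.y() / cell) * cols + s.x() / cell] += s.value(),
            (a, b) -> {                      // merge two partial grids
                for (int i = 0; i < a.length; i++) {
                    a[i] += b[i];
                }
            });
    }

    public static void main(String[] args) {
        List<Sample> samples = new ArrayList<>();
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                samples.add(new Sample(x, y, 1.0));
            }
        }
        // A 4x4 input reduced into 2x2 cells, four samples per cell.
        System.out.println(Arrays.toString(accumulate(samples, 4, 4, 2)));
        // prints [4.0, 4.0, 4.0, 4.0]
    }
}
```

Because the updates are restricted to an associative and commutative operation, the iterations can run in any order and the partial grids can be merged afterwards, which is the property a reduction variable is meant to expose to a compiler.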
Our compiler will analyze nested parallel loops and optimize the processing of datasets through the use of an existing runtime system, the Active Data Repository (ADR), developed at the University of Maryland. Several compilation issues must be addressed to generate code for such applications, and more details of the compiler/runtime interface remain to be worked out. We are also developing a set of novel optimization techniques to obtain high performance from the compiled code. Details of these techniques and issues will be presented in the full paper.