Minisymposium: Sparse matrix and tensor computations at PMAA16

The 9th International Workshop on Parallel Matrix Algorithms and Applications (PMAA16) was held in Bordeaux France, July 6--8, 2016. We (Albert-Jan and Bora) have organized a minisymposium on sparse matrix and tensor computations. PMAA16's web page contains the information on talks, including the speaker names, authors, and the abstracts. The speakers of the minisymposium kindly agreed to share the slides at this page.

Sparse matrix computations are hard for a variety of reasons: one is the low arithmetic intensity of typical sparse computation, resulting in heavily bandwidth-bound algorithms. Another is the irregular and unpredictable access patterns sparse computations induce, resulting in low cache use efficiency for a wide range of applications. These facts make sparse computations extremely sensitive to implementation aspects such as data structure organisation and compression, and also to architectural features such as NUMA domains and cache hierarchies. Optimisations that handle all the above have may lead to complex data layouts and computational strategies, while the sparse computations thus sped up are commonly but one computational hot spot in a much larger numerical application: an equally important aspect of optimising sparse computations thus also is the usability of the resulting techniques.

This minisymposium furthermore aims to demonstrate the applicability of sparse matrix computing techniques to more generic use cases, such as sparse tensor computations. The generalisation also extends to the hardware domain, where existing methods may be augmented for use on accelerators such as the Xeon Phi or the GPU.

The speaker names are in italics.

Organizers

Bora Uçar (CNRS and ENS Lyon, France)
Albert-Jan Yzelman (Huawei Technologies, France).

Talks

Title: High performance sparse tensor decomposition

Authors: Oguz Kaya (Inria and ENS Lyon) and Bora Uçar (CNRS and ENS Lyon, France

Abstract: Tensor methods have increasingly been employed to better analyze datasets with many features and higher dimensionality than matrix-based methods. Tucker tensor decomposition has successfully been applied to real-world problems such as web search, hyperlink analysis of web pages, and recommender systems, albeit being computationally expensive for large datasets. This talk focuses on an efficient computation and parallelization of the Higher Order Orthogonal Iteration (HOOI) algorithm to compute the Tucker decomposition of very big sparse tensors and enable tensor-based analysis of such datasets. We investigate the parallelization of the major steps of the HOOI algorithm such as tensor-times-matrix-chain multiply and truncated SVD operations. We then examine reducing the load imbalance and the communication cost of the parallel algorithm for better scalability. Finally, we present scalability results up to 4096 cores on 256 nodes of an IBM BlueGene/Q supercomputer of the MPI+OpenMP parallel implementation of the algorithm.

Slides.
Title: An Exploration of Optimization Algorithms for High Performance Tensor Completion

Authors: Shaden Smith (Computer Science and Engineering Department, University of Minnesota), Jongsoo Park (Intel, USA), and George Karypis (Computer Science and Engineering Department, University of Minnesota)

Abstract: Many domains rely on multi-way data, which are variables that interact in three or more dimensions, or modes. An electronic health record is an interaction between variables such as a patient, symptoms, diagnosis, medical procedures, and outcome. Similarly, how much a customer will like a product is an interaction between the customer, product, and the context in which the purchase occured (e.g., date of purchase or location). Analyzing multi-way data can provide valuable insights about the underlying relationships of the interacting variables. Utilizing these insights, a doctor would be more equipped to reach a provide a successful treatment and a retailer would be able to better recommend products that meet the customer's needs and preferences. Tensors are a natural way of representing multi-way data. Tensor completion is the problem of estimating or recovering missing values of a tensor. For example, discovering phenotypes in electronic health records is improved by tensor completion due to missing and noisy data. Similarly, predicting how a customer will rate a product under some context can be thought of as estimating a missing value in a tensor.

Multi-way data analysis follows the assumption that the data of interest follows a low-rank model that can be discovered. Tensor factorization is a technique that reduces a tensor to a low-rank representation, which can then be used by applications or domain experts. Tensor completion is often accomplished by finding a low-rank tensor factorization for the known data, and if a low-rank model exists then it can be used to predict the unknown data. A subtle, but important constraint is that the factorization must only capture the non-zero (or observed) entries of the tensor. The remaining entries are treated as missing values, not actual zeros as is often the case in other sparse tensor and matrix operations.

Tensor completion is challenging on modern processors for several reasons. Modern architectures have lower ratios of memory bandwidth to compute capabilities, which is detrimental to tensors which have highly unstructured access patterns and three or more indices per non-zero value. Furthermore, processors have more parallelism and load balance is difficult to achieve because tensors do not have uniformly distributed non-zeros and often have a combination of long, sparse modes (e.g., patients or customers) and short, dense modes (e.g., medical procedures or temporal information).

The high performance computing community has addressed some of these challenges in recent years, with research spanning both shared-memory and distributed-memory systems. However, the techniques and optimizations that underlie these methods are applied to factorizations that are not suitable for tensor completion due to the treatment of missing entries.

In this work, we explore the task of high performance tensor completion with three popular optimizations algorithms: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We address issues on shared- and distributed-memory systems such as memory- and operation-efficient algorithms, cache locality, load balance, and communication. Our contributions include: i) Hybrid MPI+OpenMP implementations of ALS, SGD, and CCD++ that utilize compressed tensor representations in order to improve cache locality and reduce memory consumption and the number of FLOPs performed; ii) A distributed-memory SGD algorithm that combines stratification and asynchronous updates to improve scalability; iii) A method of load balancing in the presence of sparse and dense modes; iv) An experimental evaluation with several real-world datasets on up to 1024 cores. Our ALS and CCD++ algorithms are 153x and 21.4x faster than state-of-the-art parallel methods, respectively. This effectively reduces solution time from hours to seconds; v) We show that depending on the underlying parallel architecture and the characteristics of the desired solution, the best performing optimization method varies.

Slides.
Title: An Empirical Study of Sparse BLAS on Emerging Heterogeneous Processors

Authors: Weifeng Liu (University of Copenhagen, Denmark) and Brian Vinter (University of Copenhagen, Denmark)

Abstract: In recent years, low-throughput GPUs have been integrated onto the same chip as the CPU. AMD APUs, Intel CPU-GPU SoCs and nVidia Tegra are representatives in this trend. The newest hardware progress, such as unified virtual address space and shared last level caches, makes tightly coupled CPU-GPU heterogeneous processors a promising tool for scientific computing. This talk will focus on our empirical study of performance behaviors of sparse BLAS routines (e.g., SpTRANS, SpMV, SpTRSV and SpGEMM) on emerging heterogeneous processors. A performance comparison with modern multi-core and many-core processors will also be presented.

Slides.
Title: Making effective sparse matrix-vector multiplication both portable and usable

Authors: Albert-Jan Yzelman (Huawei Technologies, France)

Abstract:Past research into efficient formats for sparse matrices has identified three major performance aspects: maximising bandwidth use, cache reuse, and data locality. On any architecture, whether they be contemporary CPUs, accelerators such as GPUs, or co-processors like the Intel Xeon Phi, the data structure additionally is tuned to make maximum use of their respective parallelisation capabilities: multi-threading, vectorisation, or both.

This led to a veritable plethora of data structures that, in practice, remain mostly unused. For reasons of portability and usability, the de facto data structure remains to be Compressed Row Storage (CRS, also known as CSR), or even the simple coordinate format (COO); while both CRS and COO have disadvantages are theoretically described, that are furthermore and repeatedly have shown to lead to practical and noticeable loss in performance. Are we thus perhaps in a case of premature optimisation, and should we focus on optimising CRS and COO instead of trying to replace them? Or is it that performance should not be the main metric for success, and that future research should address more the portability and usability aspects of their proposed solutions?

While the state of the art is exploring both directions, this talk focuses on the latter: we explore the high-level trade-offs on the possible data structures when taking into account bandwidth, cache, and locality from a performance perspective; we present a framework for dealing with vectorisation from a generic point of view; and we discuss various use cases, specifically, how to hide the complexity of high-performance data structures from application programmers.

Slides.