The SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16) took place at the Université Pierre et Marie Curie, Cordeliers Campus, Paris, France. We (Tammy, Grey, and Bora) organized a minisymposium on tensor computations. The SIAM PP16 web page contains the talk information, including speaker names, authors, and abstracts. The speakers of the minisymposium kindly agreed to share their slides on this page.
Tensors, or multidimensional arrays, are a natural way to represent high-dimensional data arising in a multitude of applications. Tensor decompositions, such as the CANDECOMP/PARAFAC and Tucker models, help to identify latent structure, achieve data compression, and enable other tools of data analysis. This minisymposium identifies the key operations in algorithms for tensor decomposition. Its focus is on recent developments in efficient, parallel algorithms and implementations for handling large tensors. The eight talks cover three key operations in different shared-memory and distributed-memory parallel computing environments and investigate applications in scientific computing and recommender systems.
The speaker names are in italics.
Title: Parallel Tensor Compression for Large-Scale Scientific Data
Authors: Woody N. Austin (University of Texas at Austin, USA), Grey Ballard and Tamara G. Kolda (Sandia National Laboratories, USA)
Abstract: Today's computational simulations often generate more data than can be stored or analyzed. For multi-dimensional data, we can use tensor approximations like the Tucker decomposition to expose low-rank structure and compress the data. This talk presents a distributed-memory parallel algorithm for computing the Tucker decomposition of general dense tensors. We show performance results and demonstrate large compression ratios (with small error) for data generated by combustion science applications.
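As a rough single-node illustration of the kind of compression this talk targets, the sketch below computes a truncated higher-order SVD (one standard way to obtain a Tucker approximation) with plain NumPy and reports the compression ratio and relative error. The function names and synthetic data are ours; the distributed-memory algorithm, data layout, and combustion data sets from the talk are not reproduced here.

```python
# A minimal single-node sketch of Tucker compression via truncated HOSVD,
# using plain NumPy. Illustration only; not the distributed algorithm.
import numpy as np

def mode_multiply(T, M, mode):
    """Mode-n product: multiply tensor T by matrix M along axis `mode`."""
    return np.moveaxis(np.tensordot(M, T, axes=([1], [mode])), 0, mode)

def tucker_hosvd(X, ranks):
    """Truncated higher-order SVD: returns (core tensor, factor matrices)."""
    factors = []
    for mode, r in enumerate(ranks):
        # Mode-n unfolding of X, then its leading r left singular vectors.
        Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
        U, _, _ = np.linalg.svd(Xn, full_matrices=False)
        factors.append(U[:, :r])
    G = X
    for mode, U in enumerate(factors):
        G = mode_multiply(G, U.T, mode)   # project onto the factor bases
    return G, factors

def tucker_reconstruct(G, factors):
    """Expand a Tucker core and factor matrices back to a full tensor."""
    X = G
    for mode, U in enumerate(factors):
        X = mode_multiply(X, U, mode)
    return X

rng = np.random.default_rng(0)
ranks = (5, 5, 5)
# Synthetic low-rank data plus noise, standing in for simulation output.
true_core = rng.standard_normal(ranks)
true_factors = [np.linalg.qr(rng.standard_normal((60, r)))[0] for r in ranks]
X = tucker_reconstruct(true_core, true_factors)
X += 1e-3 * rng.standard_normal(X.shape)

G, factors = tucker_hosvd(X, ranks)
Xhat = tucker_reconstruct(G, factors)
stored = G.size + sum(U.size for U in factors)
print("compression ratio:", X.size / stored)
print("relative error:", np.linalg.norm(X - Xhat) / np.linalg.norm(X))
```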
Title: Parallelizing and Scaling Tensor Computations
Authors: Muthu M. Baskaran, Benoit Meister, and Richard Lethin (Reservoir Labs, USA)
Abstract: Tensor decomposition is a powerful data analysis technique for extracting information from large-scale multidimensional data and is widely used in diverse applications. Scaling tensor decomposition algorithms on large computing systems is key for realizing the applicability of these powerful techniques on large datasets. We present our tool, the ENSIGN Tensor Toolbox (ETTB), which optimizes tensor decompositions through novel techniques for improving concurrency, data locality, load balance, and asynchrony in the execution of tensor computations, all of which are critical for scaling on large systems.
Title: Coupled Matrix and Tensor Factorizations: Models, Algorithms, and Computational Issues
Authors: Evrim Acar (University of Copenhagen, Denmark)
Abstract: In the era of data science, it no longer suffices to analyze a single data set if the goal is to unravel the dynamics of a complex system. Therefore, data fusion, i.e., extracting knowledge through the fusion of complementary data sets, is a topic of interest in many fields. We formulate data fusion as a coupled matrix and tensor factorization problem, and discuss extensions of the model as well as various algorithmic approaches for fitting such data fusion models.
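For readers unfamiliar with this model class, a common coupled matrix and tensor factorization (CMTF) objective couples a third-order tensor X with a matrix Y through a factor shared in their first mode. The formulation below is a standard one from the literature, given only as an illustration; the talk's extensions of the model may differ:

```latex
\min_{\mathbf{A},\mathbf{B},\mathbf{C},\mathbf{V}} \;
\bigl\| \mathcal{X} - [\![ \mathbf{A}, \mathbf{B}, \mathbf{C} ]\!] \bigr\|_F^2
\;+\;
\bigl\| \mathbf{Y} - \mathbf{A}\mathbf{V}^{\mathsf{T}} \bigr\|_F^2
```

Here [[A, B, C]] denotes the CP model (the sum of rank-one terms a_r ∘ b_r ∘ c_r), and the shared factor A is what fuses the two data sets.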
Title: Parallel Algorithms for Tensor Completion and Recommender Systems
Authors: Lars Karlsson (Umeå University, Sweden), Daniel Kressner (EPFL, Switzerland) and Andre Uschmajew (University of Bonn, Germany)
Abstract: Low-rank tensor completion is concerned with filling in missing entries in multi-dimensional data. It has proven its versatility in numerous applications, including context-aware recommender systems and multivariate function learning. To handle large-scale datasets and applications that feature high dimensions, the development of distributed algorithms is central. In this talk, we will discuss novel, highly scalable algorithms based on a combination of the canonical polyadic (CP) tensor format with block coordinate descent methods. Although similar algorithms have been proposed for the matrix case, the case of higher dimensions gives rise to a number of new challenges and requires a different paradigm for data distribution. The convergence of our algorithms is analyzed and numerical experiments illustrate their performance on distributed-memory architectures for tensors from a range of different applications.
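As context for the problem being solved, a generic low-rank tensor completion objective in the CP format can be written as below. This is a standard formulation given for illustration; the talk's precise objective, regularization, and data distribution scheme may differ:

```latex
\min_{A^{(1)},\dots,A^{(N)}} \;
\bigl\| \, P_\Omega\!\bigl( \mathcal{X} - [\![ A^{(1)}, \dots, A^{(N)} ]\!] \bigr) \bigr\|_F^2
```

where Omega is the set of observed entries, P_Omega zeroes out the unobserved ones, and [[A^(1), ..., A^(N)]] is the rank-R CP model. Block coordinate descent cycles through the factor matrices (or blocks of their rows), updating one block while holding the others fixed.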
Title: High Performance Parallel Sparse Tensor Decompositions
Authors: Oguz Kaya (Inria and ENS Lyon, France) and Bora Uçar (CNRS and ENS Lyon, France)
Abstract: We investigate an efficient parallelization of the iterative algorithms for computing CANDECOMP-PARAFAC (CP) and Tucker sparse tensor decompositions on shared- and distributed-memory systems. The key operations of each iteration of these algorithms are the matricized tensor times Khatri-Rao product (MTTKRP) and the n-mode tensor-matrix product. These operations amount to element-wise vector multiplication or vector outer product, followed by a reduction, depending on the sparsity of the tensor. We first investigate fine-grain and coarse-grain task definitions for these operations, and propose methods to achieve load balance and reduce the communication requirements of their parallel execution on distributed-memory systems. Then we investigate efficient parallelization issues on shared-memory systems. Based on these investigations, we implement the alternating least squares method for both decompositions in a software library. The talk will contain results for both decomposition schemes on large computing systems with real-world sparse tensors. We have a full-featured implementation of the CP decomposition scheme; development for the Tucker decomposition is in progress.
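To make the "element-wise multiply, then reduce" structure of MTTKRP concrete, here is a minimal sequential sketch for a third-order sparse tensor in coordinate (COO) format. The names are ours, and the fine-grain/coarse-grain task definitions and communication optimizations discussed in the talk are not reproduced:

```python
# Minimal sequential MTTKRP kernel for a third-order sparse tensor in COO
# format: for each nonzero X[i,j,k], rows B[j,:] and C[k,:] are multiplied
# element-wise and the result is reduced (accumulated) into row i of M.
import numpy as np

def mttkrp_mode0(indices, values, B, C, n_rows):
    """MTTKRP in mode 0, computed nonzero by nonzero.

    indices : (nnz, 3) integer array of (i, j, k) coordinates
    values  : (nnz,) array of nonzero values
    B, C    : factor matrices for modes 1 and 2, each with R columns
    """
    R = B.shape[1]
    M = np.zeros((n_rows, R))
    for (i, j, k), v in zip(indices, values):
        # element-wise vector multiplication, then reduction into row i
        M[i, :] += v * B[j, :] * C[k, :]
    return M

# Tiny example: a 4 x 3 x 2 tensor with 3 nonzeros and rank-2 factors.
indices = np.array([[0, 1, 0], [2, 0, 1], [3, 2, 1]])
values = np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(42)
B, C = rng.standard_normal((3, 2)), rng.standard_normal((2, 2))
print(mttkrp_mode0(indices, values, B, C, n_rows=4))
```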
Title: Efficient Factorization with Compressed Sparse Tensors
Authors: Shaden Smith (University of Minnesota, USA) and George Karypis (University of Minnesota and Army HPC Research Center, USA)
Abstract: Computing the Canonical Polyadic Decomposition is bottlenecked by multiplying a sparse tensor by several dense matrices (MTTKRP). MTTKRP algorithms either use a compressed tensor specific to each dimension of the data or a single uncompressed tensor at the cost of additional operations. We introduce the compressed sparse fiber data structure and several parallel MTTKRP algorithms that need only a single compressed tensor. We reduce memory by 58% while achieving 81% of the parallel performance.
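To convey the intuition behind a compressed tensor representation, the toy sketch below groups the nonzeros of a third-order tensor into a mode-ordered tree so that index values shared by many nonzeros are stored only once. This nested-dictionary version is our simplified illustration of the compression idea only; the compressed sparse fiber structure introduced in the talk uses flat pointer and index arrays (in the spirit of CSR) and is designed for parallel MTTKRP:

```python
# Toy illustration of grouping COO nonzeros into slices and fibers so that
# repeated indices are stored once. Not the actual CSF data structure.
from collections import defaultdict

def build_fiber_tree(indices, values):
    """Group COO nonzeros (i, j, k) -> value into a mode-ordered tree."""
    tree = defaultdict(lambda: defaultdict(dict))
    for (i, j, k), v in zip(indices, values):
        tree[i][j][k] = v   # i stored once per slice, j once per fiber
    return tree

indices = [(0, 1, 0), (0, 1, 3), (0, 2, 1), (4, 1, 0)]
values = [1.0, 2.0, -1.0, 0.5]
tree = build_fiber_tree(indices, values)
for i, slice_ in tree.items():
    for j, fiber in slice_.items():
        print(f"slice i={i}, fiber j={j}: {sorted(fiber.items())}")
```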
Title: Low Rank Bilinear Algorithms for Symmetric Tensor Contractions
Authors: Edgar Solomonik (ETH Zürich, Switzerland)
Abstract: We demonstrate algebraic reorganizations that reduce by a factor of two the number of scalar multiplications (the bilinear algorithm rank) needed for matrix-vector multiplication with a symmetric matrix and for the rank-two symmetric update. We then introduce symmetry-preserving algorithms, which lower the number of scalar multiplications needed for most types of symmetric tensor contractions. When nested with matrix multiplication, these algorithms reduce the total number of operations needed for squaring matrices and for contracting partially symmetric tensors.
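To make the first claim concrete, the sketch below shows one well-known reorganization of the symmetric matrix-vector product in which each off-diagonal entry A[i,j] enters a single multiplication A[i,j]*(x[i]+x[j]) that is reused in both output entries, bringing the multiplication count to roughly n^2/2 at the price of extra additions. This is our illustrative rendering of the idea, not the authors' code, and the tensor generalization presented in the talk is not shown:

```python
# Symmetric matrix-vector product using ~n^2/2 scalar multiplications by
# reusing A[i,j]*(x[i]+x[j]) for both b[i] and b[j]; extra additions are
# traded for fewer multiplications.
import numpy as np

def symv_half_mults(A, x):
    """b = A @ x for symmetric A with roughly half the multiplications."""
    n = len(x)
    row_off_diag_sum = A.sum(axis=1) - np.diag(A)   # additions only
    b = np.diag(A) * x                              # n multiplications
    b -= row_off_diag_sum * x                       # n more multiplications
    for i in range(n):
        for j in range(i + 1, n):
            w = A[i, j] * (x[i] + x[j])             # one multiply, used twice
            b[i] += w
            b[j] += w
    return b

rng = np.random.default_rng(1)
S = rng.standard_normal((6, 6))
A = S + S.T                                         # symmetric test matrix
x = rng.standard_normal(6)
print(np.allclose(symv_half_mults(A, x), A @ x))    # -> True
```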
Title: An Input-Adaptive and In-Place Approach to Dense Tensor-Times-Matrix Multiply
Authors: Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc (Georgia Institute of Technology, USA)
Abstract: This talk describes a novel framework, called InTensLi ("intensely"), for producing fast single-node implementations of dense tensor-times-matrix multiply (TTM) of arbitrary dimension. Whereas conventional implementations of TTM rely on explicitly converting the input tensor operand into a matrix in order to use any available fast general matrix-matrix multiply (GEMM) implementation, our framework's strategy is to carry out the TTM in place, avoiding this copy. As the resulting implementations expose tuning parameters, the talk also describes a heuristic empirical model for selecting an optimal configuration based on the TTM's inputs. When compared to widely used single-node TTM implementations available in the Tensor Toolbox and the Cyclops Tensor Framework (CTF), InTensLi's in-place and input-adaptive TTM implementations achieve 4x and 13x speedups, respectively, showing GEMM-like performance on a variety of input sizes.
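To illustrate the difference between the two strategies on a single node, the sketch below contrasts a conventional matricize-GEMM-fold TTM with the same mode-1 product written as a direct contraction in NumPy. It is only a conceptual sketch under our own naming; InTensLi's generated loop nests, in-place data access, and tuning heuristics are not reproduced, and NumPy may still rearrange data internally in the einsum call:

```python
# Conceptual comparison of a conventional TTM (explicit unfolding + GEMM +
# fold) with the same mode-1 product expressed as a direct contraction.
import numpy as np

def ttm_via_unfolding(X, U, mode):
    """Conventional TTM: matricize X along `mode`, multiply by U, fold back."""
    Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)   # explicit copy
    Yn = U @ Xn                                               # GEMM
    new_shape = (U.shape[0],) + tuple(np.delete(X.shape, mode))
    return np.moveaxis(Yn.reshape(new_shape), 0, mode)

def ttm_mode1_direct(X, U):
    """Mode-1 TTM of a 3-way tensor as a single contraction, with no
    explicit matricization or folding in user code."""
    return np.einsum('rj,ijk->irk', U, X)

rng = np.random.default_rng(7)
X = rng.standard_normal((8, 9, 10))   # dense 3-way tensor
U = rng.standard_normal((4, 9))       # matrix applied in mode 1
Y1 = ttm_via_unfolding(X, U, mode=1)
Y2 = ttm_mode1_direct(X, U)
print(Y1.shape, np.allclose(Y1, Y2))  # (8, 4, 10) True
```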