Somesh Singh

Research

My research interests span the broad areas of high-performance computing and parallel computing, with an emphasis on making the processing of irregular workloads on parallel platforms more efficient. The focus of my Ph.D. research was on accelerating parallel graph processing using approximate computing. Details about my Ph.D. dissertation can be found here. My current focus is on improving the performance of sparse tensor computations on parallel architectures.


Publications

*: The authors are listed in alphabetical order by last name.

Algorithms and data structures for hyperedge queries*

Jules Bertrand, Fanny Dufossé, Somesh Singh, Bora Uçar,

in ACM Journal of Experimental Algorithmics (JEA). [Accepted: 2022.]

(pdf) (preprint) (doi)

We consider the problem of querying the existence of hyperedges in hypergraphs. More formally, given a hypergraph, we need to answer queries of the form: "does the following set of vertices form a hyperedge in the given hypergraph?". Our aim is to set up data structures based on hashing to answer these queries as fast as possible. We propose an adaptation of a well-known perfect hashing approach for the problem at hand. We analyze the space and run time complexity of the proposed approach, and experimentally compare it with the state-of-the-art hashing-based solutions. Experiments demonstrate the efficiency of the proposed approach with respect to the state-of-the-art.
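The query interface studied in the paper can be illustrated with a plain hash-set baseline in Python (this is not the paper's perfect-hashing structure; it is a minimal sketch): each hyperedge is stored as a `frozenset`, so a query is a single set-membership test with O(1) average-case but unbounded worst-case lookup cost, which is precisely the gap that perfect hashing closes.

```python
def build_hyperedge_index(hyperedges):
    """Preprocessing: store each hyperedge as a frozenset in a hash set.

    frozenset makes the vertex set hashable and order-insensitive.
    """
    return {frozenset(e) for e in hyperedges}

def is_hyperedge(index, vertices):
    """Query: does the given set of vertices form a hyperedge?"""
    return frozenset(vertices) in index

# Example: a hypergraph with hyperedges {1, 2, 3} and {2, 4}.
index = build_hyperedge_index([(1, 2, 3), (2, 4)])
print(is_hyperedge(index, [3, 1, 2]))  # True: order does not matter
print(is_hyperedge(index, [1, 2]))     # False: a subset is not a hyperedge
```

A perfect hashing scheme replaces the general-purpose hash table with a collision-free function built for the fixed set of hyperedges, bounding the worst-case query time as well.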

An Efficient Parallel Implementation of a Perfect Hashing Method for Hypergraphs*

Somesh Singh, Bora Uçar,

in GrAPL: Workshop on Graphs, Architectures, Programming, and Learning (IPDPSW), 2022.

(pdf) (preprint) (doi)

Querying the existence of an edge in a given graph or hypergraph is a building block in several algorithms. Hashing-based methods can be used for this purpose: the given edges are stored in a hash table in a preprocessing step, and queries are then answered using lookup operations. While general hashing methods have fast lookup times in the average case, the worst-case run time is much higher. Perfect hashing methods take advantage of the fact that all the items to be stored are available up front and construct a collision-free hash function for the given input, resulting in optimal lookup time even in the worst case. We investigate an efficient shared-memory parallel implementation of a recently proposed perfect hashing method for hypergraphs. We experimentally compare the resulting parallel algorithms with the state-of-the-art and demonstrate better run time and scalability on a set of hypergraphs corresponding to real-life sparse tensors.

ParTBC: Faster Estimation of Top-k Betweenness Centrality Vertices on GPU

Somesh Singh, Tejas Shah, Rupesh Nasre,

in ACM TODAES 2021.

(pdf) (doi)

Betweenness centrality (BC) is a popular centrality measure, based on shortest paths, used to quantify the importance of vertices in networks. It is used in a wide array of applications including social network analysis, community detection, clustering, biological network analysis, and several others. The state-of-the-art Brandes' algorithm for computing BC has time complexities of 𝓞(|V|·|E|) and 𝓞(|V|·|E| + |V|²·log|V|) for unweighted and weighted graphs, respectively. Brandes' algorithm has been successfully parallelized on multicore and manycore platforms. However, the computation of vertex BC continues to be time-consuming for large real-world graphs. Often, in practical applications, it suffices to identify the most important vertices in a network, that is, those having the highest BC values; the actual BC values themselves are not needed. In such scenarios, not only is computing the BC of all the vertices unnecessary, but exact BC values need not be computed either. In this work, we attempt to marry controlled approximations with parallelization to estimate the k highest-BC vertices faster, without having to compute the exact BC scores of the vertices. We present a host of techniques to determine the top-k vertices faster, with a small inaccuracy, by computing approximate BC scores of the vertices. Aiding our techniques is a novel vertex-renumbering scheme that makes the graph layout more structured, which results in faster execution of the parallel Brandes' algorithm on the GPU. Our experimental results, on a suite of real-world and synthetic graphs, show that our best-performing technique computes the top-k vertices with an average speedup of 2.5× compared to the exact parallel Brandes' algorithm on the GPU, with an error of less than 6%. Our techniques also exhibit high precision and recall, both in excess of 94%.
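For reference, the exact baseline that the paper approximates is Brandes' algorithm. A minimal sequential Python sketch for unweighted graphs is shown below (the paper's contribution is a parallel, approximate GPU variant; this is only the textbook exact algorithm it is compared against):

```python
from collections import deque

def brandes_bc(adj):
    """Exact (unnormalized) betweenness centrality, Brandes' algorithm.

    adj: dict mapping each vertex to a list of neighbours (unweighted).
    Runs in O(|V|*|E|) time: one BFS plus one dependency
    accumulation pass per source vertex.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}    # number of shortest s-v paths
        dist = {v: -1 for v in adj}    # BFS distance from s
        preds = {v: [] for v in adj}   # predecessors on shortest paths
        sigma[s], dist[s] = 1, 0
        order = []                     # vertices in non-decreasing distance
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Example: in the path 0 - 1 - 2, only vertex 1 lies on shortest paths
# between other vertices (each unordered pair is counted in both directions).
adj = {0: [1], 1: [0, 2], 2: [1]}
print(brandes_bc(adj))  # {0: 0.0, 1: 2.0, 2: 0.0}
```

Because every source vertex contributes an independent BFS, the outer loop parallelizes naturally, which is what GPU implementations exploit; approximation then comes from sampling sources or truncating traversals rather than from the accumulation itself.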

Graffix: Efficient Graph Processing with a Tinge of GPU-Specific Approximations

Somesh Singh, Rupesh Nasre,

in ICPP 2020.

(pdf) (doi) (talk)

Parallelizing graph algorithms on GPUs is challenging due to the irregular memory accesses involved in graph traversals. In particular, three important GPU-specific aspects affect performance: memory coalescing, memory latency, and thread divergence. In this work, we attempt to tame these challenges using approximate computing. We target graph applications on GPUs that can tolerate some degradation in the quality of the output in exchange for obtaining the result faster. We propose three techniques for boosting the performance of graph processing on the GPU by injecting approximations in a controlled manner. The first creates a graph isomorph that brings relevant nodes nearby in memory and adds controlled replicas of nodes to improve coalescing. The second reduces memory latency by adding edges among specific nodes and iteratively processing the resulting well-connected subgraphs inside shared memory. The third normalizes degrees across nodes assigned to a warp to reduce thread divergence. Each of the techniques offers notable performance benefits, and provides a knob to control the amount of inaccuracy added to an execution. We demonstrate the effectiveness of the proposed techniques using a suite of five large graphs with varied characteristics and five popular graph algorithms.
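The first technique's core idea, relabeling vertices so that neighbours receive nearby IDs and hence nearby memory locations, can be illustrated with a simple BFS-order renumbering (a hedged sketch in the spirit of locality-improving reordering; the paper's actual scheme additionally replicates nodes and is GPU-specific):

```python
from collections import deque

def bfs_renumber(adj, root):
    """Relabel vertices in BFS order from root.

    Vertices discovered close together get nearby new IDs, so their
    adjacency data tends to land in nearby memory, improving coalescing.
    adj: dict mapping each vertex to a list of neighbours.
    Returns the adjacency dict rewritten over the new IDs.
    """
    new_id = {root: 0}
    q = deque([root])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in new_id:
                new_id[w] = len(new_id)
                q.append(w)
    # Vertices unreachable from root get the remaining IDs.
    for v in adj:
        if v not in new_id:
            new_id[v] = len(new_id)
    return {new_id[v]: [new_id[w] for w in adj[v]] for v in adj}

# Example: relabel a small star-like graph starting from vertex 0.
adj = {0: [2], 1: [2], 2: [0, 1]}
print(bfs_renumber(adj, 0))
```

The output is an isomorphic graph; since only the labels change, any algorithm run on the renumbered graph produces the same results modulo the relabeling, which is why this part of the transformation is exact while node replication introduces the controlled inaccuracy.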

SixTrack V and runtime environment

with R. De Maria, J. Andersson, V.K. Berglyd Olsen, L. Field, M. Giovannozzi, P.D. Hermes, N. Høimyr, S. Kostoglou, G. Iadarola, E. Mcintosh, A. Mereghetti, J. Molson, D. Pellegrini, T. Persson, M. Schwinzerl, E.H. Maclean, K.N. Sjobak, I. Zacharov,

in International Journal of Modern Physics A 2019. (Invited Paper)

(pdf) (doi)

SixTrack is a single-particle tracking code for high-energy circular accelerators, routinely used at CERN for simulations of the Large Hadron Collider (LHC), its luminosity upgrade (HL-LHC), the Future Circular Collider (FCC), and the Super Proton Synchrotron (SPS). The code is based on a 6D symplectic tracking engine, which is optimized for long-term tracking simulations and delivers fully reproducible results on several platforms. It also includes multiple scattering engines for beam–matter interaction studies, as well as facilities to run integrated simulations with external particle–matter interaction codes. These features differentiate SixTrack from general-purpose, optics-design software. The code recently underwent a major restructuring to merge advanced features, such as multiple ion species, an interface with external codes, and high-performance input/output, into a single branch. This restructuring also removed a large number of compilation flags, enabling or disabling the corresponding functionality with runtime options instead. In the process, the code was moved from the Fortran 77 to the Fortran 2018 standard, which also enabled better modularization. Physics models (beam–beam effects, radio-frequency (RF) multipoles, current-carrying wires, solenoids, and electron lenses) and methods (symplecticity check) have also been reviewed and refined to offer more accurate results. The SixDesk runtime environment allows the user to manage the large batches of simulations required for accurate predictions of the dynamic aperture. SixDesk supports several cluster environments available at CERN, as well as submitting jobs to the LHC@Home volunteer computing project, which enables volunteers to contribute their hardware to CERN simulations. SixTrackLib is a new library aimed at providing a portable and flexible tracking engine for single- and multi-particle problems using the models and formalism of SixTrack. The library is able to run on CPUs as well as on graphics processing units (GPUs). This contribution presents the status of the code, summarizes the main existing features, and provides details about the main development lines of SixTrack, SixDesk, and SixTrackLib.

Optimizing Graph Processing on GPUs using Approximate Computing: Poster

Somesh Singh, Rupesh Nasre,

in PPoPP 2019.

(pdf) (doi)

Parallelizing graph algorithms on GPUs is challenging due to the irregular memory accesses and control flow involved in graph traversals. In this work, we tame these challenges by injecting approximations. In particular, we improve memory coalescing by renumbering and replicating nodes, memory latency by adding edges among specific nodes brought into shared memory, and thread divergence by normalizing degrees across nodes assigned to a warp. Using a suite of graphs with varied characteristics and five popular algorithms, we demonstrate the effectiveness of our proposed techniques. Our approximations for coalescing, memory latency, and thread divergence lead to mean speedups of 1.3×, 1.41×, and 1.06×, achieving accuracies of 83%, 78%, and 84%, respectively.

SixTrack Project: Status, Runtime Environment, and New Developments

with R. De Maria, J. Andersson, V.K. Berglyd Olsen, L. Field, M. Giovannozzi, P.D. Hermes, N. Høimyr, S. Kostoglou, G. Iadarola, E. Mcintosh, A. Mereghetti, J. Molson, D. Pellegrini, T. Persson, M. Schwinzerl, E.H. Maclean, K.N. Sjobak, I. Zacharov,

in International Computational Accelerator Physics Conference 2018.

(pdf) (doi)

SixTrack is a single-particle tracking code for high-energy circular accelerators, routinely used at CERN for simulations of the Large Hadron Collider (LHC), its luminosity upgrade (HL-LHC), the Future Circular Collider (FCC), and the Super Proton Synchrotron (SPS). The code is based on a 6D symplectic tracking engine, which is optimised for long-term tracking simulations and delivers fully reproducible results on several platforms. It also includes multiple scattering engines for beam–matter interaction studies, as well as facilities to run integrated simulations with FLUKA and GEANT4. These features differentiate SixTrack from general-purpose, optics-design software like MAD-X. The code recently underwent a major restructuring to merge advanced features, such as multiple ion species, an interface with external codes, and high-performance input/output (XRootD, HDF5), into a single branch. This restructuring also removed a large number of build flags, enabling or disabling the corresponding functionality at run time instead. In the process, the code was moved from the Fortran 77 to the Fortran 2018 standard, which also enabled better modularisation. Physics models (beam–beam effects, RF multipoles, current-carrying wires, solenoids, and electron lenses) and methods (symplecticity check) have also been reviewed and refined to offer more accurate results. The SixDesk runtime environment allows the user to manage the large batches of simulations required for accurate predictions of the dynamic aperture. SixDesk supports the CERN LSF and HTCondor batch systems, as well as the BOINC infrastructure in the framework of the LHC@Home volunteer computing project. SixTrackLib is a new library aimed at providing a portable and flexible tracking engine for single- and multi-particle problems using the models and formalism of SixTrack. The tracking routines are implemented in parametrised C code that is specialised to run vectorised on CPUs and GPUs using SIMD intrinsics, OpenCL 1.2, and CUDA. This contribution presents the status of the code and an outlook on future developments of SixTrack, SixDesk, and SixTrackLib.

Scalable and Performant Graph Processing on GPUs using Approximate Computing

Somesh Singh, Rupesh Nasre,

in IEEE TMSCS 2018.

(pdf) (doi)

Graph algorithms are widely used in several application domains. It has been established that parallelizing graph algorithms is challenging. The parallelization issues get exacerbated when graphics processing units (GPUs) are used to execute graph algorithms. While prior art has shown effective parallelization of several graph algorithms on GPUs, some algorithms remain expensive. In this work, we address the scalability issues in graph parallelization. In particular, we aim to improve the execution time by tolerating a small amount of approximation in the computation. We study the effects of four heuristic approximations on six graph algorithms with five graphs and show that if an application allows for a small inaccuracy, this can be leveraged to achieve considerable performance benefits. We also study the effects of the approximations on GPU-based processing and provide interesting takeaways.


Other Projects