2020
For decades, developers have been productive writing software by composing optimized libraries and functions written by other developers. Though hardware has evolved significantly over this time (the end of Moore's law, the increasing ubiquity of parallelism, and the emergence of new accelerators), many of the common interfaces for composing software have remained unchanged since their original design. This lack of evolution has serious performance consequences for modern applications. For example, the growing gap between memory and processor speeds means that applications composing even hand-tuned libraries can spend more time moving data through main memory between function calls than performing computation. The problem is worse still for applications that interface with new hardware accelerators such as GPUs. Application writers can work around these bottlenecks manually, but only at the expense of programmability. In short, the interfaces for composing even optimized software modules are no longer sufficient to make the best use of modern hardware.

This dissertation proposes designing new interfaces for efficient software composition on modern hardware, leveraging algebraic properties intrinsic to software APIs to unlock new optimizations. We demonstrate this idea with three new composition interfaces. The first, Weld, uses a functional intermediate representation (IR) to capture the parallel structure of data analytics workloads underneath existing APIs, and enables powerful data-movement optimizations over this IR to optimize applications end-to-end. The second, split annotations (SAs), also targets data-movement optimization and parallelization, but instead layers annotations on top of existing functions that define an algebra specifying how data passed between functions can be partitioned and recombined, enabling cross-function pipelining. The third, raw filtering, optimizes data loading in data-intensive systems by redefining the interface between data parsers and query engines to improve CPU efficiency.

Our implementations of these interfaces show substantial performance benefits from rethinking the interfaces between software modules, and expose the limitations of established ones. Weld and SAs accelerate data science pipelines by over 100x in some multicore workloads by enabling data-movement optimizations such as pipelining on top of existing libraries such as NumPy and Pandas. We also show that Weld can target new parallel accelerators, such as vector processors and GPUs, and that SAs deliver these speedups even on black-box libraries, without any modification of library code. Finally, the I/O optimizations in raw filtering improve end-to-end query execution time by over 9x in distributed systems such as Spark SQL when processing semi-structured data such as JSON.
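To make the data-movement problem concrete, the sketch below contrasts idiomatic NumPy composition, which materializes a full-size temporary per call, with a manually cache-blocked loop that approximates the effect of the loop fusion Weld performs over its IR. This is an illustration of the optimization, not Weld's actual generated code.

    import numpy as np

    def eager(x):
        # Idiomatic composition: each operation materializes a full-size
        # temporary, so x streams through main memory once per call.
        return np.sqrt(x * x + 1.0)

    def fused(x, chunk=1 << 16):
        # Cache-blocked version: each element stays in cache across all
        # three operations, roughly the effect of the fused loops Weld
        # generates from its functional IR.
        out = np.empty_like(x)
        for i in range(0, x.size, chunk):
            c = x[i:i + chunk]
            out[i:i + chunk] = np.sqrt(c * c + 1.0)
        return out

    x = np.random.rand(10_000_000)
    assert np.allclose(eager(x), fused(x))

The split-annotation algebra can likewise be sketched as a decorator that records how a black-box function's inputs partition and how its outputs recombine. The names here (splittable, split_rows, merge_rows) are hypothetical stand-ins for illustration, not the actual SA API.

    import numpy as np

    def splittable(split, merge):
        # Hypothetical SA-style annotation: given a partitioning function
        # and a recombination function, run the unmodified library call
        # piecewise. (The real system uses such annotations to pipeline
        # consecutive annotated calls through the CPU cache.)
        def wrap(f):
            def run(*args, chunks=4):
                pieces = zip(*(split(a, chunks) for a in args))
                return merge([f(*p) for p in pieces])
            return run
        return wrap

    def split_rows(a, n):
        return np.array_split(a, n)  # how inputs partition
    merge_rows = np.concatenate     # how outputs recombine

    @splittable(split_rows, merge_rows)
    def add(a, b):
        return np.add(a, b)  # the library function itself is untouched

    x, y = np.arange(8.0), np.ones(8)
    assert np.allclose(add(x, y), x + 1.0)

Raw filtering admits a similarly small sketch: reject records with a cheap scan over the raw bytes before paying for a full parse, then re-check the predicate on the records that survive. The JSON field and predicate below are invented for illustration.

    import json

    def error_records(raw_lines, needle=b'"level": "error"'):
        hits = []
        for line in raw_lines:
            # Fast path: a byte-level substring test on the unparsed input.
            # False positives are fine because the parsed predicate below
            # re-checks; the needle must be chosen so that true matches
            # cannot be missed for the data's on-disk format.
            if needle not in line:
                continue
            rec = json.loads(line)           # slow path: full JSON parse
            if rec.get("level") == "error":  # re-check the real predicate
                hits.append(rec)
        return hits

    raw = [b'{"level": "info", "msg": "ok"}',
           b'{"level": "error", "msg": "disk full"}']
    print(error_records(raw))  # [{'level': 'error', 'msg': 'disk full'}]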
1995
Abstract: "A 3-dimensional (3-D) matrix multiplication algorithm for massively parallel processing systems is presented. The P processors are configured as a 'virtual' processing cube with dimensions p1, p2, and p3 proportional to the matrices' dimensions M, N, and K. Each processor performs a single local matrix multiplication of size M/p1 x N/p2 x K/p3 on one of the sub-cubes of the computational space. Before the local computation can be carried out, each sub-cube needs to receive a single sub-matrix of A and of B. After the local matrix multiplication has completed, K/p3 sub-matrices of this product have to be sent to their respective destination processors. The 3-D parallel matrix multiplication approach requires a factor of P^(1/6) less communication than the 2-D parallel algorithms. This algorithm has been implemented on IBM POWERparallel SP-2 systems (up to 216 nodes) and has yielded close to the peak performance of the machine. The algorithm has been combined with Winograd's variant of Strassen's algorithm to achieve 'super-linear' speed-up. The performance achieved exceeds the theoretical peak of the system."
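A serial NumPy simulation makes the decomposition concrete: the block multiplications are indexed by a triple (i, j, t) ranging over a p1 x p2 x p3 cube, and the accumulation over t plays the role of the reduction along the cube's third axis. (For scale, the P^(1/6) communication saving is a factor of 2 at P = 64 and 4 at P = 4096.) This sketch illustrates the data decomposition only, not the parallel implementation.

    import numpy as np

    def matmul_3d(A, B, p1, p2, p3):
        # "Processor" (i, j, t) multiplies its M/p1 x K/p3 block of A by
        # its K/p3 x N/p2 block of B; summing over t corresponds to the
        # communication/reduction phase along the cube's third dimension.
        M, K = A.shape
        _, N = B.shape
        assert M % p1 == 0 and N % p2 == 0 and K % p3 == 0
        m, n, k = M // p1, N // p2, K // p3
        C = np.zeros((M, N))
        for i in range(p1):
            for j in range(p2):
                for t in range(p3):
                    C[i*m:(i+1)*m, j*n:(j+1)*n] += (
                        A[i*m:(i+1)*m, t*k:(t+1)*k] @
                        B[t*k:(t+1)*k, j*n:(j+1)*n])
        return C

    A, B = np.random.rand(8, 6), np.random.rand(6, 4)
    assert np.allclose(matmul_3d(A, B, p1=2, p2=2, p3=3), A @ B)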