Shahram Ghandeharizadeh

ACM Software System Award

USA - 2008

The Gamma Parallel Database System

Citation

For Gamma, the first embodiment of a parallel, "shared nothing" database system running on a cluster of commodity computers, using data partitioning and innovative parallel query execution strategies.

The Gamma Database Machine Project built a prototype parallel relational database system from 1984 to 1992 at the University of Wisconsin-Madison. Gamma ran on a cluster of commodity computers (initially Digital VAX 11/750s) connected using standard networking technology. The main contributions of Gamma include the design and implementation of multiple partitioning techniques (round-robin, range, and hash) for distributing the tuples of each relation across multiple machines and disks, along with parallel and pipelined algorithms for executing relational queries in such an environment. Especially relevant were Gamma's join algorithms, including parallel versions of the hybrid-hash join algorithm; its use of sampling techniques to detect and resolve data skew; its use of bit-vector filters to improve join performance; and its novel data replication strategies, designed to minimize the impact of node and disk failures. Through extensive benchmarking, Gamma was the first parallel DBMS to publish results demonstrating practical "linear scalability", i.e., the ability to run the same query with the same performance on more and more data simply by adding hardware nodes in proportion.
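The three partitioning techniques named above can be illustrated with a minimal sketch. This is not Gamma's code: the function names, the `Row` type, and the use of Python's built-in `hash` are assumptions made for illustration, and the join shown is a plain partitioned hash join rather than Gamma's hybrid-hash variant.

```python
from bisect import bisect_right
from typing import Any, Sequence

Row = dict[str, Any]  # hypothetical tuple representation: column name -> value


def round_robin_partition(rows: Sequence[Row], n_nodes: int) -> list[list[Row]]:
    """Send the i-th tuple to node i mod n_nodes (even spread, no key locality)."""
    parts: list[list[Row]] = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return parts


def hash_partition(rows: Sequence[Row], key: str, n_nodes: int) -> list[list[Row]]:
    """Send each tuple to the node chosen by hashing its partitioning attribute.

    Assumption: Python's hash() stands in for a stable hash function. It is
    salted per process for strings, so a real system would use a fixed hash.
    """
    parts: list[list[Row]] = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[hash(row[key]) % n_nodes].append(row)
    return parts


def range_partition(rows: Sequence[Row], key: str, boundaries: Sequence[Any]) -> list[list[Row]]:
    """Send each tuple to the node whose key range contains its attribute value.

    `boundaries` must be sorted; node j holds keys k with
    boundaries[j-1] <= k < boundaries[j], giving len(boundaries) + 1 nodes.
    """
    parts: list[list[Row]] = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts


def partitioned_hash_join(r: Sequence[Row], s: Sequence[Row], key: str, n_nodes: int) -> list[Row]:
    """Equi-join r and s by hash-partitioning both on the join attribute.

    Matching tuples always land in the same partition, so each pair of
    partitions can be joined independently (here in a loop; on a cluster,
    one pair per node).
    """
    out: list[Row] = []
    for r_part, s_part in zip(hash_partition(r, key, n_nodes),
                              hash_partition(s, key, n_nodes)):
        table: dict[Any, list[Row]] = {}
        for row in r_part:  # build phase on the local fragment of r
            table.setdefault(row[key], []).append(row)
        for row in s_part:  # probe phase on the local fragment of s
            for match in table.get(row[key], []):
                out.append({**match, **row})
    return out
```

Because both relations are split with the same hash function on the join attribute, no partition ever needs tuples held by another one, which is the property that lets this style of join scale by adding nodes.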

While never commercialized, the Gamma Project, through the numerous technical papers it produced, had a profound impact on the database field by demonstrating that scalable performance could be achieved without the use of specialized hardware. Nearly all of the industry's successful shared-nothing parallel database systems have used the ideas and algorithms developed and evaluated as part of the Gamma project. Such systems include IBM DB2 Parallel Edition, Informix Version 8, Tandem NonStop SQL, Vertica, Netezza, DATAllegro (Microsoft), Greenplum, Aster Data, and ParAccel. Most of these products are directed at the data warehouse and business intelligence market, estimated by the Gartner Group at $30B per year. In addition, the parallel, partitioned, dataflow computational pattern that made Gamma so scalable is currently undergoing a resurgence in the context of general-purpose, scalable, data-intensive computing frameworks like MapReduce from Google and Hadoop from Yahoo.