|
|
|
|
|
Algorithmic Aspects of Parallel Data Processing
Author(s): Paraschos Koutris;Semih Salihoglu;Dan Suciu
Source: Journal:Foundations and Trends® in Databases ISSN Print:1931-7883, ISSN Online:1931-7891 Publisher:Now Publishers Volume 8 Number 4, Pages: 135(239-370) DOI: 10.1561/1900000055
Abstract:
In the last decade or so we have witnessed a growing interest in processing
large data sets on large distributed clusters. The idea was pioneered
by the MapReduce framework, and has been widely adopted by several
other systems, including PigLatin, Hive, Scope, U-SQL, Dremmel,
Spark and Myria. A large part of the complex data analysis performed
by these systems consists of a sequence of relatively simple query operations,
such as joining two or more tables. This survey discusses recent
algorithmic developments for distributed data processing. It uses
a theoretical model of parallel processing called the Massively Parallel
Computation (MPC) model, which is a simplification of the BSP
model where the only cost is given by the amount of communication
and the number of communication rounds. The survey studies several
algorithms for multi-join queries, for sorting, and for matrix multiplication,
and discusses their relationships and common techniques applied
across the different data processing tasks.
|
|
|
|