MapReduce

MapReduce is a model for processing and analyzing of large datasets across distributed software. It was introduced by Google in 2004 in a paper titled "MapReduce: Simplified Data Processing on Large Clusters". The term "MapReduce" originally referred to Google’s proprietary technology but it has since been generalized to refer to any technologies that implement the original model.

MapReduce is a framework for parallel processing of large datasets across a large number of computers (nodes), collectively referred to as a cluster (if all nodes are in the same local network) or a grid (if the nodes are shared across geographically- and administratively-distributed systems). Processing can be done on data stored in a filesystem or a database.

The programming interface was inspired by the map and reduce functions commonly used in functional programming. In MapReduce, the Map function processes input data and writes the output to temporary storage, while the Reduce function sees worker nodes process each group of output data in parallel. MapReduce adds another main function, Shuffle, which sees worker nodes redistribute data based on the output keys produced by the Map function, such that all data belonging to one key is located on the same worker node. There are other functions, too, but those are main three.

Besides the programming model, MapReduce also specifies an infrastructure system that handles the distribution of tasks, manages data transfers, and provides redundancy and fault tolerance.

Google had a proprietary implementation of MapReduce, which it used to rebuild its search index of the entire World Wide Web. As of 2014, Google is no longer using MapReduce as its primary big data processing model, having moved on to other technologies that offer streaming operations instead of batch processing, allowing integration of "live" search results without needing to rebuild the whole index.

Open-source MapReduce implementations have been written in many programming languages, with varying levels of optimization. The main open-source implementation of MapReduce is Apache Hadoop. Hadoop is a suite of technologies that includes the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce engine.

Apache CouchDB is another implementation. Apache Spark also includes MapReduce-like capabilities, but it has evolved the original MapReduce framework to provide greater flexibility and faster processing.