So far, the discussion on distributed systems has been limited to data storage and to a few data management primitives (e.g., write(), read(), and search()). For real applications, one also needs to develop and execute more complex programs that process the available datasets and effectively exploit the available resources.
The naive approach, which consists in fetching all the required data at the Client and applying some processing locally, often loses out in a distributed setting. First, some processing capabilities may not be available locally. Moreover, centralizing all the information and then processing it would simply forgo the advantages brought by a powerful cluster of hundreds or even thousands of machines. We have to distribute the computation. One can consider two main scenarios for data processing in distributed systems.
Distributed processing and workflow: In the first scenario, an application holds large datasets and needs to apply to them some processing that is available on remote sites. In that case, the problem is to send the data to the appropriate locations and then to sequence the remote executions. This workflow scenario is typically implemented using Web services and some high-level coordination language.
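To make this concrete, here is a minimal Python sketch of such a sequenced invocation. The endpoints, payload format, and two-step pipeline are purely hypothetical; a real deployment would rely on a dedicated coordination language or workflow engine rather than hand-written calls.

```python
import requests

# Hypothetical endpoints of two remote services hosting the processing steps.
ANNOTATE_URL = "http://site-a.example.org/annotate"
AGGREGATE_URL = "http://site-b.example.org/aggregate"

def run_workflow(dataset: bytes) -> dict:
    # Step 1: ship the data to the first remote site and run its service there.
    step1 = requests.post(ANNOTATE_URL, data=dataset, timeout=60)
    step1.raise_for_status()

    # Step 2: forward the intermediate result to the second site, sequencing
    # the remote executions into a simple two-stage workflow.
    step2 = requests.post(AGGREGATE_URL, json=step1.json(), timeout=60)
    step2.raise_for_status()
    return step2.json()
```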
Distributed data and MapReduce: In the second scenario, the datasets are already distributed over a number of servers and, in contrast to the previous scenario, we “push” programs to these servers. Indeed, due to network bandwidth constraints, it is often more cost-effective to send a small piece of program from the Client to the servers than to transfer large data volumes to a single Client.
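The following sketch illustrates the idea under simplifying assumptions: the "program" pushed to the servers is just a pair of small functions (a mapper and a reducer, in the word-count style often used to introduce MapReduce), and the distributed partitions are simulated by local lists.

```python
from collections import defaultdict

# The small piece of program shipped to the servers holding the data:
def map_fn(document: str):
    # Emit (word, 1) pairs for each word of a document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word: str, counts):
    # Merge all counts emitted for a given word.
    return (word, sum(counts))

# Hypothetical data already partitioned across three servers (simulated here).
partitions = [
    ["the web is large", "data on the web"],
    ["distributed data management"],
    ["large scale data processing"],
]

# Each server applies map_fn locally to its own partition, so only the small
# intermediate pairs, not the raw data, need to travel over the network.
intermediate = defaultdict(list)
for partition in partitions:
    for doc in partition:
        for word, count in map_fn(doc):
            intermediate[word].append(count)

# The reduce phase merges the per-word counts into the final result.
result = dict(reduce_fn(w, c) for w, c in intermediate.items())
print(result)
```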