This paper introduces Spark as a framework to run large-scale, data-intensive workloads on commodity hardware. It is especially useful for workflows that are currently difficult to implement or expensive to run on Hadoop MapReduce [ref:H_MAPREDUCE].
MapReduce [ref:MapReduce] is not well suited for iterative jobs and interactive analytics because it forces the materialization of results between jobs (generally to HDFS [ref:H_HDFS]). In iterative jobs, keeping the data in RAM between iterations results in a significant speedup. For interactive jobs, consecutive ad-hoc queries can often be answered from the same working set of data, which could be kept in memory between queries (MapReduce forces every query to read and write data to HDFS).
The main concepts that Spark introduces are resilient distributed datasets (RDDs) and lineage. RDDs are collections of read-only objects partitioned over a set of machines. They restrict data transformations to a fixed set of operations (map, filter, sample, reduce, ...) and track how the partition sets relate to each other through their lineage. If a partition is lost, the lineage allows Spark to efficiently re-compute the data of that partition.
RDDs are manipulated as a handle representing the state of data from its initial value through a sequence of transformations. Assuming input that does not change and no randomness in the process, you can consider an RDD handle a representation of constant data at a specific point in a sequence of transformations. For example, the input can be an HDFS file containing a list of integers, and the lineage the sequence of operations "add one to every number, then square it". Even if you lose data during the process, as long as you can recover the input file and you have the lineage, you can re-create the data at any point in the sequence of transformations.
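To make this concrete, here is a minimal pure-Python sketch (a toy, not Spark's actual API) of recovering lost data by replaying a recorded lineage against the original input:

```python
# Toy sketch: a lineage is just the recorded sequence of transformations.
lineage = [
    lambda xs: [x + 1 for x in xs],   # add one to every number
    lambda xs: [x * x for x in xs],   # then square it
]

def recompute(source, lineage):
    """Replay the recorded transformations against the source data."""
    data = source
    for step in lineage:
        data = step(data)
    return data

# If the in-memory copy of the derived data is lost, it can be rebuilt
# deterministically from the recoverable input [1, 2, 3]:
print(recompute([1, 2, 3], lineage))  # [4, 9, 16]
```

Because the input is constant and each step is deterministic, the result of the replay is always the same, which is what makes lineage-based recovery safe.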
RDDs are meant to be used in the context of distributed computations. The basic data structure and the tools to express operations are parallelizable: the input data is expected to be partitioned and sent to separate nodes for further computation. This does limit the possible operations and forces you to think inside the framework they provide. If an RDD is split across two nodes, multiplying the first item of the collection by the last item would require merging the data from those two nodes to make the operation possible.
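A toy Python sketch of that distinction, using plain lists to stand in for partitions on different nodes (hypothetical names, not Spark code):

```python
# Pretend the dataset is split across two "nodes".
data = list(range(10))
partitions = [data[:5], data[5:]]

# Per-partition work needs no communication: each node maps locally.
squared = [[x * x for x in part] for part in partitions]

# But an operation spanning partitions (first item times last item)
# requires bringing values from both nodes together first:
first = partitions[0][0]
last = partitions[-1][-1]
print(first * last)  # 0 * 9 = 0
```

Operations like map fit the framework naturally because they stay within a partition; cross-partition operations force data movement, which is exactly what the restricted set of transformations is designed to keep explicit.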
In order to speed up operations, you can give persistence hints: cache and save. Cache will keep the data in memory after its first computation, and save will write the intermediate result to HDFS. Note that these are only hints: in some circumstances (such as a lack of resources) Spark will ignore them.
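The cache hint amounts to "keep this if you can, recompute otherwise". A pure-Python sketch of that best-effort behavior (hypothetical code, not Spark's implementation):

```python
# Best-effort cache: reuse a kept result, or fall back to recomputation.
cache = {}

def get_partition(key, source, transform):
    if key in cache:              # hit: reuse the in-memory result
        return cache[key]
    result = transform(source)    # miss: recompute from the source data
    cache[key] = result           # best effort: an entry may be dropped
    return result                 # later, e.g. under memory pressure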
The operations that can be done on RDDs are the usual suspects from functional programming: map(), filter(), reduce() and the like. Once the sequence of transformations is defined, an action such as collect() or reduce() returns the result to the driver program. One example (slightly simplified here) given in the paper is:
spark.textFile("...file path...").filter(_.contains("ERROR")).map(_ => 1).reduce(_ + _)
Note that here `_` refers to Scala syntax, where `_` is a placeholder. For this example, you can just think of `_` as a variable such as x, y or z in other languages.
If we unpack this, the text file is read (different chunks on different nodes) and split into lines. Then, filter keeps only the lines which contain the string "ERROR". map(_ => 1) transforms each of those lines into the integer 1, and reduce(_ + _) adds all those integers together. The output is the number of lines in the input file that contained at least one "ERROR" string.
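The same pipeline can be sketched in pure Python over an in-memory list of lines (an analogue of the example, not Spark itself; the sample log lines are made up):

```python
from functools import reduce

# Hypothetical log lines standing in for the distributed text file.
lines = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]

count = reduce(
    lambda a, b: a + b,                        # reduce(_ + _)
    map(lambda _line: 1,                       # map(_ => 1)
        filter(lambda line: "ERROR" in line,   # filter(_.contains("ERROR"))
               lines)),
)
print(count)  # 2
```

The structure is identical; what Spark adds is that each stage runs over partitions spread across a cluster instead of a single list.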
The powerful concept here is that Spark allows you to express a sequence of transformations just as you would in a functional language, but it will work on an arbitrarily large input dataset, split it across nodes to efficiently do the computation in parallel, automatically recover from any crashing nodes along the way, and produce the final output.
This is a very interesting paper that introduces the fundamental building blocks of Spark. However, it is pretty thin on details and is a snapshot of what Spark was in 2010. I recommend reading "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", which came out in 2012: it is more detailed, contains more complete examples, nice graphics, and a nice table of the transformations available on RDDs.