I have recently digged into Spark trying to understand its internals. I found some excellent interesting papers especially on RDD (Resilient Distributed Dataset). For example:
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf (Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing)
For some of us who are more familiar with Hadoop disk based distributed system might also want to read up on distributed shared memory (DSM) to gain some basic understandings of various related concepts.
http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf (Distributed Shared Memory: A survey of issues and algorithms) This is a good overview of DSM. It talks about memory coherence and coherence protocol.
http://hal.archives-ouvertes.fr/docs/00/07/41/93/PDF/RR-2481.pdf (A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability)
RDD is different from DSM in the following aspects as pointed out in the above paper. See the table below. RDD is a restricted form of distributed shared memory. It is immutable, partitioned collections of records. It maintains lineage information (a series of transformations) for efficient fault recovery.
Table 1 taken from the 1st paper above
To be continued…..