Shuffling in Spark vs Hadoop MapReduce

In the last few posts about the Hadoop MapTask spill mechanism, we learned that Hadoop uses an in-memory buffer during the map task's intermediate output writing phase. When the memory soft limit is exceeded, it starts spilling the data to disk. This results in multiple spill files that are eventually merged into one big file at the end of the map task. During the spilling process, the data in the memory buffer is first sorted by partition (each partition corresponds to one Reduce Task) and then by key.
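To make that ordering concrete, here is a minimal sketch in plain Scala (not Hadoop's actual code; the Record class and sample values are made up) of how spilled records are compared: first by their target partition, then by key.

```scala
// Hypothetical record type: target partition plus key/value pair
case class Record(partition: Int, key: String, value: String)

// Spill ordering: compare by partition first, then by key within a partition
val spillOrdering: Ordering[Record] =
  Ordering.by(r => (r.partition, r.key))

val buffer = Seq(
  Record(1, "b", "v1"), Record(0, "c", "v2"),
  Record(0, "a", "v3"), Record(1, "a", "v4"))

val spilled = buffer.sorted(spillOrdering)
// spilled: all of partition 0 (keys a, c) first, then partition 1 (keys a, b)
```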

As of Spark 1.0.1, Spark Map Tasks write their output directly to disk on completion; there is no Hadoop-style in-memory buffer with spilling. Each Map Task writes as many shuffle files as there are Reduce Tasks, one shuffle file per Reduce Task. For example, with 1,000 Map Tasks (M) and 5,000 Reduce Tasks (R), this results in M × R = 5 million shuffle files. Unlike Hadoop MapReduce, Spark does not merge them into a single partitioned shuffle file. The sheer volume of file I/O this generates can noticeably hurt performance.
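The following sketch illustrates the idea behind this hash-based shuffle. It is a simplification, not Spark's actual implementation: the file naming scheme, the plain-text output, and the simple hash partitioner are all assumptions made for illustration. The point is that every Map Task keeps one open writer per Reduce Task.

```scala
import java.io.{File, PrintWriter}

// Hash-based shuffle sketch: one output file (and one open stream)
// per Reduce Task, for every running Map Task.
def writeMapOutput(mapId: Int, numReducers: Int,
                   records: Iterator[(String, String)]): Unit = {
  val writers = Array.tabulate(numReducers) { r =>
    new PrintWriter(new File(s"shuffle_${mapId}_$r"))
  }
  for ((key, value) <- records) {
    // Non-negative modulo so negative hash codes map to a valid partition
    val partition = (key.hashCode % numReducers + numReducers) % numReducers
    writers(partition).println(s"$key\t$value")
  }
  writers.foreach(_.close())
}
```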

You can learn more about Spark shuffling from the following report.

http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

The report also points out a potential memory issue when compression is used on the Map output files.

For example, on a machine with C cores allocated to Spark, there are C concurrent Map Tasks, each writing out R shuffle files, one per Reduce Task. Since every open compressed stream holds its own buffer of roughly 400 KB, the total memory usage would be C × R × 400 KB.
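Plugging in some hypothetical numbers shows how quickly this adds up. C = 16 cores is an assumed value; R = 5,000 matches the earlier example, and 400 KB is the per-stream buffer size cited above.

```scala
val cores    = 16     // C: cores allocated to Spark (hypothetical)
val reducers = 5000   // R: number of Reduce Tasks (from the earlier example)
val bufferKB = 400    // per-stream compression buffer size

val totalGB = cores.toLong * reducers * bufferKB / (1024.0 * 1024.0)
println(f"~$totalGB%.1f GB held in compression buffers")  // ~30.5 GB
```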

Here comes the good news: a new shuffle mechanism, called sort-based shuffle, has been implemented for the upcoming Spark 1.1.0 release. You can read the design document below and learn more in the SPARK-2045 JIRA ticket.

https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf

This sort-based shuffle is similar to, but not the same as, Hadoop MapReduce shuffling. Per the design document, each Map Task sorts its records by partition ID only (not necessarily by key within a partition) and writes a single output file, along with an index that tells each Reduce Task where its segment starts.
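Here is a minimal sketch of that idea, again not the actual Spark 1.1 code: the in-memory Vector stands in for the single on-disk data file, and the offsets array stands in for the index file.

```scala
// Sort-based shuffle sketch: sort buffered records by partition ID only,
// produce one data "file" and an index of segment offsets per partition.
def writeSorted(numPartitions: Int,
                records: Seq[(Int, String)]): (Vector[String], Array[Int]) = {
  val sorted  = records.sortBy(_._1)        // by partition ID, not by key
  val data    = sorted.map(_._2).toVector   // the single output file
  val offsets = new Array[Int](numPartitions + 1)
  for (p <- 1 to numPartitions)
    offsets(p) = sorted.count(_._1 < p)     // where partition p's segment starts
  (data, offsets)
}

// Reduce Task r reads only its slice: data.slice(offsets(r), offsets(r + 1))
```

With this layout a Map Task needs only one open file instead of R, which is exactly what eliminates the per-stream compression buffers described earlier.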

The following is the description of SPARK-2045:

“…a sort-based shuffle implementation that takes advantage of an Ordering for keys (or just sorts by hashcode for keys that don’t have it) would likely improve performance and memory usage in very large shuffles. Our current hash-based shuffle needs an open file for each reduce task, which can fill up a lot of memory for compression buffers and cause inefficient IO. This would avoid both of those issues.”
