Cache Management in HDFS

The popularity of Spark has spurred  great interest in providing memory cache capability in HDFS. With the new release of Hadoop 2.3.0, centralized cache management has been introduced.

See the following JIRA.

https://issues.apache.org/jira/browse/HDFS-4949

You can view the design doc here. It is an excellent read. Kudos to both Andrew Wang and Colin Patrick McCabe. They are the instrumental and key contributors of this feature.

This cache memory feature allows memory local HDFS clients to use zero copy read (mlock native calls) to access these memory cached data to further speed up read operation. The mlock and mlockall native calls tell the system to lock to a specified memory range. This prevents the allocated memory from page fault.

You can learn about mlock from the following page.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/sect-Realtime_Reference_Guide-Memory_allocation-Using_mlock_to_avoid_memory_faults.html

You can also find the Java native IO mlock and munlock call to memory mapped data in org.apache.hadoop.io.nativeio.NativeIO.java

https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java

and the corresponding C native code
https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c

/**
* Used to manipulate the operating system cache.
*/
@VisibleForTesting
public static class CacheManipulator {
public void mlock(String identifier, ByteBuffer buffer,
long len) throws IOException {
POSIX.mlock(buffer, len);
}

public long getMemlockLimit() {
return NativeIO.getMemlockLimit();
}

public long getOperatingSystemPageSize() {
return NativeIO.getOperatingSystemPageSize();
}

public void posixFadviseIfPossible(String identifier,
FileDescriptor fd, long offset, long len, int flags)
throws NativeIOException {
NativeIO.POSIX.posixFadviseIfPossible(identifier, fd, offset,
len, flags);
}

public boolean verifyCanMlock() {
return NativeIO.isAvailable();
}
}

NameNode has to keep track of the memory cache replicas at different DataNodes with this feature. The cache blocks infos are exchanged between them via heartbeat protocol. As a result, additional metadata e.g.. cachedHosts array is introduced in BlockLocation as seen below.


public class BlockLocation {
  private String[] hosts; // Datanode hostnames
  <strong>private String[] cachedHosts; // Datanode hostnames with a cached replica</strong>
  private String[] names; // Datanode IP:xferPort for accessing the block
  private String[] topologyPaths; // Full path name in network topology
  private long offset;  // Offset of the block in the file
  private long length;
  private boolean corrupt;

Stay tuned for more infos…