Mahout Distance Measure

In Mahout, several distance measure classes are defined in the org.apache.mahout.common.distance package, among them SquaredEuclideanDistanceMeasure, EuclideanDistanceMeasure, CosineDistanceMeasure, MahalanobisDistanceMeasure, and ManhattanDistanceMeasure. These classes all implement the DistanceMeasure interface.

There is also support for weighted versions of DistanceMeasure, e.g. WeightedEuclideanDistanceMeasure and WeightedManhattanDistanceMeasure. Both extend the abstract class WeightedDistanceMeasure.

[Class diagram: DistanceMeasure]

[Class diagram: WeightedDistanceMeasure]
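
Here is a minimal sketch of using one of these measures to compute the distance between two vectors (the vector values below are arbitrary; any DistanceMeasure implementation can be swapped in):

import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceExample {
  public static void main(String[] args) {
    // Two arbitrary 3-dimensional points
    Vector v1 = new DenseVector(new double[] {1.0, 2.0, 3.0});
    Vector v2 = new DenseVector(new double[] {4.0, 5.0, 6.0});

    // Any implementation of DistanceMeasure can be used here,
    // e.g. CosineDistanceMeasure or ManhattanDistanceMeasure
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    System.out.println(measure.distance(v1, v2));
  }
}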

HBase 0.96 API change

I was coding a DAO for HBase 0.96 the other day and noticed some API changes from 0.94 to 0.96.

A new Cell interface has been introduced, and KeyValue no longer implements the Writable interface; instead it implements Cell. With protobuf as the wire format, Scan, Get, and Put are also no longer required to implement Writable. They extend the abstract class OperationWithAttributes, which extends another abstract class, Operation, and implements the Attributes interface. The abstract class Operation mainly provides JSON conversion for logging/debugging purposes. The Attributes interface defines methods for setting and maintaining a Map of attributes on the Operation.
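
For example, the attribute methods defined by the Attributes interface are available on Put, Get, and Scan. A minimal sketch (the attribute name and value below are made up):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Put put = new Put(Bytes.toBytes("row1"));
// setAttribute/getAttribute are inherited from OperationWithAttributes
put.setAttribute("myAttribute", Bytes.toBytes("someValue"));
byte[] attr = put.getAttribute("myAttribute");

For reference, here is the new Cell interface: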


public interface Cell {

  //1) Row

  /**
   * Contiguous raw bytes that may start at any index in the containing array. Max length is
   * Short.MAX_VALUE which is 32,767 bytes.
   * @return The array containing the row bytes.
   */
  byte[] getRowArray();

  /**
   * @return Array index of first row byte
   */
  int getRowOffset();

  /**
   * @return Number of row bytes. Must be < rowArray.length - offset.
   */
  short getRowLength();


  //2) Family

  /**
   * Contiguous bytes composed of legal HDFS filename characters which may start at any index in the
   * containing array. Max length is Byte.MAX_VALUE, which is 127 bytes.
   * @return the array containing the family bytes.
   */
  byte[] getFamilyArray();

  /**
   * @return Array index of first family byte
   */
  int getFamilyOffset();

  /**
   * @return Number of family bytes.  Must be < familyArray.length - offset.
   */
  byte getFamilyLength();


  //3) Qualifier

  /**
   * Contiguous raw bytes that may start at any index in the containing array. Max length is
   * Short.MAX_VALUE which is 32,767 bytes.
   * @return The array containing the qualifier bytes.
   */
  byte[] getQualifierArray();

  /**
   * @return Array index of first qualifier byte
   */
  int getQualifierOffset();

  /**
   * @return Number of qualifier bytes.  Must be < qualifierArray.length - offset.
   */
  int getQualifierLength();


  //4) Timestamp

  /**
   * @return Long value representing time at which this cell was "Put" into the row.  Typically
   * represents the time of insertion, but can be any value from 0 to Long.MAX_VALUE.
   */
  long getTimestamp();


  //5) Type

  /**
   * @return The byte representation of the KeyValue.TYPE of this cell: one of Put, Delete, etc
   */
  byte getTypeByte();


  //6) MvccVersion

  /**
   * Internal use only. A region-specific sequence ID given to each operation. It always exists for
   * cells in the memstore but is not retained forever. It may survive several flushes, but
   * generally becomes irrelevant after the cell's row is no longer involved in any operations that
   * require strict consistency.
   * @return mvccVersion (always >= 0 if exists), or 0 if it no longer exists
   */
  long getMvccVersion();


  //7) Value

  /**
   * Contiguous raw bytes that may start at any index in the containing array. Max length is
   * Integer.MAX_VALUE which is 2,147,483,647 bytes.
   * @return The array containing the value bytes.
   */
  byte[] getValueArray();

  /**
   * @return Array index of first value byte
   */
  int getValueOffset();

  /**
   * @return Number of value bytes.  Must be < valueArray.length - offset.
   */
  int getValueLength();
  
  /**
   * @return the tags byte array
   */
  byte[] getTagsArray();

  /**
   * @return the first offset where the tags start in the Cell
   */
  int getTagsOffset();
  
  /**
   * @return the total length of the tags in the Cell.
   */
  short getTagsLength();
  
  /**
   * WARNING do not use, expensive.  This gets an arraycopy of the cell's value.
   *
   * Added to ease transition from  0.94 -> 0.96.
   * 
   * @deprecated as of 0.96, use {@link CellUtil#cloneValue(Cell)}
   */
  @Deprecated
  byte[] getValue();
  
  /**
   * WARNING do not use, expensive.  This gets an arraycopy of the cell's family. 
   *
   * Added to ease transition from  0.94 -> 0.96.
   * 
   * @deprecated as of 0.96, use {@link CellUtil#cloneFamily(Cell)}
   */
  @Deprecated
  byte[] getFamily();

  /**
   * WARNING do not use, expensive.  This gets an arraycopy of the cell's qualifier.
   *
   * Added to ease transition from  0.94 -> 0.96.
   * 
   * @deprecated as of 0.96, use {@link CellUtil#cloneQualifier(Cell)}
   */
  @Deprecated
  byte[] getQualifier();

  /**
   * WARNING do not use, expensive.  this gets an arraycopy of the cell's row.
   *
   * Added to ease transition from  0.94 -> 0.96.
   * 
   * @deprecated as of 0.96, use {@link CellUtil#getRowByte(Cell, int)}
   */
  @Deprecated
  byte[] getRow();
}

Example: 0.96

for (Cell cell : result.rawCells()) {

}

vs 0.94

for (KeyValue kv : result.raw()) {

}
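
In 0.96 you should avoid the deprecated getRow()/getFamily()/getQualifier()/getValue() accessors and use CellUtil instead. A minimal sketch of reading cell contents inside the loop above (the printing is just for illustration):

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.util.Bytes;

for (Cell cell : result.rawCells()) {
  byte[] row = CellUtil.cloneRow(cell);
  byte[] family = CellUtil.cloneFamily(cell);
  byte[] qualifier = CellUtil.cloneQualifier(cell);
  byte[] value = CellUtil.cloneValue(cell);
  System.out.println(Bytes.toString(row) + "/" + Bytes.toString(family) + ":"
      + Bytes.toString(qualifier) + " = " + Bytes.toString(value));
}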

0.96

ArrayList<Cell> list = new ArrayList<Cell>();
Result result = Result.create(list);

vs 0.94

ArrayList<KeyValue> list = new ArrayList<KeyValue>();
list.add(keyValue);
Result result = new Result(list);
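
Note that KeyValue itself still exists in 0.96 and implements Cell, so the list can still be populated with KeyValue instances. A minimal sketch (the row/column names are made up):

import java.util.ArrayList;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

ArrayList<Cell> list = new ArrayList<Cell>();
list.add(new KeyValue(Bytes.toBytes("row1"), Bytes.toBytes("cf"),
    Bytes.toBytes("qual"), Bytes.toBytes("value")));
Result result = Result.create(list);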


Different Maven Profiles for Hadoop 1 and Hadoop 2

Since there are different versions of Hadoop, sometimes we have to compile/deploy our code against one but not the other. The ideal way is to set up profiles in Maven to compile and create the artifacts accordingly.

Here is the profiles section defined in the HBase project. You can create the same profiles in your own pom.

To compile against one of these profiles, pass the hadoop.profile property:
mvn -Dhadoop.profile=1.1 clean compile

When hadoop.profile is not set, the hadoop-2.0 profile is activated by default (note its !hadoop.profile activation below).

<profiles>
		<profile>
			<id>hadoop-1.1</id>
			<activation>
				<property>
					<name>hadoop.profile</name>
					<value>1.1</value>
				</property>
			</activation>
			<dependencies>
				<dependency>
					<groupId>org.apache.hadoop</groupId>
					<artifactId>hadoop-core</artifactId>
					<version>${hadoop-1.version}</version>
				</dependency>
				<dependency>
					<groupId>org.apache.hadoop</groupId>
					<artifactId>hadoop-test</artifactId>
					<version>${hadoop-1.version}</version>
				</dependency>
			</dependencies>
		</profile>
		<profile>
			<id>hadoop-1.0</id>
			<activation>
				<property>
					<name>hadoop.profile</name>
					<value>1.0</value>
				</property>
			</activation>
			<dependencies>
				<dependency>
					<groupId>org.apache.hadoop</groupId>
					<artifactId>hadoop-core</artifactId>
					<version>${hadoop-1.version}</version>
				</dependency>
				<dependency>
					<groupId>org.apache.hadoop</groupId>
					<artifactId>hadoop-test</artifactId>
					<version>${hadoop-1.version}</version>
				</dependency>
			</dependencies>
		</profile>
		<profile>
			<id>hadoop-2.0</id>
			<activation>
				<property>
					<name>!hadoop.profile</name>
				</property>
			</activation>
			<dependencyManagement>
				<dependencies>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-mapreduce-client-core</artifactId>
						<version>${hadoop-2.version}</version>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
						<version>${hadoop-2.version}</version>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
						<version>${hadoop-2.version}</version>
						<type>test-jar</type>
						<scope>test</scope>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-hdfs</artifactId>
						<exclusions>
							<exclusion>
								<groupId>javax.servlet.jsp</groupId>
								<artifactId>jsp-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>javax.servlet</groupId>
								<artifactId>servlet-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>stax</groupId>
								<artifactId>stax-api</artifactId>
							</exclusion>
						</exclusions>
						<version>${hadoop-2.version}</version>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-hdfs</artifactId>
						<version>${hadoop-2.version}</version>
						<type>test-jar</type>
						<scope>test</scope>
						<exclusions>
							<exclusion>
								<groupId>javax.servlet.jsp</groupId>
								<artifactId>jsp-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>javax.servlet</groupId>
								<artifactId>servlet-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>stax</groupId>
								<artifactId>stax-api</artifactId>
							</exclusion>
						</exclusions>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-auth</artifactId>
						<version>${hadoop-2.version}</version>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-common</artifactId>
						<version>${hadoop-2.version}</version>
						<exclusions>
							<exclusion>
								<groupId>javax.servlet.jsp</groupId>
								<artifactId>jsp-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>javax.servlet</groupId>
								<artifactId>servlet-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>stax</groupId>
								<artifactId>stax-api</artifactId>
							</exclusion>
						</exclusions>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-client</artifactId>
						<version>${hadoop-2.version}</version>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-annotations</artifactId>
						<version>${hadoop-2.version}</version>
					</dependency>
					<!-- This was marked as test dep in earlier pom, but was scoped compile. 
						Where do we actually need it? -->
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-minicluster</artifactId>
						<version>${hadoop-2.version}</version>
						<exclusions>
							<exclusion>
								<groupId>javax.servlet.jsp</groupId>
								<artifactId>jsp-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>javax.servlet</groupId>
								<artifactId>servlet-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>stax</groupId>
								<artifactId>stax-api</artifactId>
							</exclusion>
						</exclusions>
					</dependency>
				</dependencies>
			</dependencyManagement>
		</profile>
	</profiles>
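
The profiles above reference the ${hadoop-1.version} and ${hadoop-2.version} properties, which must be defined in the properties section of your pom. A minimal sketch, with example version numbers only (use whatever releases you actually target):

	<properties>
		<!-- example values only -->
		<hadoop-1.version>1.1.2</hadoop-1.version>
		<hadoop-2.version>2.2.0</hadoop-2.version>
	</properties>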

YARN log files

Viewing MapReduce job log files used to be a pain. With YARN, you can enable log aggregation, which pulls together all the individual logs belonging to an MR job so that you can view the aggregated log with the following command:

yarn logs -applicationId <YOUR_APPLICATION_ID> | less

e.g.

yarn logs -applicationId application_1399561129519_4422 | less
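
If you do not know the application ID, you can list applications first:

yarn application -list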

You can enable log aggregation in yarn-site.xml as follows

cat /etc/hadoop/conf/yarn-site.xml

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

To see the list of running MapReduce jobs

mapred job -list

To check the status of a MapReduce job

mapred job -status <YOUR_JOB_ID>