Commands to Run MapReduce Job

Build your MapReduce job JAR with Ant first, using a build.xml like the following:

<project name="MyProject" basedir="." default="distr">

    <property name="Name" value="TSMapReduce"/>
    <property name="name" value="TSMapReduce"/>

    <property name="version" value="0.1.0"/>
    <property name="final.name" value="${name}-${version}"/>

    <property name="src.dir" value="src"/>
    <property name="lib.dir" value="lib"/>

    <property name="build.dir" value="build"/>
    <property name="jar.dir" value="${build.dir}"/>
    <property name="build.lib.dir" value="${build.dir}/lib"/>

    <property name="main-class" value="com.your_xxx_domain.timeseries.MyMapReduceJob"/>
    <property name="tmp.dir" value="tmp"/>

    <path id="classpath">
        <fileset dir="${lib.dir}" includes="**/*.jar"/>
    </path>

    <target name="clean">
        <delete dir="${build.dir}"/>
        <delete dir="${tmp.dir}"/>
    </target>

    <target name="compile">
        <mkdir dir="${tmp.dir}"/>
        <mkdir dir="${build.dir}"/>
        <javac srcdir="${src.dir}" destdir="${tmp.dir}" debug="on"
               classpathref="classpath"/>
    </target>

    <target name="distr" depends="jar">
        <mkdir dir="${build.lib.dir}"/>
        <copy todir="${build.lib.dir}">
            <fileset dir="${lib.dir}"/>
        </copy>
        <delete dir="${tmp.dir}"/>
    </target>

    <target name="jar" depends="clean, compile">
        <jar destfile="${build.dir}/${final.name}.jar" basedir="${tmp.dir}">
            <!--
            <manifest>
                <attribute name="Main-Class" value="${main-class}"/>
                <attribute name="Class-Path" value="classpath"/>
            </manifest>
            -->
            <fileset dir=".">
                <include name="**/${lib.dir}/**"/>
            </fileset>
        </jar>
    </target>

</project>

Then submit the job to Hadoop with the hadoop jar command:

./bin/hadoop jar YOUR_MAP_REDUCE_JOB.jar com.your_xxx_domain.path.to.your.class.MyMapReduceClass programArgs1 programArgs2
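For completeness, here is a rough sketch of what the job's driver class might look like. This is a hypothetical outline rather than code from this project: the class name, paths, and key/value types are placeholders, and it assumes the classic new Job(conf, name) constructor from the org.apache.hadoop.mapreduce API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver class; replace the identity Mapper/Reducer placeholders
// with your own implementations.
public class MyMapReduceJob extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "MyMapReduceJob");
    job.setJarByClass(MyMapReduceJob.class);
    job.setMapperClass(Mapper.class);              // placeholder: identity mapper
    job.setReducerClass(Reducer.class);            // placeholder: identity reducer
    job.setOutputKeyClass(LongWritable.class);     // matches the identity chain over text input
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // programArgs1: input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // programArgs2: output path
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyMapReduceJob(), args));
  }
}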

Mahout distance measure

All the distance measure classes are in the org.apache.mahout.common.distance package.

They all implement the DistanceMeasure interface, which in turn extends the Parametered interface.

There are two methods in the DistanceMeasure interface:

double distance(Vector v1, Vector v2)

double distance(double centroidLengthSquare, Vector centroid, Vector v)

You can create a new distance measure by implementing the DistanceMeasure interface 🙂
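As a hedged sketch of that, here is a minimal custom measure (plain L1 / Manhattan distance). It assumes the Mahout math Vector API (minus, norm); the class is deliberately declared abstract so the version-specific Parametered boilerplate (getParameters, createParameters, and so on) can be left out of the sketch — copy that part from an existing measure such as EuclideanDistanceMeasure in a real implementation.

import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

// Sketch of a custom distance measure computing the L1 (Manhattan) distance.
// Declared abstract so the version-specific Parametered boilerplate can be
// omitted here; a concrete class must also implement those methods.
public abstract class ManhattanLikeDistanceMeasure implements DistanceMeasure {

  @Override
  public double distance(Vector v1, Vector v2) {
    // norm(1.0) is the sum of absolute coordinate values, so this is the L1 distance
    return v1.minus(v2).norm(1.0);
  }

  @Override
  public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
    // The precomputed squared centroid length only helps Euclidean-style measures,
    // so simply fall back to the two-vector form.
    return distance(centroid, v);
  }
}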

Mahout clustering

I have been exploring the Mahout clustering packages. Here is a brief description of the code.

org.apache.mahout.clustering.kmeans.KMeansDriver is the main entry point. Its run method sets up the clustering job by specifying the input path to the data points, the path to the initial k clusters, and the distance measure class to use.

KMeansDriver also stores these settings in the Configuration as follows:

conf.set(KMeansConfigKeys.CLUSTER_PATH_KEY, clustersIn.toString());
conf.set(KMeansConfigKeys.DISTANCE_MEASURE_KEY, measureClass);
conf.set(KMeansConfigKeys.CLUSTER_CONVERGENCE_KEY, convergenceDelta);

KMeansConfigKeys is an interface that defines the configuration keys for these settings.
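To illustrate the round trip, the sketch below shows roughly how the mapper or reducer side can read those keys back out of the Configuration and instantiate the configured distance measure by reflection. This is a simplified illustration, not the actual KMeansMapper/KMeansReducer setup code.

import org.apache.hadoop.conf.Configuration;
import org.apache.mahout.clustering.kmeans.KMeansConfigKeys;
import org.apache.mahout.common.distance.DistanceMeasure;

// Simplified sketch: recover the settings that KMeansDriver placed in the
// Configuration. The real KMeansMapper/KMeansReducer setup code does more work.
public final class KMeansConfigReader {

  private KMeansConfigReader() {}

  public static DistanceMeasure loadMeasure(Configuration conf) throws Exception {
    // The cluster path and convergence delta are read the same way:
    // conf.get(KMeansConfigKeys.CLUSTER_PATH_KEY) and
    // conf.get(KMeansConfigKeys.CLUSTER_CONVERGENCE_KEY).
    String measureClassName = conf.get(KMeansConfigKeys.DISTANCE_MEASURE_KEY);
    return (DistanceMeasure) Class.forName(measureClassName).newInstance();
  }
}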

org.apache.mahout.clustering.kmeans.KMeansMapper loads the initial predefined clusters in its setup method. In the map method, it reads each VectorWritable and passes it to the KMeansClusterer to do the clustering, comparing it against the predefined clusters via the following call:

this.clusterer.emitPointToNearestCluster(point.get(), this.clusters, context);

In emitPointToNearestCluster, it finds the nearest cluster by comparing the point's distance to each cluster in the list. It then writes the cluster identifier as the key and the point, wrapped in a ClusterObservations, as the value:

context.write(new Text(nearestCluster.getIdentifier()), new ClusterObservations(1, point, point.times(point)));
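The following is a simplified illustration of that nearest-cluster search, written against only the DistanceMeasure API shown earlier; the real Mahout code works on Cluster objects and writes to the mapper context, whereas here the centers are plain Vectors.

import java.util.List;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

// Simplified illustration of the nearest-cluster search: compare the point
// against every cluster center and keep the index of the closest one.
public final class NearestClusterFinder {

  private NearestClusterFinder() {}

  public static int nearestClusterIndex(Vector point, List<Vector> centers, DistanceMeasure measure) {
    int nearest = -1;
    double best = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = measure.distance(centers.get(i), point);
      if (d < best) {
        best = d;
        nearest = i;
      }
    }
    return nearest;
  }
}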

In the KMeansReducer, the reduce method aggregates the list of ClusterObservations for each cluster into a Cluster before writing the cluster id as the key and the Cluster as the value.

In its setup method, KMeansReducer loads the predefined initial clusters from the HDFS path and builds a cluster map with the cluster id as the key and the Cluster itself as the value.
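To make the aggregation concrete, here is a simplified sketch of what combining the observations for one cluster amounts to: each observation contributes a count (s0 = 1) and the point itself (s1), and the updated centroid is the summed points divided by the count. This is an illustration only, not the actual KMeansReducer or ClusterObservations code.

import java.util.List;
import org.apache.mahout.math.Vector;

// Simplified illustration of the per-cluster aggregation: sum the points (s1),
// count them (s0), and take s1 / s0 as the new cluster center.
public final class ClusterCenterUpdater {

  private ClusterCenterUpdater() {}

  public static Vector newCenter(List<Vector> pointsInCluster) {
    Vector sum = null;   // s1: running sum of the observed points
    int count = 0;       // s0: number of observed points
    for (Vector point : pointsInCluster) {
      sum = (sum == null) ? point : sum.plus(point);
      count++;
    }
    // Assumes the cluster received at least one point.
    return sum.divide(count);
  }
}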

Different file formats used in Hadoop and HBase

I have been investigating the different file formats used in Hadoop and HBase to understand how they contribute to the speedups we have all seen in the Hadoop big-data world. I also recommend that every Java developer dig into the Hadoop and HBase source code; you will definitely learn a lot and improve your Java skills.

The file formats used in Hadoop include SequenceFile, TFile, and Avro data files, whereas HFile is used exclusively by HBase.
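As a quick hands-on illustration of the first of these, the sketch below writes a few key/value records to a SequenceFile and reads them back. It assumes the classic Hadoop 1.x createWriter and Reader constructors (newer releases prefer the Writer.Option/Reader.Option style), and demo.seq is just a hypothetical path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Writes a few key/value records to a SequenceFile and reads them back.
public class SequenceFileDemo {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("demo.seq");   // hypothetical output path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < 3; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    } finally {
      writer.close();
    }

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      IntWritable key = new IntWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + " -> " + value);
      }
    } finally {
      reader.close();
    }
  }
}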

I found an interesting and detailed explanation of the internal structure of the HFile format in the following blog post:

http://cloudepr.blogspot.com/2009/09/hfile-block-indexed-file-format-to.html

Enjoy

BigDataExplorer