RDD in Spark

I have recently been digging into Spark, trying to understand its internals. I found some excellent papers, especially on the RDD (Resilient Distributed Dataset) abstraction. For example:

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf (Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing)

http://www.cs.berkeley.edu/~matei/talks/2012/nsdi_rdds.pdf (Presentation)

Those of us who are more familiar with Hadoop's disk-based distributed processing might also want to read up on distributed shared memory (DSM) to gain a basic understanding of the related concepts.

http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf (Distributed Shared Memory: A survey of issues and algorithms) This is a good overview of DSM; it covers memory coherence and coherence protocols.

http://hal.archives-ouvertes.fr/docs/00/07/41/93/PDF/RR-2481.pdf (A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability)

As pointed out in the paper above, an RDD differs from DSM in several aspects; see the table below. An RDD is a restricted form of distributed shared memory: an immutable, partitioned collection of records. It maintains lineage information (the series of transformations that produced it) so that lost partitions can be recomputed efficiently after a failure.
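To make the lineage idea concrete, here is a minimal sketch of the log-mining example from the RDD paper, written against Spark's Java API (the input path and filter predicate are made up for illustration). Each transformation only records how its result is derived from its parent, so a lost partition of errors can be recomputed from lines after a failure.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class LineageExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("lineage-example"));

    // Base RDD: an immutable, partitioned collection of log lines (path is hypothetical)
    JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");

    // Transformation: only the lineage (a filter over 'lines') is recorded; nothing is computed yet
    JavaRDD<String> errors = lines.filter(new Function<String, Boolean>() {
      public Boolean call(String line) {
        return line.contains("ERROR");
      }
    });

    // Keep the derived dataset in memory across actions
    errors.cache();

    // Action: triggers evaluation; a lost partition of 'errors' is rebuilt by
    // re-reading and re-filtering the corresponding partition of 'lines'
    System.out.println("error lines: " + errors.count());

    sc.stop();
  }
}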

Table 1, taken from the first paper above:

[Table 1: comparison of RDDs with distributed shared memory]

To be continued…..

HyperLogLog vs LogLog

To understand HyperLogLog (HLL), one should read the LogLog and HyperLogLog papers.

http://algo.inria.fr/flajolet/Publications/DuFl03-LNCS.pdf  (LogLog)

http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf  (HyperLogLog)

The major difference between the two is that one uses the regular (arithmetic) mean while the other (HLL) uses the harmonic mean to average the cardinality estimates calculated from the m buckets. I would view these buckets as different experiments.

The accuracy (standard error) is controlled by the number of buckets, m: the more buckets, the better the estimate. HLL has better accuracy than LogLog for the same number of buckets.

The standard error for HLL is

1.04 / √m

The standard error for LogLog is

1.30 / √m
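For example, with m = 1024 buckets, LogLog gives a standard error of about 1.30/√1024 ≈ 4.1%, whereas HyperLogLog gives about 1.04/√1024 ≈ 3.3% using the same number of buckets.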

The following is the LogLog algorithm taken from the paper

[Figure: the LogLog algorithm, from the LogLog paper]

Basically, there are m buckets; the first few bits of each hashed value select a bucket, and each bucket keeps track of the maximum rank it has seen, where the rank is the position of the leftmost 1-bit in the remaining bits (i.e. the run of leading zeros plus one).

e.g. 1 => 1, 001 => 3

Then, using the cardinality estimate formula shown in the last line, we calculate the estimated unique count. As you can see from the formula, it takes the arithmetic mean of the maximum ranks observed across the buckets. (Note: alpha is a bias correction applied to the cardinality estimate.)

For HyperLogLog, a harmonic mean is used instead of the arithmetic mean, which makes the estimate less sensitive to outlier buckets (splitting the stream across the m buckets is what the papers call stochastic averaging). Please refer to the HLL paper for a more detailed description of the formula.

Taken from the HLL paper,

[Figure: the HyperLogLog estimator, from the HLL paper]
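To make the difference concrete, here is a minimal, self-contained Java sketch of both estimators. It is an illustration, not the exact algorithms from the papers: the bias-correction constants are the asymptotic values quoted in the papers, the small- and large-range corrections of HyperLogLog are omitted, and a splitmix64-style mixer stands in for a real hash function.

public class LogLogSketch {
  private final int k;          // number of index bits; m = 2^k buckets
  private final int m;
  private final int[] maxRank;  // per-bucket maximum position of the leftmost 1-bit

  public LogLogSketch(int k) {
    this.k = k;
    this.m = 1 << k;
    this.maxRank = new int[m];
  }

  public void add(long hash) {
    int bucket = (int) (hash >>> (64 - k));          // first k bits pick the bucket
    long rest = hash << k;                           // remaining 64 - k bits
    int rank = Long.numberOfLeadingZeros(rest) + 1;  // position of the leftmost 1-bit
    if (rank > maxRank[bucket]) {
      maxRank[bucket] = rank;
    }
  }

  // LogLog: arithmetic mean of the bucket ranks (a geometric mean of the 2^rank estimates)
  public double estimateLogLog() {
    double sum = 0;
    for (int r : maxRank) {
      sum += r;
    }
    double alpha = 0.39701;  // asymptotic bias-correction constant from the LogLog paper
    return alpha * m * Math.pow(2.0, sum / m);
  }

  // HyperLogLog: harmonic mean of the 2^rank estimates (raw estimate, no range corrections)
  public double estimateHyperLogLog() {
    double sum = 0;
    for (int r : maxRank) {
      sum += Math.pow(2.0, -r);
    }
    double alpha = 0.7213 / (1 + 1.079 / m);  // alpha_m from the HyperLogLog paper
    return alpha * m * m / sum;
  }

  // splitmix64 finalizer, used here only as a stand-in hash for the demo
  private static long mix64(long z) {
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
    return z ^ (z >>> 31);
  }

  public static void main(String[] args) {
    LogLogSketch sketch = new LogLogSketch(10);  // m = 1024 buckets
    for (long i = 1; i <= 1000000; i++) {
      sketch.add(mix64(i));
    }
    System.out.println("LogLog estimate:      " + sketch.estimateLogLog());
    System.out.println("HyperLogLog estimate: " + sketch.estimateHyperLogLog());
  }
}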

Spark on YARN

I tried out Spark on YARN recently. It is very straightforward. If you are planning to use MLlib, make sure you have gcc-gfortran installed on every cluster node.

1) First, download and install spark-1.0.0-bin-hadoop2:

http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.1/spark-1.0.1-bin-hadoop2.tgz

2) Set YARN_CONF_DIR to point to the location of your Hadoop configuration files.

e.g.

export YARN_CONF_DIR=/etc/hadoop/conf

You can now start the Spark shell as follows:

cd spark-1.0.0-bin-hadoop2

./bin/spark-shell --master yarn-client --driver-java-options "-Dspark.executor.memory=2g"
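To submit a standalone application to YARN instead of using the shell, spark-submit works the same way. For example, the bundled SparkPi example can be run as follows (the exact jar name under lib/ depends on the build you downloaded, so treat the path as an assumption):

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --executor-memory 2g lib/spark-examples*.jar 10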

HyperLogLog Approximate Distinct Counting

In MapReduce, computing the distinct/unique count of a large data set is very common, but the exact approach does not scale well because it funnels all values through a single reducer. It gets worse when you have to perform unique counts across different aggregation/segment groups. Counting uniques across different aggregation granularities, whether along the time dimension or in combination with other demographic attributes, is common practice in big data analytics, e.g. finding the unique count of users for every hour, day, week, and month in a massive data set.

HyperLogLog approximate unique counting provides a scalable alternative. Keep in mind that the result is a probabilistic approximation, not an exact count. You can implement HyperLogLog counting on either the map side or the reduce side. However, map-side counting requires more memory, so beware of out-of-memory issues if you have many segments to count uniques for. A reduce-side sketch follows.
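As a sketch of the reduce-side approach, the following assumes the open-source stream-lib HyperLogLog class (com.clearspring.analytics.stream.cardinality.HyperLogLog) and a mapper that emits (segment key, user id) pairs; the class name and parameter choices are illustrative, not from any particular codebase.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class HllUniqueCountReducer extends Reducer<Text, Text, Text, LongWritable> {

  @Override
  protected void reduce(Text segment, Iterable<Text> userIds, Context context)
      throws IOException, InterruptedException {
    // log2m = 14 gives 2^14 = 16384 registers, i.e. a standard error of roughly 1.04/sqrt(16384) ~ 0.8%
    HyperLogLog hll = new HyperLogLog(14);
    for (Text userId : userIds) {
      hll.offer(userId.toString());
    }
    // Emit the approximate number of distinct users for this segment
    context.write(segment, new LongWritable(hll.cardinality()));
  }
}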

For more information, you can check out the following pages

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

http://stefanheule.com/papers/edbt2013-hyperloglog.pdf

http://tech.adroll.com/media/hllminhash.pdf

http://www.looker.com/news/blog/practical-data-science-amazon-announces-hyperloglog

 

Mahout Distance Measure

In Mahout, there are different distance measure classes defined in the org.apache.mahout.common.distance package. Among them are SquaredEuclideanDistanceMeasure, EuclideanDistanceMeasure, CosineDistanceMeasure, MahalanobisDistanceMeasure, and ManhattanDistanceMeasure. These distance classes implement the DistanceMeasure interface.

There is also support for weighted versions of DistanceMeasure, e.g. WeightedEuclideanDistanceMeasure and WeightedManhattanDistanceMeasure. Both extend the abstract class WeightedDistanceMeasure.

[Class diagram: the DistanceMeasure hierarchy]

[Class diagram: the WeightedDistanceMeasure hierarchy]
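As a quick sketch of how these classes are used (the vectors are arbitrary, and distance(Vector, Vector) is the method defined by the DistanceMeasure interface):

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceMeasureExample {
  public static void main(String[] args) {
    Vector a = new DenseVector(new double[] {1.0, 2.0, 3.0});
    Vector b = new DenseVector(new double[] {4.0, 5.0, 6.0});

    DistanceMeasure euclidean = new EuclideanDistanceMeasure();
    DistanceMeasure cosine = new CosineDistanceMeasure();

    System.out.println("Euclidean: " + euclidean.distance(a, b));  // sqrt(27) ~ 5.196
    System.out.println("Cosine:    " + cosine.distance(a, b));     // 1 - cosine similarity of a and b
  }
}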

HBase 0.96 API change

I was coding a DAO for HBase 0.96 the other day and noticed some API changes from 0.94 to 0.96.

The main changes are the introduction of the new Cell interface and the fact that KeyValue no longer implements the Writable interface; instead it implements Cell. With protobuf as the wire format, Scan, Get, and Put are also no longer required to implement Writable. They extend the abstract class OperationWithAttributes, which extends another abstract class, Operation, and implements the Attributes interface. The abstract class Operation mainly provides JSON conversion for logging/debugging purposes. The Attributes interface defines the methods for setting and maintaining the map of attributes of an Operation.

 

public interface Cell {

  //1) Row

  /**
   * Contiguous raw bytes that may start at any index in the containing array. Max length is
   * Short.MAX_VALUE which is 32,767 bytes.
   * @return The array containing the row bytes.
   */
  byte[] getRowArray();

  /**
   * @return Array index of first row byte
   */
  int getRowOffset();

  /**
   * @return Number of row bytes. Must be < rowArray.length - offset.
   */
  short getRowLength();


  //2) Family

  /**
   * Contiguous bytes composed of legal HDFS filename characters which may start at any index in the
   * containing array. Max length is Byte.MAX_VALUE, which is 127 bytes.
   * @return the array containing the family bytes.
   */
  byte[] getFamilyArray();

  /**
   * @return Array index of first family byte
   */
  int getFamilyOffset();

  /**
   * @return Number of family bytes.  Must be < familyArray.length - offset.
   */
  byte getFamilyLength();


  //3) Qualifier

  /**
   * Contiguous raw bytes that may start at any index in the containing array. Max length is
   * Short.MAX_VALUE which is 32,767 bytes.
   * @return The array containing the qualifier bytes.
   */
  byte[] getQualifierArray();

  /**
   * @return Array index of first qualifier byte
   */
  int getQualifierOffset();

  /**
   * @return Number of qualifier bytes.  Must be < qualifierArray.length - offset.
   */
  int getQualifierLength();


  //4) Timestamp

  /**
   * @return Long value representing time at which this cell was "Put" into the row.  Typically
   * represents the time of insertion, but can be any value from 0 to Long.MAX_VALUE.
   */
  long getTimestamp();


  //5) Type

  /**
   * @return The byte representation of the KeyValue.TYPE of this cell: one of Put, Delete, etc
   */
  byte getTypeByte();


  //6) MvccVersion

  /**
   * Internal use only. A region-specific sequence ID given to each operation. It always exists for
   * cells in the memstore but is not retained forever. It may survive several flushes, but
   * generally becomes irrelevant after the cell's row is no longer involved in any operations that
   * require strict consistency.
   * @return mvccVersion (always >= 0 if exists), or 0 if it no longer exists
   */
  long getMvccVersion();


  //7) Value

  /**
   * Contiguous raw bytes that may start at any index in the containing array. Max length is
   * Integer.MAX_VALUE which is 2,147,483,647 bytes.
   * @return The array containing the value bytes.
   */
  byte[] getValueArray();

  /**
   * @return Array index of first value byte
   */
  int getValueOffset();

  /**
   * @return Number of value bytes.  Must be < valueArray.length - offset.
   */
  int getValueLength();
  
  /**
   * @return the tags byte array
   */
  byte[] getTagsArray();

  /**
   * @return the first offset where the tags start in the Cell
   */
  int getTagsOffset();
  
  /**
   * @return the total length of the tags in the Cell.
   */
  short getTagsLength();
  
  /**
   * WARNING do not use, expensive.  This gets an arraycopy of the cell's value.
   *
   * Added to ease transition from  0.94 -> 0.96.
   * 
   * @deprecated as of 0.96, use {@link CellUtil#cloneValue(Cell)}
   */
  @Deprecated
  byte[] getValue();
  
  /**
   * WARNING do not use, expensive.  This gets an arraycopy of the cell's family. 
   *
   * Added to ease transition from  0.94 -> 0.96.
   * 
   * @deprecated as of 0.96, use {@link CellUtil#cloneFamily(Cell)}
   */
  @Deprecated
  byte[] getFamily();

  /**
   * WARNING do not use, expensive.  This gets an arraycopy of the cell's qualifier.
   *
   * Added to ease transition from  0.94 -> 0.96.
   * 
   * @deprecated as of 0.96, use {@link CellUtil#cloneQualifier(Cell)}
   */
  @Deprecated
  byte[] getQualifier();

  /**
   * WARNING do not use, expensive.  this gets an arraycopy of the cell's row.
   *
   * Added to ease transition from  0.94 -> 0.96.
   * 
   * @deprecated as of 0.96, use {@link CellUtil#getRowByte(Cell, int)}
   */
  @Deprecated
  byte[] getRow();
}

Example: iterating over a Result in 0.96

for (Cell cell : result.rawCells()) {
  // work with each Cell
}

vs. 0.94

for (KeyValue kv : result.raw()) {
  // work with each KeyValue
}

Creating a Result in 0.96

ArrayList<Cell> list = new ArrayList<Cell>();
list.add(cell);
Result result = Result.create(list);

vs. 0.94

ArrayList<KeyValue> list = new ArrayList<KeyValue>();
list.add(keyValue);
Result result = new Result(list);
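Putting the 0.96 API together, here is a minimal DAO-style sketch (the table, family, and column names are made up for illustration; in 0.96 HTable is still the usual client-side handle):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UserDaoExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");  // hypothetical table name
    try {
      // Write a cell
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
      table.put(put);

      // Read it back, iterating over Cells instead of KeyValues
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      for (Cell cell : result.rawCells()) {
        String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
        String value = Bytes.toString(CellUtil.cloneValue(cell));
        System.out.println(qualifier + " = " + value);
      }
    } finally {
      table.close();
    }
  }
}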

 

 

Different Maven Profiles for Hadoop 1 and Hadoop 2

Since we have different versions of Hadoop, sometimes we have to compile/deploy our code against one but not the other. The ideal way is to set up profiles in Maven to compile and create the artifacts accordingly.

Here is the profiles section defined in the HBase project. You can create the same profiles in your own pom.

To compile against one of these profiles,
mvn -Dhadoop.profile=1.1 clean compile

<profiles>
		<profile>
			<id>hadoop-1.1</id>
			<activation>
				<property>
					<name>hadoop.profile</name>
					<value>1.1</value>
				</property>
			</activation>
			<dependencies>
				<dependency>
					<groupId>org.apache.hadoop</groupId>
					<artifactId>hadoop-core</artifactId>
					<version>${hadoop-1.version}</version>
				</dependency>
				<dependency>
					<groupId>org.apache.hadoop</groupId>
					<artifactId>hadoop-test</artifactId>
					<version>${hadoop-1.version}</version>
				</dependency>
			</dependencies>
		</profile>
		<profile>
			<id>hadoop-1.0</id>
			<activation>
				<property>
					<name>hadoop.profile</name>
					<value>1.0</value>
				</property>
			</activation>
			<dependencies>
				<dependency>
					<groupId>org.apache.hadoop</groupId>
					<artifactId>hadoop-core</artifactId>
					<version>${hadoop-1.version}</version>
				</dependency>
				<dependency>
					<groupId>org.apache.hadoop</groupId>
					<artifactId>hadoop-test</artifactId>
					<version>${hadoop-1.version}</version>
				</dependency>
			</dependencies>
		</profile>
		<profile>
			<id>hadoop-2.0</id>
			<activation>
				<property>
					<name>!hadoop.profile</name>
				</property>
			</activation>
			<dependencyManagement>
				<dependencies>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-mapreduce-client-core</artifactId>
						<version>${hadoop-2.version}</version>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
						<version>${hadoop-2.version}</version>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
						<version>${hadoop-2.version}</version>
						<type>test-jar</type>
						<scope>test</scope>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-hdfs</artifactId>
						<exclusions>
							<exclusion>
								<groupId>javax.servlet.jsp</groupId>
								<artifactId>jsp-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>javax.servlet</groupId>
								<artifactId>servlet-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>stax</groupId>
								<artifactId>stax-api</artifactId>
							</exclusion>
						</exclusions>
						<version>${hadoop-2.version}</version>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-hdfs</artifactId>
						<version>${hadoop-2.version}</version>
						<type>test-jar</type>
						<scope>test</scope>
						<exclusions>
							<exclusion>
								<groupId>javax.servlet.jsp</groupId>
								<artifactId>jsp-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>javax.servlet</groupId>
								<artifactId>servlet-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>stax</groupId>
								<artifactId>stax-api</artifactId>
							</exclusion>
						</exclusions>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-auth</artifactId>
						<version>${hadoop-2.version}</version>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-common</artifactId>
						<version>${hadoop-2.version}</version>
						<exclusions>
							<exclusion>
								<groupId>javax.servlet.jsp</groupId>
								<artifactId>jsp-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>javax.servlet</groupId>
								<artifactId>servlet-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>stax</groupId>
								<artifactId>stax-api</artifactId>
							</exclusion>
						</exclusions>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-client</artifactId>
						<version>${hadoop-2.version}</version>
					</dependency>
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-annotations</artifactId>
						<version>${hadoop-2.version}</version>
					</dependency>
					<!-- This was marked as test dep in earlier pom, but was scoped compile. 
						Where do we actually need it? -->
					<dependency>
						<groupId>org.apache.hadoop</groupId>
						<artifactId>hadoop-minicluster</artifactId>
						<version>${hadoop-2.version}</version>
						<exclusions>
							<exclusion>
								<groupId>javax.servlet.jsp</groupId>
								<artifactId>jsp-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>javax.servlet</groupId>
								<artifactId>servlet-api</artifactId>
							</exclusion>
							<exclusion>
								<groupId>stax</groupId>
								<artifactId>stax-api</artifactId>
							</exclusion>
						</exclusions>
					</dependency>
				</dependencies>
			</dependencyManagement>
		</profile>
	</profiles>

YARN log files

Viewing MapReduce job log files has always been a pain. With YARN, you can enable log aggregation, which pulls together all the individual logs belonging to an MR job and lets you view the aggregated log with a single command.

You can view your MapReduce job log files using the following command

yarn logs -applicationId  <YOUR_APPLICATION_ID>  |less

eg.

yarn logs -applicationId  application_1399561129519_4422 |less

You can enable the log aggregation in the yarn-site.xml as follows

cat /etc/hadoop/conf/yarn-site.xml

<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>

To see the list of running MapReduce jobs

mapred job -list

To check the status of a MapReduce job

mapred job -status <YOUR_JOB_ID>

Mahout Math Package

I was looking into the Mahout Math package, especially the various Vector implementations such as DenseVector, NamedVector, and WeightedVector. Here is the class diagram.

[Class diagram: the Mahout Math Vector hierarchy]

Vector is the base interface for all the different Vector implementations. There is a base abstract class, AbstractVector, which provides base implementations of many of the methods defined in the interface, such as norm, normalize, logNormalize, getDistanceSquared, maxValue, minValue, plus, minus, and times.

There are a few interesting methods in AbstractVector, e.g. createOptimizedCopy(Vector vector):

private static Vector createOptimizedCopy(Vector vector) {
  Vector result;
  if (vector.isDense()) {
    result = vector.like().assign(vector, Functions.SECOND_LEFT_ZERO);
  } else {
    result = vector.clone();
  }
  return result;
}

The ConstantVector implementation is also quite interesting. Since it is a vector of size n with a constant value, it only has an int size member and a double value member.

As the name implies, DenseVector is an implementation of a dense vector; internally it holds an array of doubles. It also provides NonDefaultIterator and AllIterator, which implement the Iterator interface to support for-each style iteration. NonDefaultIterator iterates through the non-zero elements, whereas AllIterator iterates through all elements.

DelegatingVector implements the Vector interface without extending AbstractVector; it holds a Vector delegate internally. WeightedVector extends DelegatingVector and decorates the delegate with a weight and a positional index, as in the sketch below.
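Here is a small sketch exercising a few of the Vector operations discussed above (the values are arbitrary, and the WeightedVector constructor arguments are just illustrative):

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.WeightedVector;

public class VectorExample {
  public static void main(String[] args) {
    Vector v = new DenseVector(new double[] {3.0, 0.0, 4.0});
    Vector w = new DenseVector(new double[] {1.0, 2.0, 2.0});

    System.out.println("L2 norm of v:   " + v.norm(2));                  // 5.0
    System.out.println("dot(v, w):      " + v.dot(w));                   // 3*1 + 0*2 + 4*2 = 11.0
    System.out.println("v plus w:       " + v.plus(w));
    System.out.println("non-zeros in v: " + v.getNumNonZeroElements());  // 2

    // WeightedVector decorates its delegate with a weight and a positional index
    WeightedVector weighted = new WeightedVector(w, 0.5, 7);
    System.out.println("weight = " + weighted.getWeight()
        + ", index = " + weighted.getIndex()
        + ", norm = " + weighted.norm(2));                               // 3.0
  }
}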

For reference, here is the Vector interface from the Mahout Math package:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.mahout.math;
import org.apache.mahout.math.function.DoubleDoubleFunction;
import org.apache.mahout.math.function.DoubleFunction;

/**
* The basic interface including numerous convenience functions <p/> NOTE: All implementing classes must have a
* constructor that takes an int for cardinality and a no-arg constructor that can be used for marshalling the Writable
* instance <p/> NOTE: Implementations may choose to reuse the Vector.Element in the Iterable methods
*/
public interface Vector extends Cloneable {

/** @return a formatted String suitable for output */
String asFormatString();

/**
* Assign the value to all elements of the receiver
*
* @param value a double value
* @return the modified receiver
*/
Vector assign(double value);

/**
* Assign the values to the receiver
*
* @param values a double[] of values
* @return the modified receiver
* @throws CardinalityException if the cardinalities differ
*/
Vector assign(double[] values);

/**
* Assign the other vector values to the receiver
*
* @param other a Vector
* @return the modified receiver
* @throws CardinalityException if the cardinalities differ
*/
Vector assign(Vector other);

/**
* Apply the function to each element of the receiver
*
* @param function a DoubleFunction to apply
* @return the modified receiver
*/
Vector assign(DoubleFunction function);

/**
* Apply the function to each element of the receiver and the corresponding element of the other argument
*
* @param other a Vector containing the second arguments to the function
* @param function a DoubleDoubleFunction to apply
* @return the modified receiver
* @throws CardinalityException if the cardinalities differ
*/
Vector assign(Vector other, DoubleDoubleFunction function);

/**
* Apply the function to each element of the receiver, using the y value as the second argument of the
* DoubleDoubleFunction
*
* @param f a DoubleDoubleFunction to be applied
* @param y a double value to be argument to the function
* @return the modified receiver
*/
Vector assign(DoubleDoubleFunction f, double y);

/**
* Return the cardinality of the recipient (the maximum number of values)
*
* @return an int
*/
int size();

/**
* @return true iff this implementation should be considered dense -- that it explicitly
* represents every value
*/
boolean isDense();

/**
* @return true iff this implementation should be considered to be iterable in index order in an efficient way.
* In particular this implies that {@link #all()} and {@link #nonZeroes()} ()} return elements
* in ascending order by index.
*/
boolean isSequentialAccess();

/**
* Return a copy of the recipient
*
* @return a new Vector
*/
@SuppressWarnings("CloneDoesntDeclareCloneNotSupportedException")
Vector clone();

Iterable<Element> all();

Iterable<Element> nonZeroes();

/**
* Return an object of Vector.Element representing an element of this Vector. Useful when designing new iterator
* types.
*
* @param index Index of the Vector.Element required
* @return The Vector.Element Object
*/
Element getElement(int index);

/**
* Merge a set of (index, value) pairs into the vector.
* @param updates an ordered mapping of indices to values to be merged in.
*/
void mergeUpdates(OrderedIntDoubleMapping updates);

/**
* A holder for information about a specific item in the Vector. <p/> When using with an Iterator, the implementation
* may choose to reuse this element, so you may need to make a copy if you want to keep it
*/
interface Element {

/** @return the value of this vector element. */
double get();

/** @return the index of this vector element. */
int index();

/** @param value Set the current element to value. */
void set(double value);
}

/**
* Return a new vector containing the values of the recipient divided by the argument
*
* @param x a double value
* @return a new Vector
*/
Vector divide(double x);

/**
* Return the dot product of the recipient and the argument
*
* @param x a Vector
* @return a new Vector
* @throws CardinalityException if the cardinalities differ
*/
double dot(Vector x);

/**
* Return the value at the given index
*
* @param index an int index
* @return the double at the index
* @throws IndexException if the index is out of bounds
*/
double get(int index);

/**
* Return the value at the given index, without checking bounds
*
* @param index an int index
* @return the double at the index
*/
double getQuick(int index);

/**
* Return an empty vector of the same underlying class as the receiver
*
* @return a Vector
*/
Vector like();

/**
* Return a new vector containing the element by element difference of the recipient and the argument
*
* @param x a Vector
* @return a new Vector
* @throws CardinalityException if the cardinalities differ
*/
Vector minus(Vector x);

/**
* Return a new vector containing the normalized (L_2 norm) values of the recipient
*
* @return a new Vector
*/
Vector normalize();

/**
* Return a new Vector containing the normalized (L_power norm) values of the recipient. <p/> See
* http://en.wikipedia.org/wiki/Lp_space <p/> Technically, when 0 < power < 1, we don't have a norm, just a metric,
* but we'll overload this here. <p/> Also supports power == 0 (number of non-zero elements) and power = {@link
* Double#POSITIVE_INFINITY} (max element). Again, see the Wikipedia page for more info
*
* @param power The power to use. Must be >= 0. May also be {@link Double#POSITIVE_INFINITY}. See the Wikipedia link
* for more on this.
* @return a new Vector x such that norm(x, power) == 1
*/
Vector normalize(double power);

/**
* Return a new vector containing the log(1 + entry)/ L_2 norm values of the recipient
*
* @return a new Vector
*/
Vector logNormalize();

/**
* Return a new Vector with a normalized value calculated as log_power(1 + entry)/ L_power norm. <p/>
*
* @param power The power to use. Must be > 1. Cannot be {@link Double#POSITIVE_INFINITY}.
* @return a new Vector
*/
Vector logNormalize(double power);

/**
* Return the k-norm of the vector. <p/> See http://en.wikipedia.org/wiki/Lp_space <p/> Technically, when 0 < power
* < 1, we don't have a norm, just a metric, but we'll overload this here. Also supports power == 0 (number of
* non-zero elements) and power = {@link Double#POSITIVE_INFINITY} (max element). Again, see the Wikipedia page for
* more info.
*
* @param power The power to use.
* @see #normalize(double)
*/
double norm(double power);

/** @return The minimum value in the Vector */
double minValue();

/** @return The index of the minimum value */
int minValueIndex();

/** @return The maximum value in the Vector */
double maxValue();

/** @return The index of the maximum value */
int maxValueIndex();

/**
* Return a new vector containing the sum of each value of the recipient and the argument
*
* @param x a double
* @return a new Vector
*/
Vector plus(double x);

/**
* Return a new vector containing the element by element sum of the recipient and the argument
*
* @param x a Vector
* @return a new Vector
* @throws CardinalityException if the cardinalities differ
*/
Vector plus(Vector x);

/**
* Set the value at the given index
*
* @param index an int index into the receiver
* @param value a double value to set
* @throws IndexException if the index is out of bounds
*/
void set(int index, double value);

/**
* Set the value at the given index, without checking bounds
*
* @param index an int index into the receiver
* @param value a double value to set
*/
void setQuick(int index, double value);

/**
* Increment the value at the given index by the given value.
*
* @param index an int index into the receiver
* @param increment sets the value at the given index to value + increment;
*/
void incrementQuick(int index, double increment);

/**
* Return the number of values in the recipient which are not the default value. For instance, for a
* sparse vector, this would be the number of non-zero values.
*
* @return an int
*/
int getNumNondefaultElements();

/**
* Return the number of non zero elements in the vector.
*
* @return an int
*/
int getNumNonZeroElements();

/**
* Return a new vector containing the product of each value of the recipient and the argument
*
* @param x a double argument
* @return a new Vector
*/
Vector times(double x);

/**
* Return a new vector containing the element-wise product of the recipient and the argument
*
* @param x a Vector argument
* @return a new Vector
* @throws CardinalityException if the cardinalities differ
*/
Vector times(Vector x);

/**
* Return a new vector containing the subset of the recipient
*
* @param offset an int offset into the receiver
* @param length the cardinality of the desired result
* @return a new Vector
* @throws CardinalityException if the length is greater than the cardinality of the receiver
* @throws IndexException if the offset is negative or the offset+length is outside of the receiver
*/
Vector viewPart(int offset, int length);

/**
* Return the sum of all the elements of the receiver
*
* @return a double
*/
double zSum();

/**
* Return the cross product of the receiver and the other vector
*
* @param other another Vector
* @return a Matrix
*/
Matrix cross(Vector other);

/*
* Need stories for these but keeping them here for now.
*/
// void getNonZeros(IntArrayList jx, DoubleArrayList values);
// void foreachNonZero(IntDoubleFunction f);
// DoubleDoubleFunction map);
// NewVector assign(Vector y, DoubleDoubleFunction function, IntArrayList
// nonZeroIndexes);

/**
* Examples speak louder than words: aggregate(plus, pow(2)) is another way to say
* getLengthSquared(), aggregate(max, abs) is norm(Double.POSITIVE_INFINITY). To sum all of the positive values,
* aggregate(plus, max(0)).
* @param aggregator used to combine the current value of the aggregation with the result of map.apply(nextValue)
* @param map a function to apply to each element of the vector in turn before passing to the aggregator
* @return the final aggregation
*/
double aggregate(DoubleDoubleFunction aggregator, DoubleFunction map);

/**
* <p>Generalized inner product - take two vectors, iterate over them both, using the combiner to combine together
* (and possibly map in some way) each pair of values, which are then aggregated with the previous accumulated
* value in the combiner.</p>
* <p>
* Example: dot(other) could be expressed as aggregate(other, Plus, Times), and kernelized inner products (which
* are symmetric on the indices) work similarly.
* @param other a vector to aggregate in combination with
* @param aggregator function we're aggregating with; fa
* @param combiner function we're combining with; fc
* @return the final aggregation; if r0 = fc(this[0], other[0]), ri = fa(r_{i-1}, fc(this[i], other[i]))
* for all i > 0
*/
double aggregate(Vector other, DoubleDoubleFunction aggregator, DoubleDoubleFunction combiner);

/** Return the sum of squares of all elements in the vector. Square root of this value is the length of the vector. */
double getLengthSquared();

/** Get the square of the distance between this vector and the other vector. */
double getDistanceSquared(Vector v);

/**
* Gets an estimate of the cost (in number of operations) it takes to lookup a random element in this vector.
*/
double getLookupCost();

/**
* Gets an estimate of the cost (in number of operations) it takes to advance an iterator through the nonzero
* elements of this vector.
*/
double getIteratorAdvanceCost();

/**
* Return true iff adding a new (nonzero) element takes constant time for this vector.
*/
boolean isAddConstantTime();
}