Spark: Analyzing Stock Price

Simple moving average is an indicator many people use in analyzing stock price. Here I want to show how to use Spark’s window function to compute the moving average easily.

First, lets load the stock data

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

case class Stock(exc: String, symbol: String, stockDate: String, open: Double, high: Double, low: Double, close: Double,
                   volume: Double, adjClose: Double)

  val data = sc.textFile("s3://giantify-stocks/APC_2016_08_03.csv")
  val stocksData = { d =>
    val tokens = d.split(",")
    Stock(tokens(0), tokens(1), tokens(2), tokens(3).toDouble, tokens(4).toDouble, tokens(5).toDouble, tokens(6).toDouble, tokens(7).toDouble,

  val stocks = stocksData.withColumn("stockDate", to_date(col("stockDate")))

Next we will compute the 20 days, 50 days, 100 days simple moving averages

val movingAverageWindow20 = Window.orderBy("stockDate").rowsBetween(-20, 0)
val movingAverageWindow50 = Window.orderBy("stockDate").rowsBetween(-50, 0)
val movingAverageWindow100 = Window.orderBy("stockDate").rowsBetween(-100, 0)

// Calculate the moving average
val stocksMA = stocks.withColumn( "MA20", avg(stocks("close")).over(movingAverageWindow20)).withColumn( "MA50", avg(stocks("close")).over(movingAverageWindow50)).withColumn("MA100", avg(stocks("close")).over(movingAverageWindow100))

stocksMA.filter("close > MA50").select(col("stockDate"), col("close"), col("MA50")).show()

With the moving average calculated, let’s find when closing price exceeds the 50 days moving average

stocksMA.filter("close > MA50").select(col("stockDate"), col("close"), col("MA50")).show()
Screen Shot 2016-08-21 at 11.29.13 AM

Stay tuned for the next blog on how to use Zeppelin to visualize the price data


Stock Clustering

Diversification in stock portfolio is always desired to minimize the risk. One of the many ways is to cluster these stocks into many categories in which each category exhibits similar behavior.

Here are a few categories I identified with some stocks by applying simple clustering algorithm.

Category 1


Screen Shot 2015-12-25 at 9.32.29 PM

Screen Shot 2015-12-25 at 9.33.47 PM

Category 2:

Screen Shot 2015-12-25 at 9.47.56 PM

Screen Shot 2015-12-25 at 9.49.17 PM

Category 3:

Screen Shot 2015-12-25 at 9.50.31 PM.png

Screen Shot 2015-12-25 at 9.53.13 PM

Category 4:

Screen Shot 2015-12-25 at 9.55.56 PM

Screen Shot 2015-12-25 at 9.56.46 PM

In the experiment, I tried with different number of clusters and calculated its corresponding cost. See the following chart, I chose 15 as the ideal number of clusters to cluster 93 stocks I have in the portfolio.

Screen Shot 2015-12-25 at 10.09.34 PM

The main goal of this exercise is to build a balanced portfolio with combination of stocks from different categories to minimize risk.

Disclaimer: Trading is risky and you can lose all money. Past performance is not a guide to future performance. The content is intended to be used and must be used for informational purposes only. It is very important to do your own analysis and consult a financial professional’s advice to verify our content. It is very important to do your own analysis before making any investment based on your own personal circumstances.

Stock Prediction

Over the last weekend, I spent some hours tuning and enhancing my new prediction model for stock BAC. The result is quite promising. I still need to do more experiments before replacing my existing model on
Stay tuned for more infos.

Take a look at the following chart, you can see some clusters of buy signals generated before the peak.


You can check out my stock prediction model at