Deprecated functions in org.apache.spark.sql.functions in Spark 2.0

I recently migrated some of my Spark code from 1.6.0 to 2.2.0 and discovered that several functions in org.apache.spark.sql.functions._ have been deprecated and renamed.

To name a few:

1) rowNumber() is replaced by row_number()

import org.apache.spark.sql.functions._
/**
* @group window_funcs
* @deprecated As of 1.6.0, replaced by `row_number`. This will be removed in Spark 2.0.
*/
@deprecated("Use row_number. This will be removed in Spark 2.0.", "1.6.0")
def rowNumber(): Column = row_number()

2) isNaN is replaced by isnan

/**
 * @group normal_funcs
 * @deprecated As of 1.6.0, replaced by `isnan`. This will be removed in Spark 2.0.
 */
@deprecated("Use isnan. This will be removed in Spark 2.0.", "1.6.0")
def isNaN(e: Column): Column = isnan(e)

3) inputFileName() is replaced by input_file_name()

/**
 * @group normal_funcs
 * @deprecated As of 1.6.0, replaced by `input_file_name`. This will be removed in Spark 2.0.
 */
@deprecated("Use input_file_name. This will be removed in Spark 2.0.", "1.6.0")
def inputFileName(): Column = input_file_name()

To get the full list of replaced/renamed functions, refer to the Spark 1.6 source:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
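As a quick sanity check after migrating, the new spellings can be exercised together. A minimal sketch, assuming a local SparkSession and a made-up two-column DataFrame (names here are purely illustrative):

```scala
// Sketch of the post-1.6 spellings (row_number, isnan, input_file_name).
// The "sales" data and column names are made up for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{input_file_name, isnan, row_number}

val spark = SparkSession.builder().master("local[*]").appName("renamed-funcs").getOrCreate()
import spark.implicits._

val sales = Seq(("a", 3.0), ("a", Double.NaN), ("b", 1.0)).toDF("key", "amount")

val w = Window.partitionBy("key").orderBy("amount")
val ranked = sales
  .withColumn("rn", row_number().over(w))   // was rowNumber()
  .withColumn("bad", isnan($"amount"))      // was isNaN
  .withColumn("src", input_file_name())     // was inputFileName(); empty for in-memory data

ranked.show()
```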

InstaCart: Product Recommender

Instacart recently open-sourced anonymized data on customer orders for its Kaggle competition. You can find more info at the link below.

https://www.kaggle.com/c/instacart-market-basket-analysis

I previously did some data exploration to uncover interesting insights; see my earlier blog post.

Next, I would like to use their anonymized data to build a product recommender system. There are many approaches to product recommendation, e.g. most popular items, also bought/viewed, and featured items. I am going to explore the most-popular-items approach and the collaborative filtering (also bought) approach.

We can identify the top 5 popular items for each department and use them for most-popular-items recommendations. For example, the top 5 items for the frozen department are Blueberries, Organic Broccoli Florets, Organic Whole Strawberries, Pineapple Chunks, and Frozen Organic Wild Blueberries.


Next, I used Spark's ALS algorithm to build a collaborative-filtering-based recommender system.

Some stats about the data:

  • Total user-product pairs (ratings): 1,384,617
  • Total users: 131,209
  • Total products: 39,123

I randomly split the data into a train set (0.8) and a test set (0.2), which resulted in:

  • Number of train elements: 1,107,353
  • Number of test elements: 277,264

Here are the parameters I used in ALS: rank = 10, lambda = 1, and number of iterations = 10, 50, 60, 80.

  • Number of iterations = 10,    RMSE = 0.9999895
  • Number of iterations = 50,    RMSE = 0.9999828
  • Number of iterations = 60,    RMSE = 0.9999875
  • Number of iterations = 80,    RMSE = 0.9999933

We can also try out different ranks in grid search.
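The training and evaluation loop can be sketched with the spark.ml ALS API. This is only a sketch on tiny synthetic data; the real run used the Instacart user-product pairs, and the column names, session setup, and coldStartStrategy choice here are my assumptions:

```scala
// Sketch: ALS with rank = 10, lambda (regParam) = 1, 10 iterations,
// on a tiny made-up stand-in for the Instacart user_id/product_id/count data.
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("als-sketch").getOrCreate()
import spark.implicits._

val ratings = Seq(
  (1, 10, 2.0), (1, 11, 1.0), (2, 10, 1.0),
  (2, 12, 3.0), (3, 11, 2.0), (3, 12, 1.0)
).toDF("user_id", "product_id", "count")

// 0.8 / 0.2 random split, as in the post.
val Array(train, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)

val als = new ALS()
  .setRank(10)
  .setRegParam(1.0)              // lambda
  .setMaxIter(10)
  .setUserCol("user_id")
  .setItemCol("product_id")
  .setRatingCol("count")
  .setColdStartStrategy("drop")  // drop NaN predictions for unseen users/items

val model = als.fit(train)

val rmse = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("count")
  .setPredictionCol("prediction")
  .evaluate(model.transform(test))
println(s"RMSE = $rmse")

// Top-10 product recommendations per user.
val recs = model.recommendForAllUsers(10)
```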

Here are some recommendations suggested by the ALS collaborative filtering algorithm (number of iterations = 80, rank = 10).

Given user 124383's transaction history:

+-------+----------+-----+----------+---------------------------+--------+-------------+
|user_id|product_id|count|product_id|product_name               |aisle_id|department_id|
+-------+----------+-----+----------+---------------------------+--------+-------------+
|124383 |49478     |1    |49478     |Frozen Organic Strawberries|24      |4            |
|124383 |21903     |1    |21903     |Organic Baby Spinach       |123     |4            |
|124383 |19508     |1    |19508     |Corn Tortillas             |128     |3            |
|124383 |20114     |1    |20114     |Jalapeno Peppers           |83      |4            |
|124383 |44142     |1    |44142     |Red Onion                  |83      |4            |
|124383 |20345     |1    |20345     |Thin Crust Pepperoni Pizza |79      |1            |
|124383 |27966     |1    |27966     |Organic Raspberries        |123     |4            |
+-------+----------+-----+----------+---------------------------+--------+-------------+

Here are the recommendations:

+----------+-------------------------------------------------------------+--------+-------------+
|product_id|product_name                                                 |aisle_id|department_id|
+----------+-------------------------------------------------------------+--------+-------------+
|28717     |Sport Deluxe Adjustable Black Ankle Stabilizer               |133     |11           |
|15372     |Meditating Cedarwood Mineral Bath                            |25      |11           |
|18962     |Arroz Calasparra Paella Rice                                 |63      |9            |
|2528      |Cluckin’ Good Stew                                           |40      |8            |
|21156     |Dreamy Cold Brew Concentrate                                 |90      |7            |
|12841     |King Crab Legs                                               |39      |12           |
|24862     |Old Indian Wild Cherry Bark Syrup                            |47      |11           |
|37535     |Voluminous Extra-Volume Collagen Mascara - Blackest Black 680|132     |11           |
|30847     |Wild Oregano Oil                                             |47      |11           |
+----------+-------------------------------------------------------------+--------+-------------+


Instacart Data Insight

Instacart recently organized a Kaggle competition, publishing anonymized data on customer orders over time; the goal is to predict which previously purchased products will be in a user’s next order. You can check out the link below.

https://www.kaggle.com/c/instacart-market-basket-analysis

First, I used Spark to analyze the data and uncover hidden insights.

We were given the departments data, the products data, and the orders data.

Let's find out which is the busiest day of the week. As it turns out, day 0 is the busiest, with 600,905 orders, followed by day 1 with 587,478 orders. I would assume day 0 is Sunday and day 1 is Monday.
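The day-of-week counts come from a simple groupBy. A sketch, assuming the orders data exposes an order_dow column (as in the Instacart CSVs); a tiny synthetic stand-in is used here so the snippet is self-contained:

```scala
// Count orders per day of week, busiest first.
// The "orders" rows are a made-up stand-in for the real orders.csv data.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder().master("local[*]").appName("busiest-day").getOrCreate()
import spark.implicits._

val orders = Seq((1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 2))
  .toDF("order_id", "order_dow")

val byDay = orders.groupBy("order_dow").count().orderBy(desc("count"))
byDay.show()
```

The same pattern with groupBy("order_hour_of_day") gives the busiest-hour breakdown below.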


Next, let's figure out which is the busiest hour of the day. It turns out 10 am is the busiest hour, with 288,418 orders, followed by 11 am with 284,728 orders.

As you can see, Instacart customers like to shop from 11 am to 5 pm. It would be interesting to see the day of week + hour of day breakdown too.


Here is the breakdown of the most popular hours for each day; 10 am dominates the top spot.


Next, let's figure out the top ten most popular items among Instacart customers. Surprisingly, the banana is the most popular item.


Let's find out the top item for each department. We can see that Blueberries are the most popular item in the frozen department and Extra Virgin Olive Oil is the most popular item in the pantry department. Some unexpected surprises are dried mango, cotton swabs, and Honey Nut Cheerios.


We are also interested in the reorder interval: how many days pass after the prior order before the next order is placed.


As we discovered, 30 days is the most frequent reorder interval; it looks like most customers reorder once a month. On the other hand, 320,608 orders were placed 7 days after the prior order. We can conclude that the majority of customers reorder either 1 month or 1 week after their prior order.

Stay tuned for the next blog post on my findings at the individual user level.

Scala Enumeration

In Java, we use an enum to represent a fixed set of constants.

For example, we would define a days-of-week enum type as follows:

public enum Day {
    SUNDAY, MONDAY, TUESDAY, WEDNESDAY,
    THURSDAY, FRIDAY, SATURDAY 
}

In Scala, we can do the same thing by extending Enumeration, for example:

object Day extends Enumeration {
  type Day = Value
  val SUNDAY, MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY = Value
}
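Enumeration gives a few things for free: lookup by name, stable integer ids, and iteration over all values. A short sketch (the Day object is repeated so the snippet stands alone):

```scala
// Day enumeration as defined above, repeated so this snippet is self-contained.
object Day extends Enumeration {
  type Day = Value
  val SUNDAY, MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY = Value
}

// Look up a value by its name (throws NoSuchElementException for unknown names).
val d = Day.withName("MONDAY")

// Ids are assigned in declaration order, starting at 0.
println(Day.SUNDAY.id)   // 0

// Iterate over all values in declaration order.
Day.values.foreach(println)
```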

You can find examples of Scala Enumeration usage in Spark:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/TaskState.scala

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala

Running Spark Job in AWS Data Pipeline

If you want to run a Spark job in AWS Data Pipeline, add an EmrActivity and use command-runner.jar to submit the Spark job.

In the Step field of the EmrActivity node, enter the command as follows:

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,cluster,--class,com.yourcompany.yourpackage.YourClass,s3://PATH_TO_YOUR_JAR,YOUR_PROGRAM_ARGUMENT_1,YOUR_PROGRAM_ARGUMENT_2,YOUR_PROGRAM_ARGUMENT_3

Some useful resources:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-commandrunner.html

BLAS in MLlib

I was looking into the BLAS.scala routines for MLlib's vectors and matrices. It looks like Spark uses F2jBLAS for level 1 routines.

https://github.com/apache/spark/blob/master/mllib-local/src/main/scala/org/apache/spark/ml/linalg/BLAS.scala

// For level-1 routines, we use Java implementation.
private def f2jBLAS: NetlibBLAS = {
  if (_f2jBLAS == null) {
    _f2jBLAS = new F2jBLAS
  }
  _f2jBLAS
}

Here are some great resources about different BLAS implementations
http://www.spark.tc/blas-libraries-in-mllib/

https://blog.cloudera.com/blog/2017/02/accelerating-apache-spark-mllib-with-intel-math-kernel-library-intel-mkl/

BLAS routines (Level 1,2,3)
http://www.netlib.org/blas/#_blas_routines

Render Json using Jackson in Scala

If you use the Jackson JSON library in Scala, remember to register the DefaultScalaModule so that ObjectMapper can serialize Scala collections such as List and Array to JSON correctly. See below.

val objectMapper = new ObjectMapper()
objectMapper.registerModule(DefaultScalaModule)

Simple example:

import com.fasterxml.jackson.annotation.JsonAutoDetect.Visibility
import com.fasterxml.jackson.annotation.{JsonProperty, PropertyAccessor}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule



object JsonExample {
  case class Car(@JsonProperty("id")  id: Long)
  case class Person(@JsonProperty("name") name: String = null,
                    @JsonProperty("cars") cars: Seq[Car] = null)

  def main(args:Array[String]):Unit = {
    val car1 = Car(12345)
    val car2 = Car(12346)
    val carsOwned = List(car1, car2)
    val person = Person(name="wei", cars=carsOwned)

    val objectMapper = new ObjectMapper()
    objectMapper.registerModule(DefaultScalaModule)
    objectMapper.setVisibility(PropertyAccessor.ALL, Visibility.NONE)
    objectMapper.setVisibility(PropertyAccessor.FIELD, Visibility.ANY)
    println(s"person: ${objectMapper.writeValueAsString(person)}")
  }
}

Output:
person: {"name":"wei","cars":[{"id":12345},{"id":12346}]}

StackOverflowError when running ALS in Spark’s MLlib

If you ever encounter a StackOverflowError when running ALS in Spark's MLlib, the solution is to turn on checkpointing as follows:

sc.setCheckpointDir("your_checkpointing_dir/")
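A sketch of wiring this up with the DataFrame-based ALS (the path and interval below are illustrative, not from the original run): set the checkpoint directory on the SparkContext, and optionally tune how often ALS checkpoints its intermediate results.

```scala
// Checkpointing truncates the RDD lineage that grows with each ALS
// iteration and can eventually overflow the stack.
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("als-checkpoint").getOrCreate()

// Illustrative path; use a durable location (e.g. HDFS/S3) on a cluster.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// ALS checkpoints intermediate results every N iterations (default 10).
val als = new ALS()
  .setMaxIter(100)
  .setCheckpointInterval(10)
```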

Check out the JIRA ticket and pull request regarding the issue below:

https://issues.apache.org/jira/browse/SPARK-1006

https://github.com/apache/spark/pull/5076

Spark Dedup before Join

In Spark, as with any SQL left outer join, the join will produce more rows than the left table contains if the right table has duplicate join keys.

You could first drop the duplicates on the right table before performing join as follows.

myDF.dropDuplicates("myJoinkey")

Or you could do a groupBy and aggregate instead.
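To make the row inflation concrete, here is a sketch on made-up data: the right table duplicates one join key, so the naive left outer join grows the result, while dropDuplicates on the join key restores one output row per left row.

```scala
// Left outer join with and without deduping the right side first.
// Tables and column names are made up for illustration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dedup-join").getOrCreate()
import spark.implicits._

val left  = Seq((1, 100), (2, 101)).toDF("order_id", "product_id")
val right = Seq((100, "Banana"), (100, "Banana"), (101, "Avocado")).toDF("product_id", "name")

// Naive join: key 100 is duplicated on the right, inflating the output.
val naive = left.join(right, Seq("product_id"), "left_outer")

// Dedup the right side on the join key first, then join.
val deduped = right.dropDuplicates("product_id")
val joined  = left.join(deduped, Seq("product_id"), "left_outer")
```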

Take a look at this dedup example

https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala

Tuning Spark Jobs

Here are some useful resources on how to tune Spark jobs in terms of the number of executors, executor memory, and number of cores.

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

http://etlcode.com/index.php/blog/info/Bigdata/Apache-Spark-tuning-spark-jobs-Optimal-setting-for-executor-core-and-memory

https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-memory

https://aranair.github.io/posts/2017/03/10/tuning-my-apache-spark-cluster-on-aws-emr/