Useful Spark Code Snippets for Data Analytics

Here are some Spark code snippets you will find particularly useful when performing basic big data analytics

Read CSV

import com.databricks.spark.csv._
val data ="com.databricks.spark.csv").option("header","true").option("inferSchema","true").load(YOUR_INPUT_PATH)


import com.databricks.spark.avro._
val data =


val data =

Most often or not, you will probably perform some aggregations

val result = data.groupBy("company_branch").count().sort(desc("count"))
val result = data.groupBy("company_branch", "department").count().sort(asc("company_branch"),desc("count"))

You would want to save your results back to CSV file again


If you want to consolidate all the result part files into one single file, you can use the coalesce(1) method


To perform projection/selection,"name"), col("age"), col("department_name").alias("dept"))

To perform filtering

data.filter("age > 18")

To use SQL, call the registerTempTable method on the dataframe

sqlContext.sql("select name, age from data")

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s