Spark SQL | Wei Shung Chung

Here are some Spark code snippets you will find particularly useful when performing basic big data analytics

Read CSV

import com.databricks.spark.csv._
val data = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load(YOUR_INPUT_PATH)

Read AVRO

import com.databricks.spark.avro._
val data = sqlContext.read.avro(YOUR_INPUT_PATH)

Read JSON

val data = sqlContext.read.json(YOUR_INPUT_PATH)

Most often or not, you will probably perform some aggregations

val result = data.groupBy("company_branch").count().sort(desc("count"))

val result = data.groupBy("company_branch", "department").count().sort(asc("company_branch"),desc("count"))

You would want to save your results back to CSV file again

result.write.format("com.databricks.spark.csv").save(YOUR_OUTPUT_PATH)

If you want to consolidate all the result part files into one single file, you can use the coalesce(1) method

result.coalesce(1).write.format("com.databricks.spark.csv").save(YOUR_OUTPUT_PATH)

To perform projection/selection,

data.select(col("name"), col("age"), col("department_name").alias("dept"))

To perform filtering

data.filter("age > 18")

To use SQL, call the registerTempTable method on the dataframe

data.registerTempTable("data")
sqlContext.sql("select name, age from data")

Wei Shung Chung

Wei Shung Chung – Hadoop, HBase, MapReduce, Spark, Spark ML, Machine Learning, Deep Learning

Category Archives: Spark SQL

ANTLR grammar specification for Spark SQL

Useful Spark Code Snippets for Data Analytics