For those who are curious, here is the ANTLR grammar specification for Spark SQL, an adaptation of Presto's presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4 grammar.
Here are some Spark code snippets you will find particularly useful when performing basic big data analytics.
Read CSV
import com.databricks.spark.csv._
val data = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(YOUR_INPUT_PATH)
Read AVRO
import com.databricks.spark.avro._
val data = sqlContext.read.avro(YOUR_INPUT_PATH)
Read JSON
val data = sqlContext.read.json(YOUR_INPUT_PATH)
More often than not, you will want to perform some aggregations:
import org.apache.spark.sql.functions.{asc, desc}
val result = data.groupBy("company_branch").count().sort(desc("count"))
val result = data.groupBy("company_branch", "department").count().sort(asc("company_branch"), desc("count"))
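Outside of Spark, the same groupBy-count-sort pattern can be sketched with plain Scala collections; the Employee records and field values below are made-up sample data, purely for illustration:

```scala
// Plain-Scala sketch of the groupBy("company_branch").count().sort(desc("count")) pattern.
// The Employee records below are hypothetical sample data.
case class Employee(companyBranch: String, department: String)

val records = Seq(
  Employee("north", "sales"),
  Employee("north", "sales"),
  Employee("north", "hr"),
  Employee("south", "sales")
)

// Group by branch, count rows per group, then sort by count descending.
val counts = records
  .groupBy(_.companyBranch)
  .map { case (branch, rows) => (branch, rows.size) }
  .toSeq
  .sortBy { case (_, n) => -n }

counts.foreach(println)
```

The difference, of course, is that Spark evaluates the DataFrame version lazily and distributes the work across the cluster.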
You may then want to save your results back to a CSV file:
result.write.format("com.databricks.spark.csv").save(YOUR_OUTPUT_PATH)
If you want to consolidate all of the result part files into a single file, you can use the coalesce(1) method. Note that this moves all of the data into a single partition, so it is only advisable for small result sets:
result.coalesce(1).write.format("com.databricks.spark.csv").save(YOUR_OUTPUT_PATH)
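As a rough, Spark-free sketch of what a single coalesced CSV part file might contain for the (company_branch, count) results above (the sample rows are hypothetical):

```scala
// Render hypothetical (company_branch, count) rows as CSV text,
// roughly what a single CSV part file with a header row would look like.
val rows = Seq(("north", 3), ("south", 1))
val header = "company_branch,count"
val csvText = (header +: rows.map { case (b, n) => s"$b,$n" }).mkString("\n")
println(csvText)
```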
To perform projection/selection:
import org.apache.spark.sql.functions.col
data.select(col("name"), col("age"), col("department_name").alias("dept"))
To perform filtering:
data.filter("age > 18")
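The select and filter calls above have direct analogues on plain Scala collections; a minimal sketch with made-up Person records:

```scala
// Plain-Scala sketch of select(name, age) combined with filter("age > 18").
// The Person records are hypothetical sample data.
case class Person(name: String, age: Int, departmentName: String)

val people = Seq(
  Person("Ann", 25, "engineering"),
  Person("Bob", 17, "hr"),
  Person("Cho", 31, "sales")
)

// filter("age > 18"), then project down to the name and age columns:
val adults = people.filter(_.age > 18).map(p => (p.name, p.age))
adults.foreach(println)
```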
To use SQL, call the registerTempTable method on the DataFrame, then query the registered table:
data.registerTempTable("data")
sqlContext.sql("select name, age from data")