In my last blog, I calculated the missing-value percentage for every column in the data. The next step is to impute missing values in selected columns before model training.
However, I am going to skip in-depth data cleaning and feature selection, e.g. removing outliers and computing correlations between features. I will do that in the next blog 😉
Instead, I am going to build a model using the following features, all of which have a low missing-value percentage:
"bedroomcnt", "bathroomcnt", "roomcnt", "taxamount", "taxvaluedollarcnt", "lotsizesquarefeet", "finishedsquarefeet12", "latitude", "longitude"
I replaced missing values in bedroomcnt with 3 and in bathroomcnt with 2 (their most frequent values), and dropped any row with a missing value in any of the other columns.
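The imputation step above can be sketched with Spark's `DataFrameNaFunctions` (`na.fill` and `na.drop`). This is a minimal, self-contained illustration: the tiny hand-made DataFrame stands in for the real properties data, and only three of the nine columns are shown.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the imputation step, assuming a local Spark session.
val spark = SparkSession.builder
  .master("local[*]")
  .appName("impute-sketch")
  .getOrCreate()
import spark.implicits._

// A tiny stand-in for the real properties DataFrame.
val df = Seq(
  (Some(2.0), Some(1.0), Some(500000.0)),
  (None,      Some(2.0), Some(300000.0)), // bedroomcnt missing -> filled with 3
  (Some(4.0), None,      None)            // taxvaluedollarcnt missing -> row dropped
).toDF("bedroomcnt", "bathroomcnt", "taxvaluedollarcnt")

val imputed = df
  // Fill the two count columns with their most frequent values.
  .na.fill(Map("bedroomcnt" -> 3.0, "bathroomcnt" -> 2.0))
  // Drop rows that still have nulls in the remaining feature columns.
  .na.drop(Seq("taxvaluedollarcnt"))

imputed.show()
```

In the real run, the `na.drop` call would list all the remaining feature columns rather than just one.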
In this experiment, I used Spark to train a Gradient-Boosted Tree (GBT) regression model, and built a Spark ML pipeline to tune its hyperparameters. The pipeline tests a grid of hyperparameter combinations and chooses the best one based on the evaluation metric, RMSE.
val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, Array(2, 5))
  .addGrid(gbt.maxIter, Array(50, 100))
  .build()
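For context, here is one way the surrounding pipeline and tuning setup might look (the grid from above is repeated so the sketch is self-contained). The label column name `logerror` and the use of `CrossValidator` are assumptions on my part; the original could equally use `TrainValidationSplit` or a different label.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Assemble the selected columns into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("bedroomcnt", "bathroomcnt", "roomcnt", "taxamount",
    "taxvaluedollarcnt", "lotsizesquarefeet", "finishedsquarefeet12",
    "latitude", "longitude"))
  .setOutputCol("features")

val gbt = new GBTRegressor()
  .setLabelCol("logerror") // assumed label column name
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(assembler, gbt))

// 2 maxDepth values x 2 maxIter values = 4 candidate models.
val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, Array(2, 5))
  .addGrid(gbt.maxIter, Array(50, 100))
  .build()

// RMSE is the metric used to pick the best parameter combination.
val evaluator = new RegressionEvaluator()
  .setLabelCol("logerror")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(evaluator)
  .setNumFolds(3)

// val model = cv.fit(trainingData) // trainingData: the imputed DataFrame
```

Calling `cv.fit` trains all four candidates on each fold and returns the model with the lowest cross-validated RMSE.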
Next, I will need to go back to data cleaning and feature selection to choose better (and more) features to improve the model.