This post covers modelling and evaluation steps from the CRISP-DM methodology.
When making decisions in business it’s prudent to use business facts. I’m sure I have no need to convince you of this.
Sometimes the decision we’re trying to make involves many variables, actors, events and outcomes. This can make it hard to understand the data. For instance, how do we make a decision on who to market a new product to? All the customers? Some of them? What about a complex investment decision?
Decision Trees can be used to solve this problem. I remember learning about Decision Trees when I was studying for my CIMA exams to predict the expected value from a business decision. I’ve used the technique many times to analyse outcomes.
Here’s a basic example.
Decision Trees are a popular tool in business for evaluating decisions as they break down complex decisions into parts that are easier for people to understand. Of course, in the example above I’ve only included one uncertain event. In reality there are many.
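The expected-value arithmetic behind a tree like this can be sketched in a few lines of R. This is only an illustration: the decision, the probabilities and the payoffs below are made-up numbers, not figures from a real case.

```r
# Hypothetical decision: launch a new product or not (all figures are made up).
# Each branch of the chance node has a probability and a payoff.
launch <- data.frame(
  outcome     = c("strong demand", "weak demand"),
  probability = c(0.6, 0.4),
  payoff      = c(500000, -200000)
)

# Expected value of a chance node = sum of probability * payoff.
ev_launch    <- sum(launch$probability * launch$payoff)
ev_no_launch <- 0  # doing nothing costs nothing in this toy example

# At the decision node we pick the option with the highest expected value.
choices <- c(launch = ev_launch, no_launch = ev_no_launch)
best    <- names(which.max(choices))

print(choices)
print(best)
```

With more uncertain events the same calculation is simply repeated from the leaves back towards the root of the tree.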
The predictive nature of decision trees means they are also used in supervised learning for classification and regression problems. With modern analytical tools like R we can feed in data and let an algorithm build a decision tree that classifies it.
The main limitation of Decision Trees is that the algorithms are top-down and greedy: each split is chosen because it looks best at that point, and there is no backtracking. As a result there can be a tendency to over-fit the training data. If you want to learn more about the algorithms there’s an excellent blog post covering this in more detail here.
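To give a feel for what "greedy" means, here is a minimal sketch of a single split search using Gini impurity, one common splitting criterion. The function names and the toy data are my own assumptions for illustration, and this is not necessarily how any particular R package implements it internally.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Greedy search over thresholds for a single numeric variable.
best_split <- function(x, y) {
  values <- sort(unique(x))
  # Candidate thresholds are midpoints between consecutive distinct values.
  thresholds <- (head(values, -1) + tail(values, -1)) / 2
  scores <- sapply(thresholds, function(t) {
    left  <- y[x <= t]
    right <- y[x > t]
    # Weighted impurity of the two child nodes.
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(threshold = thresholds[which.min(scores)], impurity = min(scores))
}

# Toy data (made up): age against a high/low income label.
age    <- c(22, 25, 31, 38, 45, 52)
income <- c("low", "low", "low", "high", "high", "high")

result <- best_split(age, income)
print(result)
```

A tree builder would apply this search to every variable, take the best split, and then repeat on each child node without ever revisiting an earlier choice.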
Modelling with Random Forests
Random Forests are an ensemble learning method that operates by building many trees and combining their outputs: the majority vote for classification, or the mean prediction for regression. We can then apply the model to the test data to obtain predictions.
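To make the aggregation step concrete, here is a small sketch with made-up outputs from five trees. In classification each tree casts a vote and the forest reports the most common class, while the share of votes for a class can be read as a predicted probability. The matrix below is invented for illustration.

```r
# Made-up predictions from five trees (columns) for four customers (rows).
votes <- matrix(
  c(1, 1, 0, 1, 0,
    0, 0, 0, 1, 0,
    1, 1, 1, 1, 1,
    0, 1, 0, 0, 1),
  nrow = 4, byrow = TRUE
)

# Classification: each row takes the class that most trees voted for.
majority <- apply(votes, 1, function(v) as.integer(names(which.max(table(v)))))

# The share of trees voting for class 1 acts as a predicted probability.
prob_one <- rowMeans(votes)

print(majority)
print(prob_one)
```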
Random Forests often achieve high accuracy and the algorithm is well suited to large data sets. We also get a measure of variable importance, which gives analysts a feedback loop for tuning the model.
Let’s have a look at an example with R using classification to predict salary levels.
We start with pre-cleaned data that has been partitioned into training and validation sets. The original source for this data is the “Adult” dataset hosted on UCI’s Machine Learning Repository.
The purpose of the model is to predict who earns more than $50K income. In the training data the Income variable holds an integer 1 if true and 0 if false.
Let’s get into the R code. I’ve made the pre-cleaned data available on Dropbox.
    library(randomForest)
    library(ROCR)

    # Data has been pre-cleaned and pre-processed
    clean_data <- "https://dl.dropboxusercontent.com/u/47056051/RandomForests/clean_data.txt"
    download.file(clean_data, "data.txt")
    data <- dget("data.txt", keep.source = FALSE)
The Random Forest algorithm has parameters that can be tuned before fitting the final model. We can change them by hand or use functions to search for a good value. One such parameter is the number of variables randomly sampled as candidates at each split (mtry).
    # Search for the mtry value with the lowest out-of-bag error
    bestmtry <- tuneRF(data$train[-13], data$train$income,
                       ntreeTry = 100, stepFactor = 1.5, improve = 0.01,
                       trace = TRUE, plot = TRUE, doBest = FALSE)
You can skip this step, in which case randomForest uses the square root of the number of predictors (rounded down) as the default mtry for classification.
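As a quick check of what that default works out to, here is a small helper. The function name is my own, and the figure of 12 predictors is an assumption based on data$train[-13] dropping income from a 13-column data frame.

```r
# Defaults used by randomForest: floor(sqrt(p)) for classification,
# max(floor(p / 3), 1) for regression, where p is the number of predictors.
default_mtry <- function(p, classification = TRUE) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

# Assumed number of predictors; adjust p if your data differs.
p <- 12
mtry_class <- default_mtry(p)                          # classification
mtry_reg   <- default_mtry(p, classification = FALSE)  # regression

print(c(classification = mtry_class, regression = mtry_reg))
```

Under that assumption the classification default sits close to the value that tuneRF searches around, so the tuning step is a refinement rather than a big change.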
According to tuneRF the optimal mtry is 2. With this information we can call the randomForest function to create the model. This can take a few minutes to run depending on the power of your machine. Please be patient.
    # income must be a factor for randomForest to build a classification forest
    adult_rf <- randomForest(income ~ ., data = data$train,
                             mtry = 2, ntree = 1000,
                             keep.forest = TRUE, importance = TRUE)
When we have the Random Forest model we can easily apply it to the validation set.
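    # Predicted probability of the positive class (second column) on the validation set
    adult_rf_pr <- predict(adult_rf, type = "prob", newdata = data$val)[, 2]
    # ROCR prediction object pairing the scores with the true labels
    adult_rf_pred <- prediction(adult_rf_pr, data$val$income)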
It’s important to evaluate the model using diagnostic tests. One such test is the ROC curve, which plots the true positive rate against the false positive rate as the classification threshold varies.
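    # True positive rate against false positive rate
    adult_rf_perf <- performance(adult_rf_pred, "tpr", "fpr")
    plot(adult_rf_perf, main = "ROC Curve for Random Forest", col = 2, lwd = 2)
    abline(a = 0, b = 1, lwd = 2, lty = 2, col = "gray")
    # Save a copy of the plot to disk
    dev.copy(png, filename = "rocplot.png")
    dev.off()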
The dashed line at the 45 degree angle represents random guessing. A perfect model would produce a curve that hugs the top left-hand corner of the chart, giving an area under the curve of 1. We have a good prediction here as the curve sits well above the dashed line.
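The area under the curve (AUC) puts a single number on this picture. Here is a minimal base R sketch that computes it from ranks; the function name, scores and labels are made up for illustration and are not results from the Adult data.

```r
# AUC equals the probability that a randomly chosen positive case gets a
# higher score than a randomly chosen negative one (ties count as half).
auc_rank <- function(scores, labels) {
  r     <- rank(scores)  # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores and true labels, for illustration only.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1,   1,   0,   1,   0,   1,   0,   0)

auc <- auc_rank(scores, labels)
print(auc)
```

A value of 0.5 corresponds to the dashed line and 1 to a perfect ranking. With the ROCR objects from this post, performance(adult_rf_pred, "auc")@y.values[[1]] should give the equivalent figure for the model.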
Random Forests give us information about variable importance. This is another feedback loop we can use to optimise the model.
    # Check for variable importance
    importance(adult_rf)
    varImpPlot(adult_rf)
We can see from this chart that Country of Origin and Race variables are not contributing much to the model. The next step would be to remove these variables and re-run. I’ll leave that to you.
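If you want a starting point, one way to approach it is to drop the weakest variables by a rule of thumb and rebuild the model formula. The importance scores, the variable names and the 10% cut-off below are all hypothetical; in practice you would take a column from importance(adult_rf), such as MeanDecreaseGini, and use the real column names from the data.

```r
# Hypothetical importance scores; variable names are invented for illustration.
imp <- c(age = 410, education = 350, occupation = 290,
         hours_per_week = 220, race = 35, native_country = 28)

# Keep variables whose score is at least 10% of the top score
# (an arbitrary cut-off chosen for illustration).
keep <- names(imp)[imp >= 0.10 * max(imp)]

# Build a formula from the surviving predictors.
reduced_formula <- reformulate(keep, response = "income")
print(reduced_formula)

# The reduced model could then be fitted with something like:
# randomForest(reduced_formula, data = data$train, mtry = 2, ntree = 1000)
```

It is worth comparing the ROC curve of the reduced model against the original before settling on the smaller variable set.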
To conclude, Random Forests are used extensively in data models (see Kaggle.com). They are often highly accurate and are well suited to large data sets. With many tuning options we can customise the model to our data. One of the downsides is that they can be a little opaque: just like in a real forest, we struggle to see the wood for the trees. Having said this, I’ve yet to see a model that is truly transparent.
You can download the complete R code here.