Deploying R/Python scripts

After we've cleaned and transformed data and produced a strong predictive model, the last thing we want is for it to sit on the shelf undeployed.

The problem is that deployment often means heavy engineering work, and depending on the resources at your disposal that can be a real obstacle.

At the last LondonR meet-up Jo-Fai Chow gave a talk about Domino Data Lab.  Domino provides a web service that lets us upload an R/Python script and expose it over a REST API, which an application can then call with a simple web request.

Here’s an example of a script that has been deployed.  There’s a lot more to Domino that I’m still learning about, but based on early research it looks like a great way to deploy models.

Domino Dashboard

 


Interactive time series with 3 lines of R

Time series analysis is a fundamental aspect of performance management.  Before we begin to model a time series there’s a huge dividend in simply visualising the data: we can see whether there is a seasonal pattern and whether the series is additive or multiplicative.

R comes with good plotting, but one thing I’ve always wanted is the ability to zoom around a time series, because patterns at the lower level are not always visible. We often have to aggregate over time to see the pattern, and that’s a transform I could do without.

At the last LondonR meetup I saw an excellent presentation from Enzo Martoglio, who demoed HTML widgets.  HTML widgets for R provide a bridge between R and HTML/JavaScript visualization libraries.

This opens a door for analysts and enables modern tools to be used for communicating model output.  I’ll leave that for another post. Back to time-series.

I kid you not, with 3 lines of code I was able to create an interactive time-series chart with zoom.  The package to use is dygraphs.   Basic example below.

 
#Data sourced from Rob Hyndman's Time Series Data Library 
#Hyndman, R.J. Time Series Data Library, http://data.is/TSDLdemo 
#Data Description 
#Monthly sales for a souvenir shop on the wharf at a beach resort town 
#in Queensland, Australia. Jan 1987-Dec 1993 

library(dygraphs) 
library(fma) 
dygraph(fancy, main = "Monthly Sales") %>% dyRangeSelector()

Interactive Chart

We can see the seasonal pattern and that the series is multiplicative: the seasonal swings grow with the level.  This means a log transform will be needed before an additive model can be used. I’ll be posting more about time series in the coming months.
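
As a quick follow-up check (not part of the original three lines), the same chart on a log scale should show roughly constant seasonal swings:

#Same data on a log scale - the seasonal swings should now look stable
dygraph(log(fancy), main = "Monthly Sales (log scale)") %>% dyRangeSelector()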

These widgets can be output to the usual places:

  • R Console
  • R Markdown document
  • Shiny Web App (a minimal sketch follows this list)
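
To illustrate the Shiny route, here’s a minimal sketch (not from the original post; the app layout and output name are my own):

library(shiny)
library(dygraphs)
library(fma)

#Minimal Shiny app embedding the same dygraph (a sketch, not the post's code)
ui <- fluidPage(dygraphOutput("sales"))

server <- function(input, output) {
  output$sales <- renderDygraph(
    dygraph(fancy, main = "Monthly Sales") %>% dyRangeSelector()
  )
}

shinyApp(ui, server)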

There are so many use cases for this.  It’s going to be interesting to see where things go.

 


Random Forests with R

This post covers modelling and evaluation steps from the CRISP-DM methodology.

When making decisions in business it’s prudent to use business facts. I’m sure I have no need to convince you of this.

Sometimes the decision we’re trying to make involves many variables, actors, events and outcomes.  This can make it hard to understand the data.   For instance, how do we make a decision on who to market a new product to?  All the customers?  Some of them?  What about a complex investment decision?

Decision Trees can be used to solve this problem.  I remember learning about Decision Trees when I was studying for my CIMA exams to predict the expected value from a business decision.  I’ve used the technique many times to analyse outcomes.

Here’s a basic example.

decision tree

Decision Trees are a popular tool in business for evaluating decisions as they break down complex decisions into parts that are easier for people to understand. Of course, in the example above I’ve only included one uncertain event.  In reality there are many.

The predictive nature of decision trees means they are also employed as supervised learning models for classification and regression problems.  With modern analytical tools like R we can feed data in and have an algorithm grow a decision tree that classifies it.

The limitation of Decision Trees comes from the fact that the algorithms are top-down and greedy: there is no backtracking, and there can be a tendency to over-fit the training data. If you want to learn more about the algorithms there’s an excellent blog post covering this in more detail here.

Modelling with Random Forests

Random Forests are an ensemble learning method that operates by building many trees and combining their outputs: the mean prediction for regression, or the majority vote for classification.  We can then apply the fitted model to the test data.

Random Forests display excellent levels of accuracy and the algorithm is well suited to large data sets.  We also get feedback about variable importance which gives analysts a feedback loop for tuning the model.

Let’s have a look at an example with R using classification to predict salary levels.

We start with pre-cleaned data that has been partitioned into training and validation sets.  The original source for this data is the “Adult” dataset hosted on UCI’s Machine Learning Repository.

The purpose of the model is to predict who earns more than $50K income.  In the training data the Income variable holds an integer 1 if true and 0 if false.

Let’s get into the R code. I’ve made the pre-cleaned data available on Dropbox.

library(randomForest)
library(ROCR)

#Data has been pre-cleaned and pre-processed
clean_data <-"https://dl.dropboxusercontent.com/u/47056051/RandomForests/clean_data.txt"
download.file(clean_data, "data.txt")
data <- dget("data.txt", keep.source = FALSE)
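
One thing worth checking before modelling: randomForest treats a factor response as a classification problem. The pre-cleaned data may already store income as a factor, but if it arrives as 0/1 integers a quick conversion makes the intent explicit (a sketch; the column name is taken from the description above):

#Classification needs a factor response; a numeric one would trigger regression
str(data$train$income)
data$train$income <- as.factor(data$train$income)
data$val$income   <- as.factor(data$val$income)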

The Random Forest algorithm can be pre-tuned to optimise the model. We can change the parameters by hand or use functions to give us the optimal value. One such value is the number of variables split at each node of the tree (mtry).

bestmtry <- tuneRF(data$train[-13], data$train$income, ntreeTry=100,
                   stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE, doBest=FALSE)

You can skip this step and Random Forest will take the sqrt(number of variables) as the default value.
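
For reference, that default can be computed directly (a one-liner, assuming income is the only non-predictor column in the training data):

#Default mtry for classification: square root of the number of predictors
floor(sqrt(ncol(data$train) - 1))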

According to tuneRF the optimal mtry is 2. With this information we can call the randomForest function to create the model. This can take a few minutes to run depending on the power of your machine. Please be patient.

adult_rf <- randomForest(income ~ ., data = data$train, mtry = 2, ntree = 1000,
                         keep.forest = TRUE, importance = TRUE)

When we have the Random Forest model we can easily apply it to the validation set.

adult_rf_pr <- predict(adult_rf,type="prob",newdata=data$val)[,2]
adult_rf_pred <- prediction(adult_rf_pr, data$val$income)

 

Diagnostics

It’s important to evaluate the model using diagnostic tests. One such test is the ROC curve, which shows how well the model discriminates by plotting the true positive rate against the false positive rate.

adult_rf_perf <- performance(adult_rf_pred,"tpr","fpr")
plot(adult_rf_perf,main="ROC Curve for Random Forest",col=2,lwd=2)
abline(a=0,b=1,lwd=2,lty=2,col="gray")
dev.copy(png,filename="rocplot.png")
dev.off()

ROC plot

The dashed 45-degree line is the ‘best guess’ (random classification). The closer the curve hugs the top left-hand corner of the chart, the better the model. We have a good prediction here as the curve sits well above the ‘best guess’.
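
To put a single number on this, ROCR can also report the area under the curve from the same prediction object (not shown in the original post):

#Area under the ROC curve: 1 is perfect, 0.5 is the 'best guess' line
adult_rf_auc <- performance(adult_rf_pred, measure = "auc")
adult_rf_auc@y.values[[1]]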

Random Forests gives us information about the variable importance. This is another feedback loop we can use to optimise the model.

#Check for variable importance
importance(adult_rf)
varImpPlot(adult_rf)

Variable importance plot

We can see from this chart that Country of Origin and Race variables are not contributing much to the model. The next step would be to remove these variables and re-run. I’ll leave that to you.
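
If you want to try it, here is a hedged sketch of that next step. The column names are assumptions; check names(data$train) for the names actually used in the pre-cleaned data.

#Sketch: drop the low-importance variables and refit
#Column names below are assumptions - inspect names(data$train) first
train_trim <- subset(data$train, select = -c(native_country, race))
adult_rf2  <- randomForest(income ~ ., data = train_trim, mtry = 2,
                           ntree = 1000, importance = TRUE)
varImpPlot(adult_rf2)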

To conclude, Random Forests are used extensively in data modelling (see Kaggle.com). They are highly accurate and are well suited to large data sets, and with many tuning options we can customise the model to our data. One of the downsides is that they can be a little opaque: just like in a real forest, we struggle to see the wood for the trees. Having said this, I’ve yet to see a model that is truly transparent.

You can download the complete R code here.

 


Why is R so useful

As an Excel power user (someone called me a guru recently!) I know Excel can be used to do pretty much anything – I’ve even seen Excel being used to play the Game of Life.   If this is the case why do we need R?

In this post I’ll tell you why and then show you.

Reproducible

We can write an R script once to do any of the following:

  • Acquire data
  • Clean
  • Transform
  • Analyse
  • Model
  • Report
  • Publish

If the R code is written in the right way it’s reproducible by default.  This is very beneficial if the data is dynamic or if you need to pass the script to a colleague: you don’t want them getting a different result.

Flexible

Excel is flexible, as mentioned above, but it is also limited by the resources Microsoft decides to invest in the product.  Even with unlimited resources, I think Microsoft would only include functionality that is useful to a broad spectrum of people.

How about R?  First off, it’s open source, which means if we need a really niche feature we can write it ourselves, and many people do.  Furthermore, R’s architecture is built around a package system, which makes it highly adaptable.  To be fair, Excel is extensible too (VBA, COM, .NET), but to write an extension you have to learn a different language.  Packages in R are written in R: there’s a little extra to learn, but learn to use R and you’re very close to being a package author.

At the time of writing there are over 6,000 packages available for use.  Given R’s long lineage I’m willing to bet that the problem you have has already been solved by someone else.

Scalability and Availability

You can run R in lots of different places:

  • Laptop
  • Tablet
  • Cloud
  • In databases
  • On really big machines

It’s important to know that R runs in memory. This makes it very fast, but it also means that if you run out of memory you’re stuck.  This isn’t a major problem as memory is cheap, and if you really want power you can run on the big Amazon/Microsoft/Google cloud services.

Publish 

With R we can send the output to lots of different places.  We’re not constrained by a worksheet (a quick example follows the list):

  • Slides
  • Document
  • Web page
  • Application
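
For instance, one R Markdown source can be rendered to several of these targets with the rmarkdown package (the file name is hypothetical):

library(rmarkdown)

#One source document, several outputs
render("customer_report.Rmd", output_format = "html_document")
render("customer_report.Rmd", output_format = "word_document")
render("customer_report.Rmd", output_format = "ioslides_presentation")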

Okay, enough theory.  Here’s an example of customer analysis.   The problem we’re seeing is that customers are disappearing.  We need to find more insight before we can look to mitigate the problem.

As you can see in the Excel screenshot below, I’ve pulled in data from a database and grouped customers based on demographics. We see 12 months of customer count.

The second table is a copy of the first but with the customer counts aligned to the left (for charting/formula clarity).  I did this by hand; you could do it with a formula, to be fair.  The next table calculates retention, which takes the current month’s count and divides it by the month 1 count.  With this third table it’s simple to create a chart.

Excel customer analysis

We’ve solved our problem.  What is going on with group 4?

It’s not all good though.  What happens if we see more groups in the database, or more months?  This worksheet doesn’t grow.  I know we can structure the workbook to make it expandable and use dynamic ranges to populate the chart.  That’s fine for me; I’m a power user after all.  What about the other users?  Wouldn’t it be good if we could forget about structure and layout and just focus on the problem at hand, as I’ve done here?

This is what R gives us.  I’ve not posted the R code on this page as WordPress is a bit picky; I wrote the code and published it to a website using R Markdown.  R Markdown gives us a way to write a report or slides and embed R directly in the document.  You can find a copy of the R Markdown here and the published report here.
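
Purely to give a flavour of what the published code does, here’s a hedged sketch of the retention calculation; the data frame and column names (cust, customer_group, month, customers) are made up for illustration:

library(dplyr)
library(ggplot2)

#cust: one row per group per month
retention <- cust %>%
  arrange(customer_group, month) %>%
  group_by(customer_group) %>%
  mutate(retention = customers / first(customers))

#No manual re-alignment needed; extra groups or months just work
ggplot(retention, aes(month, retention, colour = factor(customer_group))) +
  geom_line()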

I hope you can see how R lets us forget about the structure. If you run the R code on your machine you’ll get the same result.  Of course, this comes at the expense of the worksheet, which gives R a steep learning curve.  Let me tell you, though, that once over the curve you have access to a new world of analysis capabilities.


Web Scraping with R

Web scraping is used to pull data from web pages when an API is unavailable.  Imagine copying the data by hand (a horrible chore): that, automated, is essentially web scraping.

I’ve wanted to get my head around this for a while, and to see whether modern sites are structured in ways that make it easier.

When I was researching wine types for another project an opportunity came up.

Wine.com has a stellar dataset and they actually have a good API but I’ll ignore that for this demo.

The packages we can use to make our life easier are:

rvest – this is from Hadley Wickham and makes scraping web data simple.

stringi – this comes from Marek Gagolewski, Bartek Tartanus et al and gives us many functions to manipulate text from any locale.  In Excel we have functions like Trim(), Left(), Right(), Mid() – well, after reading the docs for stringi my mind was truly blown.
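
For Excel users, here’s a rough mapping to give a feel for it (the example string is made up; the functions are standard stringi):

library(stringi)

x <- "  Cabernet Sauvignon  "

stri_trim_both(x)    #roughly TRIM(): strips leading/trailing whitespace
stri_sub(x, 3, 10)   #roughly MID(): characters 3 to 10
stri_sub(x, 1, 5)    #roughly LEFT(): first five characters
stri_sub(x, -5, -1)  #roughly RIGHT(): last five characters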

There’s one other tool that’s recommended, Selectorgadget.

The first thing to do when scraping a page is to browse to the page.

http://www.wine.com/v6/wineshop/list.aspx?N=7155+2096&pagelength=25

We can then use the Selectorgadget tool to detect the right CSS selector. In the screenshot below I’ve clicked over the abstract and the class .productAbstract has been selected.  We know it’s the right selector as the other items in the list are highlighted.  A little patience is required here as there can be different selectors in the CSS which can give erroneous data.

SelectorGadget screenshot

The other selector we need is .listProductName.

With this information we’re good to go.

The script starts with loading the libraries and setting the URL we found earlier. I’ve increased the pagelength for illustration purposes.

library(rvest)
library(stringi)

url <- "http://www.wine.com/v6/wineshop/list.aspx?N=7155+2096&pagelength=100"

To pull the web page into R’s memory we simply use:

page <- read_html(url)

We can then use the selector we found earlier to extract the data we’re interested in.

product_name <- page %>%
  html_nodes(".listProductName") %>%
  html_text()

products

We can do a similar thing to extract the Abstract.

abstract <- page %>%
  html_nodes(".productAbstract") %>%
  html_text()

abstract

That is not helpful at all.  We have \r\n characters and too much whitespace.  Fear not, we can use stringi to clean up the data.  When I look at a problem like this I break it down into smaller parts and solve each one individually, so in that spirit we’ll handle the \r\n problem first.

abstract <- stri_replace_all_regex(abstract, "\r\n", "")

This simply replaces the string “\r\n” with an empty string.  Of course, with regex there are many ways to skin a cat.  If you want to learn more, feel free.

We have spaces left to eliminate.  Notice I’m searching for double space as I don’t want to remove the space from the abstract itself.

abstract <- stri_replace_all_regex(abstract, "  ", "")

We split the abstract on “ from ” to separate the variety from the origin, bind the result to the product names, add column names and export to CSV.  Job done.

abstract <- stri_split_fixed(abstract, " from ", simplify = TRUE)
products <- cbind(product_name, abstract)
colnames(products) <- c("ProductName", "Variety", "Origin")
write.csv(products, file = "Product.csv")

Final data set

The complete code is listed below.

library(rvest)
library(stringi)

url <- "http://www.wine.com/v6/wineshop/list.aspx?N=7155+2096&pagelength=300"

page <- read_html(url)

product_name <- page %>%
  html_nodes(".listProductName") %>%
  html_text()

abstract <- page %>%
  html_nodes(".productAbstract") %>%
  html_text()

abstract <- stri_replace_all_regex(abstract, "\r\n", "") %>%
  stri_replace_all_regex("  ", "") %>%
  stri_split_fixed(" from ", simplify = TRUE)

products <- cbind(product_name, abstract)

colnames(products) <- c("ProductName", "Variety", "Origin")
write.csv(products, file = "Product.csv")

16 lines of code.  In summary, we’ve scraped the page, selected the pertinent data and used basic regex to clean it up.  Data cleaning isn’t so bad after all 🙂
