Sometimes we bump into problems that we can’t solve on our own. When this happens you can walk away defeated or think of it as an opportunity to learn something new and gain an understanding of your limits.

There are many ways to learn but the one I find most effective is learning to solve a specific problem by engaging with the community.

A good friend of mine asked me for help to create a chart that is essentially a scatter chart with an ellipse enclosing the points :

The screen above shows how we can do this in Excel by hand drawing an ellipse to cover the points of a scatter chart.

Not good enough !

It turns out this visualisation problem is also a computer science problem that is studied by a large community of experts.

Hmm. This isn’t a computer science blog and I’m not a computer scientist. How can we bridge this gap?

I’ve been using R in the office to solve data analysis problems. One of the benefits of R is the sheer volume of packages that can be used to extend its functionality, 5002 packages at the time of writing and it’s growing constantly.

I must give thanks to :

Martin Maechler [aut, cre], Peter Rousseeuw [aut] (Fortran original), Anja Struyf [aut] (S original), Mia Hubert [aut] (S original), Kurt Hornik [trl, ctb] (port to R; maintenance(1999-2000)), Matthias Studer [ctb], Pierre Roudier [ctb]

The group of people above authored an R package called Cluster. Within the Cluster package we find a function called Ellipsoidhull.

Before we dive into the code, I have to recommend using R Studio to interact with R. R Studio provides a very user-friendly environment for interacting with R.

Press CTRL+SHIFT+N to create a new R script.

Paste the code into the window in the top left. The window in the bottom left shows you the R that is executed.

Here’s the complete code with comments. You can download the script and data here.

#Activate the package library(cluster) # Load the data from CSV files into R objects xy <- read.csv("c:/R/Ellipsoid/PageViews.csv") # Converts the object into a Matrix # We need a Matrix later on in the code. xy <-cbind(xy$PageViews, xy$Conversions) # I've loaded a second dataset to show some additional # R functionalility. # In practice you'd refactor the data into 1 file. za <- read.csv("c:/R/Ellipsoid/PageViews_2014.csv") za <-cbind(za$PageViews, za$Conversions) # For each of the series we call ellipsoidhull # Ellipsoidhull gives us the minimum hull covering all points exy <- ellipsoidhull(xy) exy2 <- ellipsoidhull(za) # We call predict to get the actual points that enable the ellipse # to be drawn series1 = predict(exy) series2 = predict(exy2) # We need to calculate the minimum and maximum points to set the correct Axis maxx <- max(series1[,1]) maxy <- max(series1[,2]) minx <- min(series1[,1]) miny <- min(series1[,2]) maxx1 <- max(series2[,1]) maxy1 <- max(series2[,2]) minx1 <- min(series2[,1]) miny1 <- min(series2[,2]) #We can carry out conditional logic in R with IF #It's used here to determine which max/min should #be used to scale the axis. if(maxx1 > maxx) { maxx <- maxx1 } if(maxy1 > maxy) { maxy <- maxy1 } if(minx1 < minx) { minx <- minx1 } if(miny1 < miny) { miny <- miny1 } # Par enables the setting of parameters for output # Here we're setting a margin for the chart to enable the legend # to be offset par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE) # This creates the initial chart with axis and labels # xy : the data matrix of points # col = sets the colour of the path # xlab, ylab, main = text labels # I decided to not plot the points as it makes the chart too busy # type="n" : plot nothing # type="p" : plot points # xlim, ylim sets the range for the axis # cex scales text on the chart. Here I'm reducing the size by 10% plot(xy,col="blue", xlab="Page Views", ylab="Conversions", main = "Views vs Conversions", type='n', xlim=c(minx,maxx),ylim=c(miny,maxy), cex.lab=0.9, cex.axis=0.9, cex.main=0.9, cex.sub=0.9) # Plot the legend # inset allows the legend to be offset from the topright # lty, lwd sets the line type and width legend("topright",inset=c(-0.25,0),c("2013","2014"),cex=0.7,lty=c(1,1), lwd=c(2.5,2.5),col=c("blue","red")) #Plot the first ellipse using the coordinates in the series1 matrix lines(series1, col="blue") #Plot the centre point of the ellipse hull #pch 13 = symbol type points(rbind(exy$loc), col = "blue", cex = 1.5, pch = 13) #-------------------------------- #Repeat for the second hull lines(series2, col="Red") points(rbind(exy2$loc), col = "red", cex = 1.5, pch = 13)

Press the Source button to execute the script.

One of the benefits of using R vs Excel is that the scripts can be run on the server. With a little data integration this problem is solved.

If you want to learn more about R a good book that isn’t too heavy is R in Action.