Since I started my journey 18 months ago to learn about how statistics can be applied in the business context I’ve come across various platforms and applications that make this easier. Despite this I’ve not yet found a clean way of integrating our Data Warehouse data to statistical apps. I’m still searching but in the meantime let me introduce you to Python. Python is a very popular, easy to use, and open source programming language that is used all over the Internet.
This post covers a common business scenario where there is a need to understand how various sales drivers impact on the level of sales.
We can use a statistical term called Correlation to determine how the different sales drivers depend on each other. This is a very deep topic so for the purpose of this post all you need to know is that the output from Correlation is a value between +1 and -1 which gives an indication as to the strength of dependence between two variables, the closer the value is to 1/-1 the stronger the dependence.
You can download Python from http://www.python.org/download/ I used version 2.7 for this exercise.
If you want to follow along you’ll also need to download a couple of add-ins :
Numpy : http://numpy.scipy.org/ – This provides the statistical functions we need.
Matplotlib : http://matplotlib.org/ – This makes it very easy to create charts.
To begin with we acquire our business data. In the example below we have Sales volumes and measures that we think might impact sales : Employee Happiness, Number of Sales Managers, Number of Support Managers, Weather.
This data needs to be saved into a CSV file with no headers.
For this example, the CSV file is saved in the same folder that Python is installed in but in a production environment I would recommend you use a separate folder.
The next step is to create the Python code that will calculate the correlation of each variable. On my statistical journey I quickly learnt that the results of calculations are often large data sets that are difficult to interpret. To aid interpretation the recommended practice is to create a chart. At this point I’d forgive you for thinking this will require reams of code. Fear not.
Here’s the code :
# Code libraries from numpy import corrcoef from pylab import pcolor, show, colorbar, xticks, yticks, arange import csv # Load the CSV file into memory data = csv.reader(open("sales_vars.csv", "rU")) # This converts the data in memory into a List dat = list(data) # This calculates the correlation values # rowvar = 0 tells the function we have variables in columns # If the variables are held in rows you can remove this from the call D = corrcoef(dat,rowvar=0) # set parameters for the chart pcolor(D) colorbar() # This sets up the scale and ticks of the chart. yticks(arange(0.5,5.5),range(0,5)) xticks(arange(0.5,5.5),range(0,5)) # Display the chart show()
To run the code fire up a terminal and execute the Python script.
After a few seconds you should see a heat chart matrix that shows how strong the variable dependence is.
From this chart we can see the first 3 variables tell us where the action is.
Now you have ascertained this you can understand the relationships further using other statistical techniques such as Regression.
You can download a copy the CSV and Python script here.
I hope you’ve enjoyed this post. Please feel free to comment.