Data Analytics modelling: why tune by hand?

When we're carrying out analysis, once we've got clean, transformed data we need to create a model.

There are many types of model that can be used, depending on the type of analysis or prediction being made: for instance, predicting a class, predicting values, or finding unusual points.

Within each collection of models I'd really like to be able to spin through them, selectively apply each to my dataset, and see the Accuracy, p-value, Sensitivity, Specificity etc. ranked.

With the model algorithms already pre-baked why can’t we just consume them in a fairly efficient way?

Of course we can do this by hand with Python or R but it would be much better if the software handled this type of plumbing/set-up.

Here's an example of doing it with R from Suraj V Vidyadaran. He cycles through 17 classification algorithms, applying each one and outputting a confusion matrix for it. This is a great resource for learning R, but it also shows that there are patterns in the modelling that could be abstracted away, in my opinion.
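
To make the idea concrete, here's a minimal sketch (not Suraj's code) using the caret package: it loops over a few classification methods, fits each to the same training split and ranks them by accuracy from the confusion matrix. The method names and the iris dataset are just placeholders.

# A minimal sketch, assuming the caret package is installed, of looping over
# a few classification methods and ranking them by test-set accuracy.
library(caret)

data(iris)
set.seed(42)
idx      <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

methods <- c("rpart", "knn", "lda")   # swap in any caret method names

results <- lapply(methods, function(m) {
  fit  <- train(Species ~ ., data = train_df, method = m)
  pred <- predict(fit, test_df)
  cm   <- confusionMatrix(pred, test_df$Species)
  data.frame(model = m, accuracy = unname(cm$overall["Accuracy"]))
})

ranked <- do.call(rbind, results)
ranked[order(-ranked$accuracy), ]   # highest accuracy first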



Collecting requirements from users for a BI project

When holding a workshop to collect requirements for a BI project you can take two broad approaches.

  1. When customers know what they want it's easier to talk dimensions and facts, e.g. £ sales by customer, product, brand and time, or £ Operating Expenses by Account, Department, Cost Center and Legal Entity. If the users have the right subject knowledge they will probably realise that £ Operating Expenses is not available by Product.
  2. Sometimes users don't quite know what they want; in that case you need to delve into the underlying processes. For instance, take a Billing process. It can be helpful to turn this into a process diagram, as it will help tease out requirements. It can also be used later to help document functional requirements when combined with an ERD.

The takeaway here: don't assume users know what they want. Adapt your approach to the situation at hand.


Business Analysis for BI projects

BI projects have complex elements such as dimensional modelling, fact modelling, hierarchies, calculations, etc.

This makes it essential that Business Analysis is applied on all projects. It's during the BA phase that requirements are picked up. As BI professionals it's important we use our judgement to tease out requirements that users may not be aware of: for instance, slowly changing dimensions, late arriving facts and late arriving dimensions.

We can also start to gauge data quality, although data profiling is the key tool for determining it. Users will often give their opinion on data quality too.

We can also get an understanding of the largest fact tables and any fact-grained dimensions.

For some BI projects there are limited facts; we see this with HR & Legal. It's very important to determine this early on. It doesn't prevent users from requesting facts: for instance, a hiring manager may wish to see the number of new hires or the number of applicants. If this fact isn't available it's going to have to be created or inferred from another field such as Hire Date.
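
As a sketch of that kind of inference (column names are illustrative; dplyr and lubridate assumed), a "new hires by month" fact can be derived from nothing more than the Hire Date:

# A minimal sketch of deriving a "new hires by month" fact from an employee
# table that only carries a hire date. Column names are illustrative.
library(dplyr)
library(lubridate)

employees <- tibble::tibble(
  employee_id = 1:6,
  hire_date   = as.Date(c("2023-01-15", "2023-01-28", "2023-02-03",
                          "2023-02-17", "2023-02-25", "2023-03-01"))
)

new_hires_by_month <- employees %>%
  mutate(hire_month = floor_date(hire_date, unit = "month")) %>%
  count(hire_month, name = "new_hires")

new_hires_by_month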

It's also important we understand the process that underlies the OLTP database. With BA tools we can document this process. I had one situation where a billing process included dynamic revenue types. The user wanted to analyse & report on revenue, which required master data to be added to the ETL pipeline. No small feat. It's things like this you want to know about early on.

There's a lot more too that I won't go into: currency translation, cost allocations, complex measures.

It’s vitally important these things are well understood before development starts.

Do you agree?


Cleaning data with T-SQL

When creating BI models and looking into the guts of an ERP system, we see data artifacts from user behavior, or sometimes poor design, that require us to get out our cleaning gear.

[Image: https://www.flickr.com/photos/internetsense/6789942482]

It goes without saying the best course of action is to clean data at source  – if only this could happen more often.

In this post I’ll show a little example I learnt recently to infer dates with T-SQL.

In the scenario below we have HR data showing employees at various positions over time.
The time frame of each position is represented with Valid From and Valid To dates.

The integrity of this data is crucial if the fields are going to be used in a join.

When the validTo date is flaky we can infer it by looking at the validFrom
of the employee's next position. Usually this requires a complex query with self-joins and other manners of madness.

T-SQL in SQL Server 2012+ gives us the LEAD function. We can use LEAD() to write a simpler query with fewer lines of code. It works by partitioning the data into sets split by employee. Each set is sorted by validFrom, so the next row gives us the date needed. I had to use DATEADD to subtract one day from the next row's validFrom, as there was an issue with subtracting an int directly from a date.

You can set up the data with the code below. I’ve added real ValidTo dates to show the inference is correct.

CREATE DATABASE [HR]
GO
USE HR;
CREATE TABLE [dbo].[employees](
	[id] [smallint] NOT NULL,
	[name] [nvarchar](50) NULL,
	[employeeId] [smallint] NULL,
	[validFrom] [date] NULL,
	[validTo] [date] NULL)


INSERT INTO [dbo].[employees]
	(id, name,employeeId, validFrom, validTo)

values   (1,'Nelson', '1001', '20150101' ,'20150410')
		,(2,'Nelson', '1001', '20150411' ,'')
		,(3,'John', '1002', '20150301' ,'20151002')
		,(4,'John', '1002', '20151003' ,'')
		,(5,'Paul', '1003', '20150101' ,'20150410')
		,(6,'Paul', '1003', '20150411' ,'20160410')
		,(7,'Paul', '1003', '20160411','')
		,(8,'George', '1004', '20150310' ,'20160612')
		,(9,'George', '1004', '20160613' ,'20170410')
		,(10,'George', '1004', '20170411' ,'')
GO

-- Infer each position's end date as the day before the same employee's
-- next validFrom. Rows with no next position fall back to the LEAD default
-- of '19000102', which becomes 1900-01-01 after the DATEADD.
SELECT id, name, employeeId, validFrom, validTo,
	    DATEADD(DAY, -1, LEAD(validFrom, 1, '19000102')
			OVER (PARTITION BY employeeId ORDER BY validFrom)) AS InferredValidTo
FROM [dbo].[employees]
ORDER BY employeeId;

[Screenshot: LEAD query output]


Consuming R through API endpoints with Domino Data Lab

This is a follow-up to the post about deploying R models using web services.

Within an hour I was able to take an existing R function, publish it to the Domino Data Lab web service AND write a simple Python script to call the service. Domino make it super simple.

The R code I used was the Sales Price calculator I previously wrote about on Rpubs.

I stripped out any calling code so the R script uploaded to Domino was just a function.
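
For reference, the uploaded script only needs to define the function itself. The argument list below mirrors the parameters sent from the Python caller later in this post, but the body is a hypothetical stand-in rather than the real calculator (which lives in the Rpubs write-up).

# Hypothetical stand-in for the Sales Price calculator published to Domino.
# Arguments mirror the parameters sent by the Python caller; the pricing
# logic here is illustrative only.
sales_price <- function(business_name, rating, turnover,
                        competitor_intensity, months_trading,
                        product, quantity) {
  base_price <- 100                                         # assumed list price
  discount   <- min(0.3, rating * 0.02 + months_trading * 0.005)
  uplift     <- competitor_intensity * 0.01                 # competitive adjustment
  unit_price <- base_price * (1 - discount + uplift)

  list(business = business_name,
       product  = product,
       quantity = quantity,
       total    = round(unit_price * quantity, 2))
}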

[Screenshot: Domino project screen]

The next step was to create the API endpoint that calls the R function. This is as simple as pasting in the name of the R script along with the function name.

[Screenshot: Domino API endpoints]

With the API endpoint published it's just a case of consuming it. Domino provide sample code in lots of different languages. I chose Python (I'm currently learning Python).

This is a very simple example:

import unirest   # HTTP client used in Domino's sample code
import json
import yaml

# Inputs for the Sales Price calculator
business_name = "Nelsons"
rating = 6
turnover = 20000
competitor_intensity = 5
months_trading = 12
product = "Maloja FadriM Multisport Jacket"
quantity = 20

# Call the published endpoint; the values in "parameters" are passed through
# to the R function.
response = unirest.post("https://app.dominodatalab.com/v1/leehbi/salesprice/endpoint",
    headers={
        "X-Domino-Api-Key": "Enter your private API key here",
        "Content-Type": "application/json"
    },
    params=json.dumps({
        "parameters": [business_name, rating, turnover, competitor_intensity, months_trading, product, quantity]
    })
)

# Parse the response body and print the function's result
response_data = yaml.load(response.raw_body)
print(response_data['result'])

[Screenshot: Python output]

The beauty of this is that the code can be called from anywhere. My Domino account is on the free instance, but paid instances can be spun up with huge amounts of RAM.

There are so many options around now that help with the analytical workflow. Of course, data preparation is still time-consuming, but we're seeing tools and packages that make it more efficient.

How about taking a legacy spreadsheet, moving the processing out to R, and then publishing it for consumption as a web app using Shiny or Domino?
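
For the Shiny route, the shape is roughly this minimal sketch, where calc_price stands in for whatever logic has been lifted out of the spreadsheet:

# A minimal Shiny sketch that wraps spreadsheet-style logic in a web app.
# calc_price is a hypothetical placeholder for the real calculation.
library(shiny)

calc_price <- function(quantity, unit_price) quantity * unit_price

ui <- fluidPage(
  numericInput("quantity", "Quantity", value = 20, min = 1),
  numericInput("unit_price", "Unit price", value = 100),
  textOutput("total")
)

server <- function(input, output, session) {
  output$total <- renderText({
    paste0("Total: ", calc_price(input$quantity, input$unit_price))
  })
}

shinyApp(ui, server)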

In the past I've created many Excel apps. I always wanted a way to publish them, to save me from the “file/open/refresh/email” cycle.

We have different options today.
