Lack of open data makes for clunky democracy

Living in the city of Chester, I sometimes think the old city of Deva Victrix (AD 79) is still in the air.

A six-storey luxury student block is planned for development at the end of my road.  Before deciding whether to support or object to it, I wanted to know whether it was based on real need.

Chester already has plenty of student housing, both houses in multiple occupation and newer purpose-built blocks, the latter being more expensive.

Like other university cities, Chester has a superb community of students who bring energy and life to the city.  The University of Chester is an important asset that helps power the city's growth.

Student blocks are being developed by third parties and sold to investors looking for returns at the expense of students and taxpayers.  I wanted to know whether this development was speculative or meeting a real need.

After reaching out to my local authority I was shocked to discover there was no open data available on this controversial subject.  

We live in a data-rich society; in my opinion it's way past time we had this data.  To answer my question I really needed some level of 'occupancy' data.

It didn't take me long to find it.

This data was stuck on a poorly formatted website: available to all, but not exactly formatted for easy access, and with no API.  I can't really blame the third party running the site; they're doing the important job of matching students with housing, after all.  Not sure what happened to the semantic web.

With Python on my Pi I wanted to see whether I could automate the scraping, processing and presentation: a micro-BI project carried out over a couple of Sunday mornings.

Scraping
One of the most popular Python packages for scraping data is Beautiful Soup, whose author works at the New York Public Library!
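
As a rough sketch of the scraping step (the URL, CSS selectors and field names below are invented for illustration; the real listings site and the selectors in Proc.py will differ):

import requests
from bs4 import BeautifulSoup

# Hypothetical listings page: the URL and markup structure are assumptions
resp = requests.get("https://example.com/chester-student-listings")
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

listings = []
for card in soup.select("div.listing"):  # one element per advertised property (assumed)
    listings.append({
        "address": card.select_one(".address").get_text(strip=True),
        "rooms_available": int(card.select_one(".rooms").get_text(strip=True)),
    })

print(len(listings), "properties scraped")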

Storing
It's great to see SQL Server arrive on Linux.  I would have liked to use it, but I needed something much lighter for my Pi.  I was keen to store this data with a schema rather than dump it into JSON.
SQLite is hugely popular, but I'd never had a use for it.  It's on most of our phones, as mobile developers use it for local data persistence.  I thought I'd give it a try; it can't be hard, SQL is a standard after all.
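
For what it's worth, Python's built-in sqlite3 module is enough to keep a proper schema; the table and column names here are just an assumed shape for the scraped data, not the schema used in Proc.py:

import sqlite3

conn = sqlite3.connect("occupancy.db")  # database file name assumed

# A simple schema for the scraped listings rather than a JSON dump
conn.execute("""
    CREATE TABLE IF NOT EXISTS listings (
        scraped_on      TEXT NOT NULL,
        address         TEXT NOT NULL,
        rooms_available INTEGER NOT NULL
    )""")

rows = [("2017-03-05", "Garden Quarter", 12),
        ("2017-03-05", "Hoole", 7)]
conn.executemany(
    "INSERT INTO listings (scraped_on, address, rooms_available) VALUES (?, ?, ?)",
    rows)
conn.commit()
conn.close()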

Presentation
I would have loved to use Power BI, but I didn't bother looking for a Pi data gateway for SQLite on Linux.

In R we have excellent visualisation libraries such as ggplot2, so I wondered what was available in Python.  Bokeh seemed powerful at first glance; it's not at v1 yet, but I expect it will become the Python ggplot2 in time.
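
Something along these lines is enough to produce a standalone chart; the query and column names follow the assumed SQLite schema above rather than the actual code in Dash.py:

import sqlite3
from bokeh.plotting import figure, output_file, save

conn = sqlite3.connect("occupancy.db")
rows = conn.execute(
    "SELECT address, SUM(rooms_available) FROM listings GROUP BY address").fetchall()
conn.close()

areas = [r[0] for r in rows]
rooms = [r[1] for r in rows]

# Categorical x-axis of areas, bar height = available rooms
p = figure(x_range=areas, title="Available student rooms by area")
p.vbar(x=areas, top=rooms, width=0.8)

output_file("occupancy.html")
save(p)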

Hosting
Again, Power BI allows for sharing; it would have been ridiculously easy to create a Power BI dashboard, but for the same reasons as above it was off limits.

Flask is a micro web framework that seemed right up my street.  Are you seeing a trend here?  Writing this code early on a Sunday, I had very little time.
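
Roughly the shape the web piece could take: a single route that builds the chart and returns it as standalone HTML. The data here is a placeholder and the route is invented; the real layout lives in the repo's web folder.

from flask import Flask
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

app = Flask(__name__)

@app.route("/")
def dashboard():
    # Placeholder data; a real route would query the SQLite database instead
    areas = ["Garden Quarter", "Hoole", "Boughton"]
    rooms = [12, 7, 4]
    p = figure(x_range=areas, title="Available student rooms by area")
    p.vbar(x=areas, top=rooms, width=0.8)
    return file_html(p, CDN, "Occupancy dashboard")

if __name__ == "__main__":
    app.run(host="0.0.0.0")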

The code is on GitHub; Proc.py contains the scraping/db work.

You’ll find the Bokeh code in Dash.py

The Flask server is set up in the web folder.   Note this code is not production-ready, although it's been running on my Pi without fail.   If I were to take it off the Pi I would probably use Gunicorn and Nginx to serve the app on an Azure VM.

Incidentally, I did this development on my Windows machine using VS Code with Python Tools installed; it provides a very refreshing debugging/linting experience.  My only gripe was that I'd like to be able to see pandas DataFrames more easily when debugging.  Seeing the stack is fun, but it's too deep in many cases.  I suppose I'm spoilt by RStudio.
I was surprised that I didn’t need to modify the code (other than changing the working path) to get it running on the Pi.  This applied to the full stack.  Oh how times are changing.

Conclusion

The web data does indeed show there are empty properties spread across Chester.
This data should be available to society.  In my opinion we shouldn't need hacking skills to get at it.

After joining the local community action group I found that local authorities across the country have very disparate data services.  It seems the improvements we've seen from central government haven't filtered down to our cash-strapped local authorities yet.

Posted in Uncategorized

Data Analytics modelling: why tune by hand?

When we're carrying out analysis, once we've got clean, transformed data we have to create a model.

There are many types of model that can be used, depending on the type of analysis or prediction being made: for instance, predicting a class, predicting values, or finding unusual points.

Within each collection of models I'd really like to be able to spin through them and apply each to my dataset, then see accuracy, p-value, sensitivity, specificity and so on, ranked.

With the model algorithms already pre-baked why can’t we just consume them in a fairly efficient way?

Of course we can do this by hand with Python or R but it would be much better if the software handled this type of plumbing/set-up.

Here's an example of doing it with R from Suraj V Vidyadaran.  He cycles through 17 classification algorithms, applying each and outputting a confusion matrix for it.  This is a great resource for learning R, but in my opinion it also shows that there are patterns in the modelling which can be abstracted away.
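
For comparison, here's a minimal sketch of the same pattern in Python with scikit-learn: loop over a handful of classifiers, fit each, print its confusion matrix and rank them by accuracy.  The models and dataset are purely illustrative.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(),
    "k-NN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results.append((name, accuracy_score(y_test, preds)))
    print(name)
    print(confusion_matrix(y_test, preds))

# Rank the candidates, best first
for name, acc in sorted(results, key=lambda r: r[1], reverse=True):
    print("{}: {:.3f}".format(name, acc))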

Posted in Analytics, R

Collecting requirements from users for a BI project

When holding a workshop to collect requirements for a BI project you can take two broad approaches.

  1. When customers know what they want, it's easier to talk dimensions and facts, e.g. £ Sales by Customer, Product, Brand and Time, or £ Operating Expenses by Account, Department, Cost Center and Legal Entity.   If the users have the right subject knowledge, they will probably realise that £ Operating Expenses is not available by Product.
  2. Sometimes users don't quite know what they want, in which case you need to delve into the underlying processes.  For instance, take a billing process: it can be helpful to turn it into a process diagram, as this will help tease out requirements.  The diagram can also be used later, in combination with an ERD, to help document functional requirements.

The takeaway here: don't assume users know what they want.   Adapt your approach to the situation at hand.

Posted in Business Analysis

Business Analysis for BI projects

BI projects have complex elements such as dimensional modelling, modelling facts, hierarchies, calculations and so on.

This makes it essential that business analysis is applied on every project.  It's during the BA phase that requirements are picked up.  As BI professionals it's important we use our judgement to tease out requirements that users may not be aware of: for instance, slowly changing dimensions, late-arriving facts and late-arriving dimensions.

We can also start to get a feel for data quality, although data profiling is the key tool for determining it.  Users will often give their own opinion on data quality.

We can also get an understanding of the largest fact tables and any fact grained dimensions.

For some BI projects there are limited facts; we see this with HR and Legal.  It's very important to determine this early on.  That doesn't stop users from requesting facts: for instance, a hiring manager may wish to see the number of new hires or the number of applicants.  If the fact isn't available it will have to be created or inferred from some other field, such as Hire Date.
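
As a small illustration of inferring such a fact, here's a pandas sketch that derives a monthly new-hires count purely from a Hire Date field (the column names and data are invented):

import pandas as pd

# Invented sample data: no hires fact table, just an employee record with a hire date
employees = pd.DataFrame({
    "employeeId": [1001, 1002, 1003, 1004, 1005],
    "hireDate": pd.to_datetime(
        ["2016-01-15", "2016-01-28", "2016-02-03", "2016-03-11", "2016-03-24"]),
})

# Infer the "new hires" fact at a monthly grain from Hire Date
new_hires_by_month = (
    employees["hireDate"].dt.to_period("M").value_counts().sort_index())
print(new_hires_by_month)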

It's also important that we understand the process that underlies the OLTP database.   With BA tools we can document this process.   I had one situation where a billing process included dynamic revenue types.   The user wanted to analyse and report on revenue, which required master data to be added to the ETL pipeline; no small feat.    It's things like this you want to know about early on.

There’s a lot more too that I won’t go into…currency translation, cost allocations, complex measures.

It’s vitally important these things are well understood before development starts.

Do you agree?

Posted in Business Analysis

Cleaning data with T-SQL

When creating BI models and looking into the guts of the ERP system, we see data artifacts from user behavior, or sometimes poor design, which require us to get out our cleaning gear.

https://www.flickr.com/photos/internetsense/6789942482

It goes without saying the best course of action is to clean data at source  – if only this could happen more often.

In this post I'll show a little example I learnt recently for inferring dates with T-SQL.

In the scenario below we have HR data showing employees at various positions over time.
The time frame of each position is represented with Valid From and Valid To dates.

The integrity of this data is crucial if the fields are going to be used in a join.

When the validTo date is flaky we can infer it by looking at the validFrom
of the employee's next position. Usually this requires a complex query with self-joins and all manner of madness.

T-SQL in SQL Server 2012+ gives us the LEAD function. We can use LEAD() to write a simple query with fewer lines of code. It works by partitioning the data into sets split by employee; each set is sorted by validFrom, so the next row gives us the date we need. I had to use DATEADD to subtract one day from the next row's validFrom, as there was an issue mixing ints and dates with LEAD.

You can set up the data with the code below. I’ve added real ValidTo dates to show the inference is correct.

CREATE DATABASE [HR]
GO
USE HR;
CREATE TABLE [dbo].[employees](
	[id] [smallint] NOT NULL,
	[name] [nvarchar](50) NULL,
	[employeeId] [smallint] NULL,
	[validFrom] [date] NULL,
	[validTo] [date] NULL)


-- validTo is left blank ('') for each employee's current position;
-- SQL Server converts the empty string to 1900-01-01, which is the flaky value we want to infer over
INSERT INTO [dbo].[employees]
	(id, name,employeeId, validFrom, validTo)

values   (1,'Nelson', '1001', '20150101' ,'20150410')
		,(2,'Nelson', '1001', '20150411' ,'')
		,(3,'John', '1002', '20150301' ,'20151002')
		,(4,'John', '1002', '20151003' ,'')
		,(5,'Paul', '1003', '20150101' ,'20150410')
		,(6,'Paul', '1003', '20150411' ,'20160410')
		,(7,'Paul', '1003', '20160411','')
		,(8,'George', '1004', '20150310' ,'20160612')
		,(9,'George', '1004', '20160613' ,'20170410')
		,(10,'George', '1004', '20170411' ,'')
GO

SELECT id, name, employeeId,validFrom,validTo,
	    DATEADD(DAY,-1,LEAD(validFrom,1,'19000102') 
			OVER (PARTITION BY employeeId ORDER BY validFrom)) as InferedValidTo
FROM [dbo].[employees]
ORDER BY employeeId;

[Screenshot: query output showing the inferred ValidTo dates (tsql-lead)]

Posted in Data Quality