Cleaning data with t-SQL

When creating BI models and looking into the guts of the ERP system we see data artifacts from user behavior or sometimes poor design which requires us to get out our cleaning gear.

https://www.flickr.com/photos/internetsense/6789942482

It goes without saying the best course of action is to clean data at source  – if only this could happen more often.

In this post I’ll show a little example I learnt recently to infer dates with T-SQL.

In the scenario below we have HR data showing employees at various positions over time.
The time frame of each position is represented with Valid From and Valid To dates.

The integrity of this data is crucial if the fields are going to be used in a join.

When the validTo date is flaky we can infer the date by looking at the validFrom
of the employees next position. Usually this requires a complex query with self-joins and other manners of madness.

T-SQL in SQL Server 2012+ gives us the LEAD function. We can use LEAD() to write a simple query with fewer lines of code. It works by partitioning the data into sets split by employee. Each set is sorted by validFrom so the next row gives us the date needed. I had to use DATEADD to subtract 1 from the next ValidTo date was there was an issue using LEAD with int and dates.

You can set up the data with the code below. I’ve added real ValidTo dates to show the inference is correct.

CREATE DATABASE [HR]
GO
USE HR;
CREATE TABLE [dbo].[employees](
	[id] [smallint] NOT NULL,
	[name] [nvarchar](50) NULL,
	[employeeId] [smallint] NULL,
	[validFrom] [date] NULL,
	[validTo] [date] NULL)


INSERT INTO [dbo].[employees]
	(id, name,employeeId, validFrom, validTo)

values   (1,'Nelson', '1001', '20150101' ,'20150410')
		,(2,'Nelson', '1001', '20150411' ,'')
		,(3,'John', '1002', '20150301' ,'20151002')
		,(4,'John', '1002', '20151003' ,'')
		,(5,'Paul', '1003', '20150101' ,'20150410')
		,(6,'Paul', '1003', '20150411' ,'20160410')
		,(7,'Paul', '1003', '20160411','')
		,(8,'George', '1004', '20150310' ,'20160612')
		,(9,'George', '1004', '20160613' ,'20170410')
		,(10,'George', '1004', '20170411' ,'')
GO

SELECT id, name, employeeId,validFrom,validTo,
	    DATEADD(DAY,-1,LEAD(validFrom,1,'19000102') 
			OVER (PARTITION BY employeeId ORDER BY validFrom)) as InferedValidTo
FROM [dbo].[employees]
ORDER BY employeeId;

tsql-lead

About Lee Hawthorn

Business Intelligence Consultant
This entry was posted in Data Quality and tagged . Bookmark the permalink.

Leave a Reply