Data Driven Success

This is the third article in a series discussing the importance of being data driven and providing a no-nonsense guide on how best to transition your organisation. This article details how to cleanse and consolidate your data. Later articles will discuss how to automate your processes and how to implement your dream data solution.

PART 3: START TO TRUST YOUR DATA

Clean, trustworthy data is essential for a company’s digital health and is fundamental to data-driven success. But that doesn’t mean throwing away all your existing data. An apple tree isn’t worthless just because of a few bad apples. The first step towards trusting your data is to carry out a data cleanse and remove those bad apples.

Data Cleanse

Poor-quality data isn’t a reflection of your organisation or its employees; it is simply a fact of life. Where data isn’t being reported on and interrogated, omissions and errors are missed, and poor data entry habits can become established. As you leverage your data, these issues will be addressed, and the data quality will naturally improve over time.

There are seven types of dirty data that you may come across:

  1. Duplicate - e.g., John Smith & John Smith

  2. Inaccurate - e.g., Fohn Smith

  3. Incomplete - e.g., J. Smith

  4. Inconsistent - e.g., John Smith & Jon Smith

  5. Non-uniform - e.g., 2022-02-03 & 3rd Feb 2022

  6. Excess - e.g., John Paul Logan Richard Smith

  7. Non-GDPR compliant - e.g., John Smith & Cis Male

Duplicate, inaccurate and incomplete data are all reasonably self-explanatory.  But there are other kinds which are just as insidious.

Inconsistent data is probably going to cause you the biggest headache. This is where two data entries, which may or may not be on different systems, are in direct conflict. For example, this could be two dates of birth which differ. How critical this data is will determine whether you can apply a simplistic approach, such as always trusting the data from one system when there is a conflict. If it is critical that the data is 100% accurate, then you may need to cleanse your data row by row, which will be time-consuming and expensive.
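As a rough illustration of that simplistic approach, the sketch below (in Python, using the pandas library, with made-up system names and column names) flags conflicting dates of birth between two exports and then resolves them by always trusting one system:

```python
import pandas as pd

# Illustrative exports from two systems, keyed on a shared customer_id.
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "date_of_birth": ["1980-05-01", "1975-11-23", "1990-02-14"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "date_of_birth": ["1980-05-01", "1975-11-22", "1990-02-14"],
})

merged = crm.merge(billing, on="customer_id", suffixes=("_crm", "_billing"))

# Flag the rows where the two systems disagree so they can be reviewed.
conflicts = merged[merged["date_of_birth_crm"] != merged["date_of_birth_billing"]]
print(conflicts)

# Simplistic resolution: always trust the CRM when there is a conflict.
merged["date_of_birth"] = merged["date_of_birth_crm"]
```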

An example of non-uniform data could be dates which are stored in different formats.  One common problem to watch out for is where dates are stored in an American format (mm-dd-yyyy).  If they are combined with dates of the format dd-mm-yyyy, you can create significant problems.  For this reason, I would always advise storing dates in international format (yyyy-mm-dd) to avoid any confusion.
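A minimal sketch of that normalisation, assuming the widely used dateutil library and some illustrative date strings, might look like this:

```python
from dateutil import parser

# Non-uniform date strings as they might arrive from different systems.
raw_dates = ["3rd Feb 2022", "03/02/2022", "Feb 3, 2022"]

# Normalise everything to the international format (yyyy-mm-dd).
# dayfirst=True treats 03/02/2022 as 3 February rather than 2 March.
iso_dates = [parser.parse(d, dayfirst=True).strftime("%Y-%m-%d") for d in raw_dates]
print(iso_dates)  # ['2022-02-03', '2022-02-03', '2022-02-03']
```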

The final two types of dirty data, excess and non-GDPR compliant, are linked. If you are storing data that isn’t relevant to your current or future needs, then at best you are complicating your data store and at worst you could be breaking the law.

There are many ways to cleanse data. First, determine how far back you want to cleanse. You may have 20 years’ worth of data, but is it all relevant? Will the past three or five years suffice? Consider working backwards in manageable chunks, and I would always prioritise systems based on how mission-critical they are and whether they are considered a Source of Truth.

It may be possible to cleanse using macro-editing.  For example, if you are cleaning addresses and have data entries with OXON as the county, you could change all of these to Oxfordshire in one go.
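A macro-edit like this can often be done in a single operation. Here is a minimal sketch using pandas on illustrative address data:

```python
import pandas as pd

# Illustrative address data with an abbreviated county value.
addresses = pd.DataFrame({
    "town":   ["Banbury", "Bicester", "Oxford"],
    "county": ["OXON", "Oxfordshire", "OXON"],
})

# Macro-edit: correct every occurrence of the abbreviation in one go.
addresses["county"] = addresses["county"].replace({"OXON": "Oxfordshire"})
print(addresses)
```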

However, you will almost certainly need to perform micro-editing where you clean up individual entries.

Of all of the dirty data types, inaccurate data is going to be the most difficult to spot.  It generally requires a combination of visual inspection and rules-based processing.

On the visual side, one option is to use a scatter plot and look for outliers which could be potential errors. In our example below, there is a clear outlier on the right-hand side which looks very different from the rest of the data. I have put a square box around it to highlight the dot. It could be legitimate, but it is most likely a rogue data point.
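Alongside visual inspection, a simple statistical rule can surface candidates for review. The sketch below uses pandas and a common interquartile-range rule of thumb on illustrative order values; the threshold is an assumption and will vary by data set:

```python
import pandas as pd

# Illustrative order values; the last entry is a suspicious outlier.
orders = pd.DataFrame({"order_value": [120, 128, 131, 135, 142, 9999]})

# Flag values outside 1.5 * IQR of the quartiles, a common rule of thumb,
# as potential errors for manual review.
q1 = orders["order_value"].quantile(0.25)
q3 = orders["order_value"].quantile(0.75)
iqr = q3 - q1
outside = (orders["order_value"] < q1 - 1.5 * iqr) | (orders["order_value"] > q3 + 1.5 * iqr)

print(orders[outside])  # only the 9999 row is flagged
```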

Once you have cleansed your data, you want it to stay that way. You should implement data validation rules on your data input and remember to actively search for new dirty data that has crept into your systems.  For example, you can build exception reports which use rules to highlight potential data entry errors.
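As a rough sketch of such an exception report, the example below applies two illustrative rules (a missing name and a malformed email address) to made-up customer records using pandas:

```python
import pandas as pd

# Illustrative customer records to be checked by an exception report.
customers = pd.DataFrame({
    "name":  ["John Smith", "J. Smith", ""],
    "email": ["john@example.com", "not-an-email", "jane@example.com"],
})

# Simple rules that highlight potential data entry errors for review.
rules = {
    "missing name":  customers["name"].str.strip() == "",
    "invalid email": ~customers["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

for rule_name, failed in rules.items():
    exceptions = customers[failed]
    if not exceptions.empty:
        print(f"Exceptions for rule '{rule_name}':")
        print(exceptions, end="\n\n")
```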

A technology which may be worth considering is Master Data Management (MDM). There are various software packages available, but they work on the premise of consolidating data to create “golden records”. These are accurate, trusted records that can be relied upon by the organisation. Where the software detects a potential conflict or error, the record is quarantined until a data steward can review it and correct any problems. This keeps your data clean and reduces the risk of errors creeping in.
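The toy example below illustrates the golden-record idea only; it is not how any particular MDM package works. Records that agree are merged into a single trusted record, while conflicting records are quarantined for a data steward:

```python
# Toy illustration of the golden-record idea (not any specific MDM product):
# records that agree are merged; records that conflict are quarantined
# for a data steward to review.
def consolidate(records, key_field, fields):
    golden, quarantine = {}, []
    for record in records:
        key = record[key_field]
        if key not in golden:
            golden[key] = dict(record)
            continue
        conflicts = [f for f in fields if golden[key].get(f) != record.get(f)]
        if conflicts:
            quarantine.append((key, conflicts, record))
        # If there is no conflict, the existing golden record already covers it.
    return golden, quarantine

records = [
    {"id": 1, "name": "John Smith", "dob": "1980-05-01"},
    {"id": 1, "name": "John Smith", "dob": "1980-05-01"},   # duplicate, merged away
    {"id": 1, "name": "Jon Smith",  "dob": "1980-05-01"},   # conflict, quarantined
]
golden, quarantine = consolidate(records, "id", ["name", "dob"])
print(golden)       # one trusted record per id
print(quarantine)   # conflicting records awaiting a data steward
```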

Integrate

From the systems and data sources you have identified, how will you consolidate your data?  Or in fact, do you even need to do so?

Traditionally, companies have opted for a data warehouse solution to gather and aggregate data from multiple sources for reporting purposes.  This is still a viable solution, but other less costly options may be a better fit for your organisation.

Data warehouses typically use a process known as ETL, or Extract, Transform, Load.  It pretty much does what it says on the tin.  You first extract the data from your source system, use various functions to transform and refine the data into the format you want and then finally load it into your data warehouse.
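As a minimal sketch of that ETL flow, with SQLite standing in for a real data warehouse and with illustrative file and column names:

```python
import sqlite3
import pandas as pd

# Extract: pull the raw data out of the source system's export.
raw = pd.read_csv("crm_export.csv")

# Transform: refine the data into the shape the warehouse expects.
raw["county"] = raw["county"].replace({"OXON": "Oxfordshire"})
raw["signup_date"] = pd.to_datetime(raw["signup_date"], dayfirst=True)
clean = raw.drop_duplicates(subset="customer_id")

# Load: write the transformed data into a warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("customers", conn, if_exists="replace", index=False)
```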

Data warehouses only handle fully structured or relational data.  However, data can come in many different formats and sizes.  Companies often have large quantities of unstructured data such as Word documents and PDF files.

This has led to a new technology in recent years, and you may have come across the concept of a Data Lake.  Data lake storage is designed for fault-tolerance, infinite scalability, and high-throughput ingestion of data with varying shapes and sizes. In other words, you can pretty much store any type of digital data within a Data Lake.

The danger here is that your Data Lake simply becomes a dumping ground, with no thought into how that data will be used in the future.  This can rapidly turn your pristine Data Lake into a polluted and dirty Data Swamp.

Another option is to do away with a centralised data repository and consolidate your data within a data visualisation tool instead.  Here we can employ an alternative to the ETL process we discussed earlier.  That is ELT, or Extract, Load, Transform. This is a good option for organisations with a moderate amount of data who don’t want the complexities and cost of a data warehouse.  With ELT, you simply load the raw data into your data analytics software and transform it before visualisation.
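By contrast, a minimal ELT sketch loads the raw extracts first and only transforms them when a report is needed. Again, the file and column names are illustrative:

```python
import pandas as pd

# Extract and Load: bring the raw exports straight in, untouched.
sales = pd.read_csv("sales_export.csv")
customers = pd.read_csv("crm_export.csv")

# Transform: shape the data only when it is needed for a report.
report = (
    sales.merge(customers, on="customer_id", how="left")
         .assign(order_date=lambda df: pd.to_datetime(df["order_date"], dayfirst=True))
         .groupby("county", dropna=False)["order_value"].sum()
         .reset_index()
)
print(report)
```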

Ultimately, there is no simple “one size fits all” solution and the route you take will be governed by your own unique circumstances.  However, I would urge you to think carefully about the options before making a costly mistake.

Whichever approach you take, a useful first step is to draw a Venn Diagram showing how, or if, the data in your systems overlaps. If we take the example below, we can see clearly how the data in four of the systems overlaps, but the facilities system is completely standalone. There would be limited value in trying to consolidate the data from this system in a centralised data store.

As I mentioned earlier, if you have complex data flows involving multiple systems, particularly where that data needs to be kept synchronised, then it may be worth looking at iPaaS software. iPaaS stands for Integration Platform as a Service: a set of automated tools that integrate software applications in different environments and flow data between them.

In effect the iPaaS acts as a central data hub.  All of your systems can communicate with each other by passing messages through the hub. 

One powerful feature of an iPaaS architecture is the ability to “Plug & Play”. That is, it is relatively easy to replace a system when it reaches end of life. As long as the replacement system has the same inputs and outputs, swapping it in will have no impact on the rest of the systems in your environment.
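The toy sketch below illustrates the hub idea only; it is not any real iPaaS product's API. Systems publish messages to the hub and subscribe to the topics they care about, so a system can be swapped out without the rest noticing:

```python
from collections import defaultdict

class Hub:
    """Toy central data hub: systems talk to the hub, never to each other."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:
            handler(message)

hub = Hub()

# The finance system only knows about the hub, not about the CRM.
hub.subscribe("customer.updated", lambda msg: print("Finance synced:", msg))

# Swapping the CRM for a new one is "Plug & Play": as long as the
# replacement publishes the same messages, nothing downstream changes.
hub.publish("customer.updated", {"customer_id": 1, "county": "Oxfordshire"})
```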

To Conclude:

These methods will undoubtedly help you improve your data quality, allow you to trust your data, keep your digital health in good condition, and set up the launching pad for your data-driven path to success.

To return to the apple tree metaphor from earlier: once you have removed all of your bad apples, you will be ready to harvest!

If you have any questions or wish to discuss this topic in more detail, then you are more than welcome to contact me using the details at the end of this article.

Richard Hunter

I am the founder of SocialSidekick and the Marketing Manager at Coventry Building Society Arena.

I have worked in digital marketing and web design since 2006. My specialisms include social media marketing, eCommerce and advertising.

You can contact me via this website.

https://www.socialsidekick.co.uk