For the first of my ‘portfolio’ posts, I am going to discuss one of the major stumbling blocks that I, and many others starting out in fields such as business intelligence and data science, have come across. Data cleaning and preparation is among the most important parts of the project lifecycle for any business intelligence and data science project. It is estimated that 80% of the work of people working in these fields relates to data cleaning and preparation in some way. Unfortunately, it’s often overlooked in university programs, online courses and in learning materials in general, despite its obvious importance.
Often, introductory courses will look at more exciting parts of business intelligence and data science, such as data visualization and machine learning. To an extent, this is understandable. These topics are useful ‘hooks’ to get beginners started on interesting and engaging tasks. However, without learning how to clean and prepare data, thoroughly understanding and being able to work through all the stages of a project is not feasible. Insufficient data cleaning and preparation will also compromise the final results obtained. As the saying goes, ‘Garbage In – Garbage Out’.
In this section of the article, I will go through some general principles and best practices for data cleaning and preparation. While there are of course many more techniques and advanced concepts within this area, they are beyond the scope of this article. I intend for this to be a starting point people, who like myself, are new to fields related to data and who want to get an idea of how to clean and prepare data.
When a dataset is obtained, the first thing to do is an exploratory analysis of it. In this stage, you should get a feel for the data within it. One of the first things to look for when doing the exploratory analysis is to make sure that the entries are valid. For example, do the fields that require a number have a numerical entry? On a similar note, entries should also make sense within the dataset provided. This will require a bit of domain knowledge of the subject of the data. For example, if looking at a dataset of wages, do the amounts make sense? If the average value within the dataset is, say, $100,000, and there is an entry that is $1,000,000, there’s a good chance this is an incorrect entry. However, this is all dependent on the context of the dataset.
Duplicate and null entries are also a priority to check for during this stage. Particularly with larger datasets, these entries are likely to arise at some point. They can often be overlooked as they are not always as obvious to find, particularly at an initial glance of a dataset.
Wikipedia provides a useful summary of the dimensions of data quality. They are as follows:
- Validity (Do measures conform to defined rules or constraints?)
- Accuracy (Do measures conform to a standardized value?)
- Completeness (Are all the required measures known?)
- Consistency (Are the recorded measures the same across the dataset?)
- Uniformity (Does the dataset use the same units of measurement?)
An awareness of these factors of data quality and some preliminary work to ensure these are adhered to in the preparation and cleaning stages can vastly improve the final results of a project, as well as save a lot of time avoiding confusion and errors in later stages of a project. It can take some time to become accustomed to doing this and can be tedious at times, but establishing good practices of data cleaning and preparation is one of the most valuable things any beginner to business intelligence and data science can do.