5 things data analysts need to check when ensuring data quality
In a LinkedIn Live session with George Firican, I asked how to get everyone to ensure the correctness and quality of the data around them. His response was encouraging.
You can access the video here.
If you would like to follow his weekly Good Data Morning Show, kindly follow him on LinkedIn.
Year on year, bad data costs businesses and entire industries millions of dollars.
Apart from the effect of poor data-driven decisions, the man-hours wasted cleaning data could be channelled into something more important and productive in the organization.
If your ETL (Extract, Transform, Load) is not meticulously done, your EDA (Exploratory Data Analysis) will likely produce poor or misleading output, ultimately affecting your data quality.
In this article, we will explore 5 major exploratory data analysis checks you need to carry out when preparing your data for visualization.
They are:
- Missing values
- Values outside of an expected range
- Malformed values
- Values that violate business rules
- Inconsistent values across tables
Missing Values.
There are several reasons why we encounter missing values in a dataset. The most notable occur at the points of collection and extraction.
There are different data collection points, ranging from a simple front-desk executive to a complex web application.
These collection points treat data perceived to be private, such as a person’s annual income or age, lightly, and therefore make those fields optional.
You cannot blame someone for not giving out data he or she considers private, but when the company’s data is analysed, those ignored fields suddenly become important.
Naturally, when a field is optional, people tend to skip it. SQL represents a missing value as NULL, and it is tempting for analysts to ignore NULLs, believing their effect is negligible.
Yes, in some cases where a dataset has millions or billions of records, a few missing values may not affect the outcome of the analysis and visualization.
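Before deciding how to handle missing values, it helps to measure how many there are. Here is a minimal sketch in standard SQL, assuming a hypothetical employees table with an annual_income column (both names are placeholders for your own schema):

```sql
-- Count missing annual_income values and their share of the table.
SELECT
    COUNT(*) AS total_rows,
    COUNT(annual_income) AS rows_with_income,                        -- COUNT(column) skips NULLs
    COUNT(*) - COUNT(annual_income) AS missing_income,
    100.0 * (COUNT(*) - COUNT(annual_income)) / COUNT(*) AS pct_missing
FROM employees;
```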
Let’s take a look at how we can take care of some missing values.
Note: Missing values are not always numeric (continuous); they can also be categorical.
Deleting Missing Values
The most convenient way of handling missing values is to delete the affected records and work only with complete rows or columns.
We can choose to delete the rows affected or the columns affected. Row deletion is more popular than column deletion.
Even so, neither is an absolute fix to the problem. Column deletion happens mostly when there are many missing values in the same column.
Another situation where this can happen is when data is joined from multiple sources.
Such a join can produce a column that is useless to the main objective.
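As a rough sketch, again using the hypothetical employees table, you can either filter out incomplete rows at query time (non-destructive) or delete them permanently (use with care):

```sql
-- Option 1: analyse only complete rows, leaving the stored data untouched.
SELECT *
FROM employees
WHERE annual_income IS NOT NULL
  AND age IS NOT NULL;

-- Option 2: permanently remove rows with missing values.
DELETE FROM employees
WHERE annual_income IS NULL
   OR age IS NULL;
```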
Replacing Missing Data
Another prominent way to fix missing values is to replace them with other related values.
For instance, for relatively small on-premise data, one can fill in the values manually by referring back to the data source (this mostly applies to companies generating a small amount of data monthly).
Secondly, where there is no trend and the values are consistent, it might be safe to use the value directly above or below the missing one (this method can hide a trend if one exists).
A better way is to use boundary values, the minimum and maximum, to derive a replacement for the missing value.
Another method for imputing missing values is to use measures of central tendency such as the mean or median; this produces a very smooth analysis, but the variance (in simple terms, the spread of the data) will definitely suffer.
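Here is a minimal sketch of mean imputation on the same hypothetical employees table. COALESCE substitutes the average at query time, while the UPDATE makes the change permanent (note that some engines, such as MySQL, do not allow an UPDATE to select from the table being updated, so a derived table may be needed there):

```sql
-- Non-destructive: substitute the column average at query time.
SELECT
    employee_id,
    COALESCE(annual_income,
             (SELECT AVG(annual_income) FROM employees)) AS annual_income_filled
FROM employees;

-- Destructive: write the average back into the missing rows.
UPDATE employees
SET annual_income = (SELECT AVG(annual_income) FROM employees)
WHERE annual_income IS NULL;
```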
When the missing values are categorical, the statistical methods above cannot be used.
You will need another way to fill the gap.
For example, when a gender value is missing, you may want to check the first name; if the name is typically feminine, the chance that the missing value is female is high.
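As a purely illustrative sketch, assume a hypothetical name_gender lookup table (first_name, likely_gender) built from a public names list; neither the table nor its contents exist in your schema until you create them:

```sql
-- Fill missing gender values from a hypothetical first-name lookup table.
UPDATE employees
SET gender = (
    SELECT ng.likely_gender
    FROM name_gender ng
    WHERE ng.first_name = employees.first_name
)
WHERE gender IS NULL;
```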
There are several other ways to take care of missing values; the right one depends on the attribute and the type of value.
Values outside of an expected range
Imagine an employee’s hire date being earlier than the company’s founding date, or an end date that lies in the future.
How about an employee’s age being 200 years? That does not make any logical or business sense.
It is important to carefully check the range of events being described in the dataset.
In doing this, you need to compare dates and see if their differences look reasonable.
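As a quick illustration, assuming a hypothetical employees table and an assumed company founding date of 2005-01-01, the query below flags hire dates that cannot be right:

```sql
-- Hire dates earlier than the (assumed) founding date or later than today.
SELECT employee_id, hire_date
FROM employees
WHERE hire_date < DATE '2005-01-01'   -- assumed company founding date
   OR hire_date > CURRENT_DATE;
```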
When working with numeric columns, you need to check that the range of values is within what would be expected.
For example, when the average sale of a product is $15 and you see a particular sale of $1,000, it is clearly an outlier.
You do not have to discard it outright; instead, investigate and find out what really happened.
Standard deviation, calculated with the SQL STDDEV function, will help in finding these outliers.
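A sketch of that idea, assuming a hypothetical sales table with a sale_amount column, flags anything more than three standard deviations from the mean (STDDEV is spelled STDEV in SQL Server):

```sql
-- Sales more than three standard deviations away from the average sale.
WITH stats AS (
    SELECT AVG(sale_amount)    AS mean_sale,
           STDDEV(sale_amount) AS sd_sale
    FROM sales
)
SELECT s.sale_id, s.sale_amount
FROM sales s
CROSS JOIN stats
WHERE s.sale_amount > stats.mean_sale + 3 * stats.sd_sale
   OR s.sale_amount < stats.mean_sale - 3 * stats.sd_sale;
```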
Please note that outliers are not completely useless in a dataset; this is why you must investigate them.
They could be a typographical error or a pointer to the next big thing for the business, so don’t be in a hurry to discard them.
Malformed values
Malformed values usually occur in categorical data, more precisely in strings. They are values that are not properly formatted.
Some of these malformed values are:
- Truncated names and descriptions in a column.
- Dates that are formatted with invalid characters.
- Text values with no spaces between words, or text that is wrongly spaced or contains multiple spaces.
Checking for these is important during your exploratory data analysis.
Imagine having to look for six-letter words that start with “fl” using the “_” wildcard in SQL; if any of the text contains stray spaces, you will end up with wrong or unwanted output.
You can fix these with the help of wildcards and the SQL UPDATE statement, as in the sketch below.
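Here is a minimal sketch against a hypothetical products table: the LIKE pattern finds the six-letter values, and TRIM plus REPLACE clean up the most common spacing problems (older SQL Server versions need LTRIM/RTRIM instead of TRIM):

```sql
-- Six-character values starting with 'fl'; each '_' matches exactly one character.
SELECT product_name
FROM products
WHERE product_name LIKE 'fl____';

-- Remove leading/trailing spaces and collapse double spaces to single spaces.
UPDATE products
SET product_name = REPLACE(TRIM(product_name), '  ', ' ');
```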
Values that violate business rules
Business rule violations are easy to spot if the data analyst is an in-house analyst or one who is strong and experienced in the domain in question.
I once worked on a dataset from an Australian winery, and in the column for the weight of a bottle of wine I saw inconsistent values like 80 kg and 5 g.
Although those values are valid weights, no bottle of wine is likely to weigh as much as 80 kg or as little as 5 g.
I had to handle the wrongly entered values manually.
Other things that indicate a business rule has been violated include (a couple of checks are sketched in SQL after the list):
- An order total that is greater than a customer’s credit limit.
- The date an order is shipped is earlier than the date it was ordered.
- A product ID in an order refers to a product that is no longer available.
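Two of these checks can be written as straightforward queries, assuming hypothetical orders and customers tables; adapt the names to your own schema:

```sql
-- Orders shipped before they were placed.
SELECT order_id, order_date, ship_date
FROM orders
WHERE ship_date < order_date;

-- Orders whose total exceeds the customer's credit limit.
SELECT o.order_id, o.order_total, c.credit_limit
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_total > c.credit_limit;
```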
The list is endless; it mostly depends on the domain and the problem statement.
Inconsistent values across tables.
When referential integrity in a database is not enforced, we get inconsistent values across tables.
Referential integrity helps ensure the accuracy and consistency of data in a relational database.
For example, an order item whose Order ID does not exist in the orders table.
Or an employee assigned to a department that cannot be found in the department table.
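A common way to surface both problems is an anti-join (a LEFT JOIN that keeps only unmatched rows); the sketch below assumes hypothetical order_items, orders, employees, and departments tables:

```sql
-- Order items whose order_id has no matching row in the orders table.
SELECT oi.order_item_id, oi.order_id
FROM order_items oi
LEFT JOIN orders o ON o.order_id = oi.order_id
WHERE o.order_id IS NULL;

-- Employees assigned to a department that does not exist in the departments table.
SELECT e.employee_id, e.department_id
FROM employees e
LEFT JOIN departments d ON d.department_id = e.department_id
WHERE e.department_id IS NOT NULL
  AND d.department_id IS NULL;
```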
These are some of the reasons to perform data quality checks during your exploratory data analysis.
Conclusion
These are by no means all the checks an analyst or data engineer needs to ensure data quality, but they are a very good place to start.
The 5 checks above cover the issues most commonly found in datasets.
In an ideal world, all datasets are clean, but in the real world, no data is ever truly clean.
As a data analyst, it is important to have this mindset, as it will keep you conscious of always checking your data for quality before analysis and visualization.
There is no one-size-fits-all method for cleaning data; you will have to make decisions on a case-by-case basis.
The more dirty data you clean, the better you become and the faster you clean subsequent datasets.
The best solution would be to leave no room for dirty data at all; unfortunately, such a scenario does not exist.
The next best solution is to reduce the chances of poor data entering your systems.
This can be achieved by being part of the data collection decision making process.
As I said, this list is far from exhaustive. Please use the comment box to let us know your experience and the tools you use for data cleaning and exploration.
You may also ask questions if you have any. We are all here to learn.