Everyone knows the importance of knowledge and how critical it is to progress. In today’s world, data is knowledge. But that’s only when the data is “good” and correctly interpreted. Let’s focus on the “good” part. What do we really mean by “good data”?
Its definition can change from use case to use case but, in general terms, good data can be defined by its accuracy, legitimacy, reliability, consistency, completeness, and availability.
Bad data can lead to failures in production systems, unexpected outputs, and wrong inferences, leading to poor business decisions.
It’s important to have something in place that can tell us about the quality of the data we have, how close it is to our expectations, and whether we can rely on it.
This is basically the problem we’re trying to solve.
The Problem and the Potential Solutions
A manual approach to data quality testing is definitely one of the solutions and can work well.
We’ll need to write code for computing various statistical measures, running them manually on different columns, maybe draw some plots, and then conduct some spot checks to see if there’s something not right or unexpected. The overall process can get tedious and time-consuming if we need to do it on a daily basis.
Certain tools can make life easier for us, like:
In this blog, we’ll be focussing on Amazon Deequ.
Amazon Deequ is an open-source tool developed and used at Amazon. It’s built on top of Apache Spark, so it’s great at handling big data. Deequ computes data quality metrics regularly, based on the checks and validations set, and generates relevant reports.
Deequ provides a lot of interesting features, and we’ll be discussing them in detail. Here’s a look at its main components:
Working with Deequ requires having Apache Spark up and running with Deequ as one of the dependencies.
As of this blog, the latest version of Deequ, 1.1.0, supports Spark 2.2.x to 2.4.x and Spark 3.0.x.
For learning more about Deequ and its features, we’ll be using an open-source IMDb dataset which has the following schema:
Here, tconst is the primary key, and the rest of the columns are pretty much self-explanatory.
Data Analysis and Validation
Before we start defining checks on the data, if we want to compute some basic stats on the dataset, Deequ provides us with an easy way to do that. They’re called metrics.
Deequ provides support for the following metrics:
Let’s go ahead and apply some metrics to our dataset.
We get the following output by running the code above:
Let’s try to quickly understand what this tells us.
- The dataset has 7,339,583 rows.
- The distinctness and uniqueness of the tconst column is 1.0, which means that all the values in the column are distinct and unique, which should be expected as it’s the primary key column.
- The averageRating column has a min of 1 and a max of 10 with a mean of 6.88 and a standard deviation of 1.39, which tells us about the variation in the average rating values across the data.
- The completeness of the averageRating column is 0.148, which tells us that we have an average rating available for around 15% of the dataset’s records.
- Then, we tried to see if there’s any correlation between the numVotes and averageRating column. This metric calculates the Pearson correlation coefficient, which has a value of 0.01, meaning there’s no correlation between the two columns, which is expected.
This feature of Deequ can be really helpful if we want to quickly do some basic analysis on a dataset.
Let’s move on to defining and running tests and checks on the data.
For writing tests for our dataset, we use Deequ’s VerificationSuite and add checks on attributes of the dataset.
Deequ has a big handy list of validators available to use, which are:
Let’s apply some checks to our dataset.
We have added some checks to our dataset, and the details about the check can be seen as comments in the above code.
We expect all checks to pass for our dataset except the containsURL and hasMax ones.
That’s because the titleType column doesn’t have URLs, and we know that the max rating is 10.0, but we are checking against 9.0.
We can see the output below:
In order to perform these checks, behind the scenes, Deequ calculated metrics that we saw in the previous section.
To look at the metrics Deequ computed for the checks we defined, we can use:
Automated Constraint Suggestion
Automated constraint suggestion is a really interesting and useful feature provided by Deequ.
Adding validation checks on a dataset with hundreds of columns or on a large number of datasets can be challenging. With this feature, Deequ tries to make our task easier. Deequ analyses the data distribution and, based on that, suggests potential useful constraints that can be used as validation checks.
Let’s see how this works.
This piece of code can automatically generate constraint suggestions for us:
Let’s look at constraint suggestions generated by Deequ:
We shouldn’t expect the constraint suggestions generated by Deequ to always make sense. They should always be verified before using.
This is because the algorithm that generates the constraint suggestions just works on the data distribution and isn’t exactly “intelligent.”
We can see that most of the suggestions generated make sense even though they might be really trivial.
For the endYear column, one of the suggestions is that endYear should be contained in a list of years, which indeed is true for our dataset. However, it can’t be generalized as every passing year, the value for endYear continues to increase.
But on the other hand, the suggestion that titleType can take the following values: 'tvEpisode,' 'short,' 'movie,' 'video,' 'tvSeries,' 'tvMovie,' 'tvMiniSeries,' 'tvSpecial,' 'videoGame,' and 'tvShort' makes sense and can be generalized, which makes it a great suggestion.
And this is why we should not blindly use the constraints suggested by Deequ and always cross-check them.
Something we can do to improve the constraint suggestions is to use the useTrainTestSplitWithTestsetRatio method in ConstraintSuggestionRunner.
It makes a lot of sense to use this on large datasets.
How does this work? If we use the config useTrainTestSplitWithTestsetRatio(0.1), Deequ would compute constraint suggestions on 90% of the data and evaluate the suggested constraints on the remaining 10%, which would improve the quality of the suggested constraints.
Deequ also supports anomaly detection for data quality metrics.
The idea behind Deequ's anomaly detection is that often we have a sense of how much change in certain metrics of our data can be expected. Say we are getting new data every day, and we know that the number of records we get on a daily basis are around 8 to 12k. On a random day, if we get 40k records, we know something went wrong with the data ingestion job or some other job didn’t go right.
Deequ will regularly store the metrics of our data in a MetricsRepository. Once that’s done, anomaly detection checks can be run. These compare the current values of the metrics to the historical values stored in the MetricsRepository, and that helps Deequ to detect anomalous changes that are a red flag.
One of Deequ’s anomaly detection strategies is the RateOfChangeStrategy, which limits the maximum change in the metrics by some numerical factor that can be passed as a parameter.
We learned about the main features and capabilities of AWS Lab’s Deequ.
It might feel a little daunting to people unfamiliar with Scala or Spark, but using Deequ is very easy and straightforward. Someone with a basic understanding of Scala or Spark should be able to work with Deequ’s primary features without any friction.
For someone who rarely deals with data quality checks, manual test runs might be a good enough option. However, for someone dealing with new datasets frequently, as in multiple times in a day or a week, using a tool like Deequ to perform automated data quality testing makes a lot of sense in terms of time and effort.
We hope this article helped you get a deep dive into data quality testing and using Deequ for these types of engineering practices.