Unit Testing Data at Scale using Deequ and Apache Spark

Nishant Arora

Tags: Deequ, Apache Spark, Scala, AWS, Data Engineering

Everyone knows the importance of knowledge and how critical it is to progress. In today’s world, data is knowledge. But that’s only when the data is “good” and correctly interpreted. Let’s focus on the “good” part. What do we really mean by “good data”?

Its definition can change from use case to use case but, in general terms, good data can be defined by its accuracy, legitimacy, reliability, consistency, completeness, and availability.

Bad data can lead to failures in production systems, unexpected outputs, and wrong inferences, leading to poor business decisions.

It’s important to have something in place that can tell us about the quality of the data we have, how close it is to our expectations, and whether we can rely on it.

This is basically the problem we’re trying to solve.

The Problem and the Potential Solutions

A manual approach to data quality testing is definitely one of the solutions and can work well.

We’d need to write code to compute various statistical measures, run it manually on different columns, maybe draw some plots, and then do some spot checks to see if anything looks wrong or unexpected. The overall process can get tedious and time-consuming if we need to do it on a daily basis.

Certain tools can make life easier for us. In this blog, we’ll be focusing on one of them, Amazon Deequ.

Amazon Deequ

Amazon Deequ is an open-source tool developed and used at Amazon. It’s built on top of Apache Spark, so it’s great at handling big data. Deequ computes data quality metrics regularly, based on the checks and validations set, and generates relevant reports.

Deequ provides a lot of interesting features, and we’ll be discussing them in detail.

[Figure: Deequ’s main components. Source: AWS]

Prerequisites

Working with Deequ requires having Apache Spark up and running with Deequ as one of the dependencies.

As of this blog, the latest version of Deequ, 1.1.0, supports Spark 2.2.x to 2.4.x and Spark 3.0.x.

Sample Dataset

For learning more about Deequ and its features, we’ll be using an open-source IMDb dataset which has the following schema:

CODE: https://gist.github.com/velotiotech/1669a46152646538d81087bc03033fc9.js

Here, tconst is the primary key, and the rest of the columns are pretty much self-explanatory.

Data Analysis and Validation

Before we start defining checks on the data, we may want to compute some basic statistics on the dataset. Deequ provides an easy way to do that through what it calls metrics.

Deequ provides support for the following metrics:

CODE: https://gist.github.com/velotiotech/8d50987210ec8e2fffffa9cb5372bcc6.js

Let’s go ahead and apply some metrics to our dataset.

CODE: https://gist.github.com/velotiotech/21d701e7bc2e1e1a84a4af59a153da0f.js

We get the following output by running the code above:

CODE: https://gist.github.com/velotiotech/e41cafa35229d18a5bfa0b8a4f715df2.js

Let’s try to quickly understand what this tells us.

The dataset has 7,339,583 rows.
The distinctness and uniqueness of the tconst column are both 1.0, which means that every value in the column occurs exactly once. That’s expected, as it’s the primary key column.
The averageRating column has a min of 1 and a max of 10 with a mean of 6.88 and a standard deviation of 1.39, which tells us about the variation in the average rating values across the data.
The completeness of the averageRating column is 0.148, which tells us that we have an average rating available for around 15% of the dataset’s records.
Finally, we checked for a correlation between the numVotes and averageRating columns. This metric is the Pearson correlation coefficient, and its value of 0.01 means there’s practically no linear correlation between the two columns, which is expected.
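
To make the difference between completeness, distinctness, and uniqueness concrete, here is a minimal plain-Scala sketch of the three definitions. It does not use Deequ or Spark. The object name ColumnMetricsSketch is hypothetical, and leaving nulls out of the denominators for distinctness and uniqueness is an assumption made for illustration.

```scala
// Plain-Scala illustration of three column metrics; this is NOT Deequ code.
// A column is modelled as Seq[Option[String]], where None stands for a null.
object ColumnMetricsSketch {

  // Completeness: fraction of rows whose value is present.
  def completeness(column: Seq[Option[String]]): Double =
    if (column.isEmpty) 0.0
    else column.count(_.isDefined).toDouble / column.size

  // Occurrence count of every non-null value.
  private def frequencies(column: Seq[Option[String]]): Map[String, Int] =
    column.flatten.groupBy(identity).map { case (value, hits) => value -> hits.size }

  // Distinctness: number of different values over the number of non-null values
  // (assumption: nulls are ignored).
  def distinctness(column: Seq[Option[String]]): Double = {
    val nonNull = column.flatten.size
    if (nonNull == 0) 0.0 else frequencies(column).size.toDouble / nonNull
  }

  // Uniqueness: number of values that occur exactly once over the number of
  // non-null values (assumption: nulls are ignored).
  def uniqueness(column: Seq[Option[String]]): Double = {
    val nonNull = column.flatten.size
    if (nonNull == 0) 0.0
    else frequencies(column).count(_._2 == 1).toDouble / nonNull
  }
}
```

For a key-like column in which every value occurs once, all three functions return 1.0. For a column such as Seq(Some("movie"), Some("movie"), Some("short"), None), completeness is 0.75, distinctness is 2/3, and uniqueness is 1/3, because only "short" occurs exactly once.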

This feature of Deequ can be really helpful if we want to quickly do some basic analysis on a dataset.

Let’s move on to defining and running tests and checks on the data.

Data Validation

For writing tests for our dataset, we use Deequ’s VerificationSuite and add checks on attributes of the dataset.

Deequ has a big handy list of validators available to use, which are:

CODE: https://gist.github.com/velotiotech/f91b134406e91238d3c83a3789dc5ae9.js

Let’s apply some checks to our dataset.

CODE: https://gist.github.com/velotiotech/7be3f824a21f01db883b147020031966.js

We have added some checks to our dataset, and the details of each check can be seen in the comments in the code above.

We expect all checks to pass for our dataset except the containsURL and hasMax ones.

That’s because the titleType column doesn’t have URLs, and we know that the max rating is 10.0, but we are checking against 9.0.
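
The reason the hasMax check fails can be sketched without Deequ: a check is essentially an assertion applied to a computed metric. The snippet below is a plain-Scala illustration under that assumption. CheckSketch, MetricCheck, and the sample ratings are hypothetical names and values, not part of Deequ’s API or the IMDb dataset, and the column is assumed to be non-empty.

```scala
// Plain-Scala illustration of "a check is an assertion over a metric".
// Hypothetical names throughout; this is NOT Deequ's API.
object CheckSketch {

  // A check pairs a metric (computed from the column) with an assertion on its value.
  final case class MetricCheck(
      name: String,
      metric: Seq[Double] => Double,
      assertion: Double => Boolean)

  // Computes each metric and records whether its assertion holds.
  def run(checks: Seq[MetricCheck], column: Seq[Double]): Map[String, Boolean] =
    checks.map(check => check.name -> check.assertion(check.metric(column))).toMap

  // Hypothetical sample of averageRating values whose maximum is 10.0.
  val ratings: Seq[Double] = Seq(1.0, 6.5, 7.2, 10.0)

  val results: Map[String, Boolean] = run(
    Seq(
      MetricCheck("min is at least 1.0", values => values.min, value => value >= 1.0),
      MetricCheck("max is at most 9.0", values => values.max, value => value <= 9.0)),
    ratings)
}
```

Here the first check passes, while the second fails because the computed maximum (10.0) violates the assertion that it should not exceed 9.0.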

We can see the output below:

CODE: https://gist.github.com/velotiotech/4401083669a206183c0fb6d19f4aa566.js

To perform these checks, Deequ computes, behind the scenes, the same kind of metrics that we saw in the previous section.

To look at the metrics Deequ computed for the checks we defined, we can use:

CODE: https://gist.github.com/velotiotech/9a41ca555ad5cebead755055f710b919.js

Automated Constraint Suggestion

Automated constraint suggestion is a really interesting and useful feature provided by Deequ.

Adding validation checks on a dataset with hundreds of columns, or on a large number of datasets, can be challenging. With this feature, Deequ tries to make our task easier. Deequ analyses the data distribution and, based on that, suggests potentially useful constraints that can be used as validation checks.

Let’s see how this works.

This piece of code can automatically generate constraint suggestions for us:

CODE: https://gist.github.com/velotiotech/fdac31b374024041672780f77bedf5d7.js

Let’s look at constraint suggestions generated by Deequ:

CODE: https://gist.github.com/velotiotech/703f9c7987d0a4b1cd0672d47c65d166.js

We shouldn’t expect the constraint suggestions generated by Deequ to always make sense. They should always be verified before using.

This is because the algorithm that generates the constraint suggestions just works on the data distribution and isn’t exactly “intelligent.”

We can see that most of the suggestions generated make sense even though they might be really trivial.

For the endYear column, one of the suggestions is that endYear should be contained in a list of years, which is indeed true for our dataset. However, it can’t be generalized because, with every passing year, new values of endYear keep appearing.

On the other hand, the suggestion that titleType can take the values 'tvEpisode', 'short', 'movie', 'video', 'tvSeries', 'tvMovie', 'tvMiniSeries', 'tvSpecial', 'videoGame', and 'tvShort' makes sense and can be generalized, which makes it a great suggestion.

And this is why we should not blindly use the constraints suggested by Deequ and always cross-check them.

Something we can do to improve the constraint suggestions is to use the useTrainTestSplitWithTestsetRatio method in ConstraintSuggestionRunner. It makes a lot of sense to use this on large datasets.

How does this work? If we use the config useTrainTestSplitWithTestsetRatio(0.1), Deequ would compute constraint suggestions on 90% of the data and evaluate the suggested constraints on the remaining 10%, which would improve the quality of the suggested constraints.
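
The idea of holding out a test set can be sketched in plain Scala. This is not Deequ’s implementation: the object and method names are hypothetical, the only constraint being suggested is an "is contained in" set, and the split is a simple seeded shuffle chosen for illustration.

```scala
// Plain-Scala illustration of evaluating a suggested constraint on held-out rows.
// Hypothetical names; this is NOT how Deequ implements its train/test split.
object SplitSuggestionSketch {

  // Derives an "is contained in" constraint from the training rows and returns it
  // together with the fraction of held-out test rows that satisfy it.
  def suggestAndEvaluate(
      values: Seq[String],
      testRatio: Double,
      seed: Long): (Set[String], Double) = {
    val shuffled = new scala.util.Random(seed).shuffle(values)
    val testSize = math.round(values.size * testRatio).toInt
    val (test, train) = shuffled.splitAt(testSize)

    // The suggested constraint: every value must be one of those seen in training.
    val allowed = train.toSet

    val passRate =
      if (test.isEmpty) 1.0
      else test.count(allowed.contains).toDouble / test.size
    (allowed, passRate)
  }
}
```

With a ratio of 0.1 and a column of 100 rows split evenly between "movie" and "short", the held-out rows all satisfy the suggested set, so the pass rate is 1.0. With 100 distinct year strings, none of the held-out rows appear in the training rows, so the pass rate is 0.0. That mirrors why the titleType suggestion generalizes and the endYear one does not.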

Anomaly Detection

Deequ also supports anomaly detection for data quality metrics.

The idea behind Deequ's anomaly detection is that we often have a sense of how much change to expect in certain metrics of our data. Say we are getting new data every day, and we know that the number of records we get daily is around 8,000 to 12,000. If, on a random day, we get 40,000 records, we know something went wrong with the data ingestion job or some other job.

Deequ will regularly store the metrics of our data in a MetricsRepository. Once that’s done, anomaly detection checks can be run. These compare the current values of the metrics to the historical values stored in the MetricsRepository, and that helps Deequ to detect anomalous changes that are a red flag.

One of Deequ’s anomaly detection strategies is the RateOfChangeStrategy, which limits the maximum change in the metrics by some numerical factor that can be passed as a parameter.
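
The rate-of-change idea can be illustrated with a few lines of plain Scala. This sketch is not Deequ’s RateOfChangeStrategy: it compares each value with the previous one as a ratio, and the object name and the parameters maxIncreaseFactor and maxDecreaseFactor are hypothetical.

```scala
// Plain-Scala illustration of a rate-of-change rule over a metric's history.
// Hypothetical names; this is NOT Deequ's implementation.
object RateOfChangeSketch {

  // Returns the positions in `history` whose value grew by more than
  // maxIncreaseFactor, or shrank below maxDecreaseFactor, relative to the
  // previous value. Pairs whose previous value is 0 are skipped.
  def anomalies(
      history: Seq[Double],
      maxIncreaseFactor: Double,
      maxDecreaseFactor: Double): List[Int] =
    history.sliding(2).zipWithIndex.collect {
      case (Seq(previous, current), index)
          if previous != 0.0 &&
            (current / previous > maxIncreaseFactor ||
              current / previous < maxDecreaseFactor) =>
        index + 1
    }.toList
}
```

For daily record counts of 9000, 11000, 10000, 12000, and 40000, with an allowed increase factor of 2.0 and an allowed decrease factor of 0.5, only the last day (position 4) is flagged, since 40000 is more than three times the previous day’s count.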

Deequ supports other strategies as well. They can be found, along with code examples for anomaly detection, in the Deequ repository on GitHub.

Conclusion

We learned about the main features and capabilities of AWS Labs’ Deequ.

It might feel a little daunting to people unfamiliar with Scala or Spark, but using Deequ is very easy and straightforward. Someone with a basic understanding of Scala or Spark should be able to work with Deequ’s primary features without any friction.

For someone who rarely deals with data quality checks, manual test runs might be a good enough option. However, for someone dealing with new datasets frequently, as in multiple times in a day or a week, using a tool like Deequ to perform automated data quality testing makes a lot of sense in terms of time and effort.

We hope this article gave you a closer look at data quality testing and at using Deequ for it.


