
Data QA: The Need of the Hour

Rita Kushwaha

Data Engineering

Have you ever encountered vague or misleading data analytics reports? Are you struggling to provide accurate data values to your end users? Has a mapping application ever misdirected you to the wrong destination? Imagine Amazon customers expressing dissatisfaction because the wrong product arrived at their doorstep.

These issues stem from the use of incorrect or vague data by application/service providers. The need of the hour is to address these challenges by enhancing data quality processes and implementing robust data quality solutions. Through effective data management and validation, organizations can unlock valuable insights and make informed decisions.

"Harnessing the potential of clean data is like painting a masterpiece with accurate brushstrokes."

Introduction

Data quality assurance (QA) is the systematic approach organizations use to ensure their data is reliable, correct, consistent, and relevant. It involves various methods, approaches, and tools to maintain good data quality throughout the data lifecycle, from collection to consumption.

What is Data Quality?

Data quality refers to the overall utility of a dataset and its ability to be easily processed and analyzed for other uses. It is an integral part of data governance that ensures your organization’s data is fit for purpose. 

How can I measure Data Quality?

Data quality is typically measured along a set of standard dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. Tracking these dimensions as concrete metrics lets you quantify how fit your data is for its intended use.
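As a quick illustration, here is a minimal pandas sketch (the dataset and column names are hypothetical) that computes a few of these dimensions as simple metrics:

```python
import pandas as pd

# Illustrative dataset; table and column names are hypothetical.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "total_amount": [25.0, None, 30.0, 12.5],
})

completeness = orders["total_amount"].notna().mean()       # share of non-null values
uniqueness = orders["order_id"].nunique() / len(orders)    # share of distinct keys
validity = orders["total_amount"].between(10, 100).mean()  # share within an expected range

print(f"completeness={completeness:.2f}, uniqueness={uniqueness:.2f}, validity={validity:.2f}")
```

In practice, such checks are automated and tracked over time by the data quality tools discussed later in this post.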

What is the critical importance of Data Quality?

Remember, good data underpins every reliable report, model, and decision. So, invest in good data: it's the secret sauce for business success!

What are the Data Quality Challenges?

1. Data quality issues on production:

Production-specific data quality issues are primarily caused by unexpected changes in the data and infrastructure failures.

A. Source and third-party data changes:

External data sources, such as third-party APIs, vendor feeds, or scraped websites, may introduce errors or inconsistencies, making it hard to use the data reliably. These issues can lead to system errors or missing values, which might go unnoticed without proper monitoring.

Example:

  • File formats change without warning:

Imagine we're using an API to get data in CSV format, and we’ve made a pipeline that handles it well.

CODE: https://gist.github.com/velotiotech/ef7f7baefd728ffd2637479221b8b8a5.js

Now the data source switches to JSON, breaking our pipeline. If our system can't adapt, this inconsistency causes errors or missing data downstream; monitoring and timely adjustments keep data analysis and applications accurate.
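Since the gist above isn't inlined here, the following is a minimal defensive sketch of handling such a switch (the endpoint URL is hypothetical, and the original pipeline may differ): it checks the response's content type, tolerates the new format, and fails loudly on anything unexpected instead of silently producing bad data.

```python
import csv
import io

import requests

API_URL = "https://example.com/api/export"  # hypothetical endpoint


def fetch_records():
    resp = requests.get(API_URL, timeout=30)
    resp.raise_for_status()
    content_type = resp.headers.get("Content-Type", "")

    # The pipeline originally assumed CSV; handle JSON too and alert on anything else.
    if "json" in content_type:
        return resp.json()
    if "csv" in content_type or "text" in content_type:
        return list(csv.DictReader(io.StringIO(resp.text)))
    raise ValueError(f"Unexpected source format: {content_type!r}")
```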

  • Malformed data values and schema changes:

Suppose we're handling inventory data for an e-commerce site. The starting schema for our inventory dataset might have fields like the ones sketched below.
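The original schema listing isn't shown here, so this is an illustrative record; only quantity and last_updated_at are named in the text, and the other fields are assumptions:

```python
# Illustrative inventory record; only "quantity" and "last_updated_at" appear in the text above.
inventory_record = {
    "product_id": "SKU-1042",                   # assumed field
    "product_name": "Wireless Mouse",           # assumed field
    "quantity": 37,                             # later renamed to "qty" upstream
    "last_updated_at": "2024-05-10T14:32:00Z",  # later switched to an epoch timestamp
}
```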

Now, imagine that the inventory file's schema changes suddenly: the "quantity" column has been renamed to "qty," and the last_updated_at field has switched from a human-readable timestamp to an epoch timestamp.

This change might not be communicated in advance, leaving our data pipeline unprepared to handle the renamed field and the new time format.

B. Infrastructure failures:

Reliable software is crucial for processing large data volumes, but even the best tools can encounter issues. Infrastructure failures, like glitches or overloads, can disrupt data processing regardless of the software used.

Solution: 

Data observability tools such as Monte Carlo, Bigeye, and Great Expectations help detect these issues by monitoring for changes in data quality and infrastructure performance. These tools are essential for identifying the root causes of data problems and alerting on them, ensuring data reliability in production environments.

2. Data quality issues during development:

Development-specific data quality issues are primarily caused by untested code changes.

A. Incorrect parsing of data:

Data transformation bugs can occur due to mistakes in code or parsing, leading to data type mismatches or schema inaccuracies.

Example:

Imagine we're converting a date string to a Unix epoch timestamp using Python, assuming a "YYYY-MM-DD" layout. A misunderstanding of the strptime() format specifier leads to unexpected outcomes.

CODE: https://gist.github.com/velotiotech/e6b1ce9e63e70d55e47ab5af6f760cc2.js

This error makes strptime() interpret "2024" as the year, "05" as the month (instead of the day), and "10" as the day (instead of the month), leading to inaccurate data in the timestamp_dt variable.
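Since the gist isn't inlined here, below is a minimal sketch of this kind of bug. It assumes the source feed actually emits dates as year-day-month (the date value is illustrative), so parsing with the assumed "%Y-%m-%d" specifier silently swaps month and day:

```python
from datetime import datetime, timezone

date_str = "2024-05-10"  # in this hypothetical feed: 5 October 2024 (year-day-month)

# Buggy parse: assumes year-month-day, so "05" becomes the month and "10" the day.
buggy_dt = datetime.strptime(date_str, "%Y-%m-%d")   # -> 10 May 2024 (wrong)

# Correct parse for this feed: year-day-month.
fixed_dt = datetime.strptime(date_str, "%Y-%d-%m")   # -> 5 October 2024

# Converting both to Unix epoch timestamps (UTC assumed) shows how far apart the results are.
buggy_epoch = int(buggy_dt.replace(tzinfo=timezone.utc).timestamp())
fixed_epoch = int(fixed_dt.replace(tzinfo=timezone.utc).timestamp())
print(buggy_epoch, fixed_epoch)
```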

B. Misapplied or misunderstood requirements:

Even with technically correct code, data quality problems can still occur if requirements are misunderstood, resulting in logic errors and bad data downstream.

Example:
Imagine we’re assigned to validate product prices in a dataset, ensuring they fall between $10 and $100.

CODE: https://gist.github.com/velotiotech/d72a14cdbdde666814ea2bfeda56034a.js

The requirement intends prices to be strictly between $10 and $100, but a misinterpretation leads the code to check whether prices are >= $10 and <= $100. This makes boundary values such as $10 pass validation, causing a data quality problem.
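Here is a minimal sketch of that boundary bug (the function names are illustrative, not from the original gist):

```python
def is_valid_price_buggy(price: float) -> bool:
    # Misread requirement: inclusive bounds.
    return 10 <= price <= 100


def is_valid_price_fixed(price: float) -> bool:
    # Intended requirement: strictly between $10 and $100.
    return 10 < price < 100


print(is_valid_price_buggy(10.0))  # True  -> incorrectly accepted
print(is_valid_price_fixed(10.0))  # False -> correctly rejected
```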

C. Unaccounted downstream dependencies:

Despite careful planning and logic, data quality incidents can occur due to overlooked dependencies. Understanding data lineage and communicating effectively across all users is crucial to preventing such incidents.

Example:

Suppose we’re working on a database schema migration project for an e-commerce system. In the process, we rename the order_date column to purchase_date in the orders table. Despite careful planning and testing, a data quality issue arises due to an overlooked downstream dependency. The marketing team's reporting dashboard relies on a SQL query referencing the order_date column, now renamed purchase_date, resulting in inaccurate reporting and potentially misinformed business decisions.

Here's an example SQL query that represents the overlooked downstream dependency:

CODE: https://gist.github.com/velotiotech/5968ed108c8a1c1c84179b694d3ff3e8.js

This SQL query relies on the order_date column to calculate monthly sales metrics. After the schema migration, this column no longer exists, causing query failure and inaccurate reporting.

Solutions:

Data quality tools like Great Expectations and Deequ proactively catch data quality issues by testing changes introduced by data-processing code, preventing problems from reaching production.

a. Testing assertions: Assertions validate data against expectations, ensuring data integrity. While useful, they require careful maintenance and should be selectively applied.

Example:
Suppose we have an "orders" table in our dbt project and need to ensure the "total_amount" column contains only numeric values; we can write a dbt test to validate this data quality rule.

CODE: https://gist.github.com/velotiotech/c70018372eed6c25ce8864f81026b443.js

In this dbt test code:

  • We specify the dbt version (version: 2), model named "orders," and "total_amount" column.
  • Within the "total_amount" column definition, we add a test named "data_type" with the value "numeric," ensuring the column contains only numeric data.
  • Running the dbt test command will execute this test, checking if the "total_amount" column adheres to the numeric data type. Any failure indicates a data quality issue.
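For teams asserting data quality in Python rather than dbt, a similar check can be expressed with Great Expectations. This is a minimal sketch using the legacy pre-1.0 PandasDataset API (the sample data is made up):

```python
import great_expectations as ge
import pandas as pd

# Sample data standing in for the "orders" table.
orders = ge.from_pandas(pd.DataFrame({"total_amount": [25.0, 12.5, 99.9]}))

# Assert that total_amount is numeric and never null.
type_check = orders.expect_column_values_to_be_of_type("total_amount", "float64")
null_check = orders.expect_column_values_to_not_be_null("total_amount")

print(type_check.success, null_check.success)  # any False signals a data quality issue
```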

b. Comparing staging and production data: Data Diff is a CLI tool that compares datasets within or across databases, highlighting changes in data much like git diff highlights changes in source code, which helps detect data quality issues early in the development process.

Here's a data-diff example between staging and production databases for the payment_table.

CODE: https://gist.github.com/velotiotech/627aa42aee3c796a45dd52448b50048a.js

Source: https://docs.datafold.com/data_diff/what_is_data_diff
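Conceptually, the comparison works like the simplified pandas sketch below. This is not the data-diff tool itself (which diffs tables far more efficiently); the connection strings and column names are hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings for the two environments.
staging = create_engine("postgresql://user:pass@staging-db/shop")
production = create_engine("postgresql://user:pass@prod-db/shop")

stg = pd.read_sql("SELECT id, amount, status FROM payment_table", staging)
prd = pd.read_sql("SELECT id, amount, status FROM payment_table", production)

# Rows present in one environment but not the other: a crude row-level "diff".
diff = stg.merge(prd, how="outer", indicator=True).query("_merge != 'both'")
print(diff)
```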

What are some best practices for maintaining high-quality data?

  1. Establish Data Standards: Define clear data standards and guidelines for data collection, storage, and usage to ensure consistency and accuracy across the organization.
  2. Data Validation: Implement validation checks to ensure data conforms to predefined rules and standards, identifying and correcting errors early in the data lifecycle.
  3. Regular Data Cleansing: Schedule regular data cleansing activities to identify and correct inaccuracies, inconsistencies, and duplicates in the data, ensuring its reliability and integrity over time.
  4. Data Governance: Establish data governance policies and procedures to manage data assets effectively, including roles and responsibilities, data ownership, access controls, and compliance with regulations.
  5. Metadata Management: Maintain comprehensive metadata to document data lineage, definitions, and usage, providing transparency and context for data consumers and stakeholders.
  6. Data Security: Implement robust data security measures to protect sensitive information from unauthorized access, ensuring data confidentiality, integrity, and availability.
  7. Data Quality Monitoring: Continuously monitor data quality metrics and KPIs to track performance, detect anomalies, and identify areas for improvement, enabling proactive data quality management.
  8. Data Training and Awareness: Provide data training and awareness programs for employees to enhance their understanding of data quality principles, practices, and tools, fostering a data-driven culture within the organization.
  9. Collaboration and Communication: Encourage collaboration and communication among stakeholders, data stewards, and IT teams to address data quality issues effectively and promote accountability and ownership of data quality initiatives.
  10. Continuous Improvement: Establish a culture of continuous improvement by regularly reviewing and refining data quality processes, tools, and strategies based on feedback, lessons learned, and evolving business needs.

Can you recommend any tools for improving data quality?

  1. AWS Deequ: Deequ is an open-source data quality library from AWS built on top of Apache Spark. It provides tools for defining data quality rules and validating large-scale datasets in Spark-based data processing pipelines.
  2. Great Expectations: Great Expectations (GX) is an open-source Python framework for defining, running, and documenting data validations; GX Cloud is its fully managed SaaS offering that simplifies deployment, scaling, and collaboration so you can focus on data validation.
  3. Soda: Soda allows data engineers to test data quality early and often in pipelines to catch data quality issues before they have a downstream impact.
  4. Datafold: Datafold is a cloud-based data quality platform that automates and simplifies the process of monitoring and validating data pipelines. It offers features such as automated data comparison, anomaly detection, and integration with popular data processing tools like dbt.

Considerations for Selecting a Data QA Tool:

Selecting a data QA (Quality Assurance) tool hinges on your specific needs and requirements. Consider factors such as: 

1. Scalability and Performance: Ensure the tool can handle current and future data volumes efficiently, with real-time processing capabilities.

Example: Great Expectations helps validate data in big data environments by providing a scalable, customizable way to define and monitor data quality across different sources.

2. Data Profiling and Cleansing Capabilities: Look for comprehensive data profiling and cleansing features to detect anomalies and improve data quality.

Example: AWS Glue DataBrew offers data profiling, cleaning, and normalization; it can also map data lineage and automate recurring cleaning and normalization tasks.

3. Data Monitoring Features: Choose tools with continuous monitoring capabilities, allowing you to track metrics and establish data lineage.

Example: Datafold's monitoring feature allows data engineers to write SQL checks to find anomalies and create automated alerts.

4. Seamless Integration with Existing Systems: Select a tool compatible with your existing systems to minimize disruption and facilitate seamless integration.

Example: dbt offers seamless integration with existing data infrastructure, including data warehouses and BI tools. It allows users to define data transformation pipelines using SQL, making it compatible with a wide range of data systems.

5. User-Friendly Interface: Prioritize tools with intuitive interfaces for quick adoption and minimal training requirements.

Example: Soda SQL is an open-source tool with a simple command-line interface (CLI) and Python library to test your data through metric collection.

6. Flexibility and Customization Options: Seek tools that offer flexibility to adapt to changing data requirements and allow customization of rules and workflows.

Example: dbt offers flexibility and customization options for defining data transformation workflows.

7. Vendor Support and Community: Evaluate vendors based on their support reputation and active user communities for shared knowledge and resources.

Example: AWS Deequ is supported by Amazon Web Services (AWS) and has an active community of users. It provides comprehensive documentation, tutorials, and forums for users to seek assistance and share knowledge about data quality best practices.

8. Pricing and Licensing Options: Consider pricing models that align with your budget and expected data usage, such as subscription-based or volume-based pricing.

Example: Great Expectations offers flexible pricing and licensing options, including both an open-source (freely available) edition and an enterprise edition (subscription-based).

Ultimately, the right tool should effectively address your data quality challenges and seamlessly fit into your data infrastructure and workflows.

Conclusion: The Vital Role of Data Quality

In conclusion, data quality is paramount in today's digital age. It underpins informed decisions, sound strategy, and business success. Without it, organizations risk flawed judgments, inefficiencies, and a loss of competitiveness. Recognizing its vital role empowers businesses to drive innovation, enhance customer experiences, and achieve sustainable growth. Investing in robust data management, embracing the right technology, and fostering a culture of data integrity are essential. Prioritizing data quality is key to seizing new opportunities and staying ahead in a data-driven landscape.

References:

https://docs.getdbt.com/docs/build/data-tests

https://www.soda.io/platform

https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/

https://www.soda.io/resources/introducing-soda-sql


Did you like the blog? If yes, we're sure you'll also like to work with the people who write them - our best-in-class engineering team.

We're looking for talented developers who are passionate about new emerging technologies. If that's you, get in touch with us.

Explore current openings
