Data Verification vs. Data Validation vs. Data Validity
Data verification covers the technical accuracy of data handling and transformations; data validation covers correctness and completeness at the point of entry; and data validity covers the data’s contextual relevance and real-world representativeness.
Data integrity and relevance are non-negotiable in data analysis and data science because they are prerequisites for generating good insights and making informed decisions.
Poor-quality data means poor-quality insights, and organizations risk making decisions based on flawed information.
To ensure data integrity, analysts should master three critical concepts: data validation, data verification, and data validity.
Although these terms are often used interchangeably, they address different aspects of data quality. They play distinct but complementary roles in achieving a common goal.
I’ll explain their distinctions, provide actionable steps to implement them, and offer tips on communicating these processes effectively.
Data Validation
In the traditional sense, data validation ensures that data is correct, complete, and meaningful before it is accepted into a system or used for analysis. It checks whether the data meets predefined rules or business requirements at the point of entry or ingestion.
Key characteristics of data validation
When it happens: At the point of data entry or during ingestion.
Purpose: To prevent data errors and invalid data from entering the system.
Focus: To ensure that incoming data is accurate and meets operational rules (e.g., data format, completeness, range).
Examples of data validation
1. At data entry (a minimal pandas sketch follows the tools list below):
Input format checks: e.g., phone numbers follow the pattern (XXX) XXX-XXXX and Order Date is in the correct YYYY-MM-DD format.
Range validation: Validate that columns that hold sales values, e.g., Order Value and Quantity Sold, are positive numbers greater than 0.
Mandatory fields: Fields like Customer Name and Order Date are not null.
Dropdowns and constraints: Enforce allowed values (e.g., Payment Status must be “Paid”, “Pending”, or “Failed”).
2. In preparation for analysis:
Check for duplicate rows.
Filter datasets to include only relevant time periods or regions.
→ Tools for data validation
Forms or APIs: You can use real-time validation for user inputs, e.g. forms with required fields.
ETL pipelines: Pre-load pipeline checks to ensure the data conforms to schema requirements.
Database constraints: Using primary keys, foreign keys, and unique constraints on columns helps with data validation.
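To make this concrete, here is a minimal pandas sketch of the entry-style checks above. It assumes a sales table with Order Value, Quantity Sold, Order Date, Customer Name, and Payment Status columns; the file and column names are illustrative.
import pandas as pd
# Load the raw data (file name is illustrative)
data = pd.read_csv("sales_data.csv")
# Range validation: Order Value and Quantity Sold must be positive
range_ok = (data["Order Value"] > 0) & (data["Quantity Sold"] > 0)
# Format validation: Order Date must parse as YYYY-MM-DD
parsed_dates = pd.to_datetime(data["Order Date"], format="%Y-%m-%d", errors="coerce")
format_ok = parsed_dates.notna()
# Mandatory fields: Customer Name and Order Date must not be null
mandatory_ok = data[["Customer Name", "Order Date"]].notna().all(axis=1)
# Allowed values: Payment Status must be one of the expected labels
status_ok = data["Payment Status"].isin(["Paid", "Pending", "Failed"])
# Rows failing any rule are candidates for rejection or review
invalid_rows = data[~(range_ok & format_ok & mandatory_ok & status_ok)]
print(f"Rows failing validation: {len(invalid_rows)}")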
Data Verification
Data verification ensures that data handling processes and transformations are accurate and error-free.
It confirms that operations performed on the data (aggregations, calculations, and transformations) output correct and reliable results.
Key characteristics of data verification
When it happens: After data processing, during and after data analysis.
Purpose: To detect and correct errors introduced during data handling.
Focus: Technical checks on processes (e.g., aggregations, joins).
Examples of data verification
Aggregation checks: Verify that calculated metrics and aggregations, like monthly sales totals, match source data.
ETL verification: Check for unintended changes (e.g., missing rows after merging datasets or data loss during ETL).
Relationship audits: Audit column relationships (e.g., Customer ID consistently linked to Order ID).
Check data completeness: All necessary fields, e.g., Order Date, Order Number, and Customer ID, are available. (A pandas sketch of these checks follows the tools list below.)
→ Tools for data verification
ETL logs: Track record counts and transformation steps
Scripts and test cases: Validate calculations and transformations programmatically.
Audit trails/documentation: Always maintain logs for traceability and debugging and document ETL logs and scripts.
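As an illustration, here is a small pandas sketch of these verification checks. The reported total, the customers.csv lookup table, and the column names are assumptions for the example.
import pandas as pd
# Source data (file name is illustrative)
orders = pd.read_csv("sales_data.csv")
# Aggregation check: recompute a reported metric from the source data
# and compare it with the figure a report or pipeline produced (hypothetical value)
reported_total = 1_250_000.00
recomputed_total = orders["Order Value"].sum()
print("Totals match:", abs(recomputed_total - reported_total) < 0.01)
# ETL check: a left join onto a lookup table should not drop or duplicate order rows
customers = pd.read_csv("customers.csv")  # hypothetical reference table
merged = orders.merge(customers, on="Customer ID", how="left")
print(f"Rows before merge: {len(orders)}, after merge: {len(merged)}")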
Data Validity
Data validity is the state of data being accurate, relevant, and representative of the real-world phenomenon it is meant to reflect.
Key characteristics of data validity
When it applies: Continuously. Data validity should be checked during and after analysis so that the work stays relevant to stakeholders.
Purpose: To ensure data is meaningful and aligned with analysis goals.
Focus: Contextual relevance and accuracy.
Examples of data validity
Business relevance/context: Does the dataset cover the intended scope (e.g., 2023 North American sales) or represent the target population (avoid sampling bias)?
Benchmark against trends: Compare and validate trends identified in the analysis against industry benchmarks or historical data to confirm their plausibility.
Stakeholder alignment: Your results should address the key business questions, like the effects of targeted campaigns and promotions during holiday seasons.
Domain alignment: Confirm that trends identified in the analysis align with domain knowledge.
Incorporate stakeholder feedback: Share initial findings with business stakeholders for feedback to ensure they align with the primary objectives. You don’t want to make the same mistake I did.
6 Data Science Mistakes That Made Me Look Stupid (So You Don’t Have To)
Here is a 3-minute read by Trestle touching on similar subjects: Data Validation vs. Data Verification: Understanding the Differences (trestleiq.com)
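Returning to the benchmarking idea above, a plausibility check might compare the analysed monthly totals with historical or industry figures. This is a rough sketch; the benchmark numbers and column names are placeholders.
import pandas as pd
data = pd.read_csv("sales_data.csv")
data["Order Date"] = pd.to_datetime(data["Order Date"])
# Monthly revenue from the current analysis
monthly = data.groupby(data["Order Date"].dt.month)["Order Value"].sum()
# Historical or industry benchmarks for the same months (placeholder values)
benchmark = {11: 90_000, 12: 140_000}  # e.g., last year's holiday season
# Flag months that deviate sharply from the benchmark as worth a second look
for month, expected in benchmark.items():
    actual = monthly.get(month, 0)
    print(f"Month {month}: {(actual - expected) / expected:+.0%} vs. benchmark")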
How Do They Relate?
Validation, verification, and validity fit together across the data workflow: validation gates data as it enters, verification confirms that processing and transformations are error-free, and validity keeps the analysis aligned with the real-world question being asked.
A practical framework to implement DQ checks
Let’s use a simple dataset, e.g., sales data for an online store (mine) collected over a year.
Validation in practice
P.S. Validation can happen both at the point of entry and during analysis, to address business questions, goals, and context.
Ensure Order Value is greater than zero.
Data type validation: Numerical fields like Order Value are actually numeric, and Order Date is in a proper YYYY-MM-DD format.
Check that all required fields (Customer ID, Product ID, etc.) are non-null.
Verification after processing
Compare aggregated revenue totals with source data.
Verify that no records were lost during ETL processes (debug data pipelines)
Audit relationships between tables (e.g., Customer ID consistently matches Order ID).
Check data completeness: All necessary fields, e.g., Order Date, Order Number, and Customer ID, are available. (A small pandas sketch of these audits follows below.)
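Here is a small pandas sketch of these post-processing audits, assuming the orders live in sales_data.csv and a separate customers.csv reference table exists (both names are illustrative):
import pandas as pd
orders = pd.read_csv("sales_data.csv")
customers = pd.read_csv("customers.csv")  # hypothetical reference table
# Relationship audit: every Customer ID on an order should exist in the customer table
orphaned = ~orders["Customer ID"].isin(customers["Customer ID"])
print(f"Orders with an unknown Customer ID: {orphaned.sum()}")
# Completeness check: required columns exist and contain no nulls
required = ["Order Date", "Order Number", "Customer ID"]
missing_columns = [col for col in required if col not in orders.columns]
null_counts = orders[[col for col in required if col in orders.columns]].isnull().sum()
print(f"Missing columns: {missing_columns}")
print(f"Null counts:\n{null_counts}")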
Validity check for analysis
Using the e-commerce example, data validity checks would be:
Business relevance: validate that the data analyzed represents the correct region and time specified by the project goals.
Benchmark against trends: Compare sales trends identified in the analysis against industry or external benchmarks or historical data to confirm their plausibility. (e.g., industry-wide seasonal trends).
Stakeholder alignment: The results should address the key business questions.
Check for anomalies in trends (e.g., outliers in Order Value).
Bonus tip for workflow Optimization
Automate repetitive validation and verification checks with Python scripts or ETL pipelines and visualize results with tools like Tableau or Matplotlib.
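For example, the recurring checks can be wrapped in one function and run on every data refresh. This is a minimal sketch; the column name and the specific checks are assumptions you would tailor to your own pipeline.
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> dict:
    # Return a small summary of validation and verification results
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": df.isnull().sum().to_dict(),
        "non_positive_order_values": int((df["Order Value"] <= 0).sum()),
    }

data = pd.read_csv("sales_data.csv")
print(run_quality_checks(data))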
Related read: Demystifying Automation (python.plainenglish.io)
Visualizing Validation, Verification, and Validity
What better way to communicate the results of these processes than through a dashboard?
Here’s a simple idea. You can use:
1. Verification tab:
A bar chart can show missing records by column, which can be helpful in restructuring data collection and ETL processes.
A pie chart can show the distribution of records with incorrect data types.
These can be useful for tracking data verification processes over time. (A Matplotlib sketch of the missing-records chart appears after these tab ideas.)
2. Validation tab:
Use a heatmap to compare sales across regions and periods to identify unusual trends; heatmaps make regional and seasonal patterns easy to compare at a glance.
A scatter plot of Order Value vs. Order Date can help detect outliers.
3. Validity tab:
Summary cards are great for highlighting key findings. You can use this to highlight the top sales-driving region.
Scatter Plot: Compare data trends with external benchmarks.
Box Plot: Identify anomalies or outliers in key metrics.
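If you want to prototype the verification tab before building a full dashboard, a quick Matplotlib sketch of the missing-records bar chart could look like this (the file name is illustrative):
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("sales_data.csv")
# Missing records by column, for the verification tab
missing_by_column = data.isnull().sum()
missing_by_column.plot(kind="bar")
plt.title("Missing records by column")
plt.ylabel("Number of missing values")
plt.tight_layout()
plt.show()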
Why data verification, validation, and validity?
1. When data is accurate and relevant, decision-makers can trust the insights.
2. Unverified and invalid data can lead to wrong conclusions, waste resources (time and people), and affect reputations.
3. Regulatory compliance: Many industries, like healthcare or finance, require rigorous data validation to meet legal standards.
Applying data validation, verification, and validity ensures:
Data enters systems correctly (validation).
Data processes are error-free (verification).
Data is relevant and reliable for analysis (validity).
and with these, you can create a compelling narrative for stakeholders.
Good data means good insights.
Some challenges with data verification, validation, and validity
As crucial as these processes are, they can also be challenging:
Domain Knowledge: Accurate validation requires a deep understanding of the subject matter, which can be a barrier when dealing with complex or niche datasets.
Time Constraints: These processes can be time-consuming, particularly with large datasets or data from multiple sources.
Changing Requirements: When you build a pipeline for specific requirements, any change can disrupt the architecture, leading to scope creep or additional workload.
To address these challenges, collaborate with domain experts, automate validation processes to save time, and design flexible, modular pipelines that adapt to changing requirements with minimal rework.
Integrating these processes into your analytic workflow can enhance the quality and credibility of your analyses.
You can use tools like Python and Tableau to put these concepts into practice and visualize the processes, which can promote collaboration and transparency with stakeholders.
Using Python (Pandas):
sample code to run verification checks:
import pandas as pd
# Load the dataset
data = pd.read_csv("sales_data.csv")
# Count missing values per column
missing_values = data.isnull().sum()
# Count duplicate rows
duplicates = data.duplicated().sum()
# Inspect column data types
data_types = data.dtypes
print(f"Missing values:\n{missing_values}")
print(f"Duplicates: {duplicates}")
print(f"Data Types:\n{data_types}")
This code counts the missing values in each column, counts duplicate rows, and reports each column’s data type.
sample code for data validation:
# Convert Order Date to datetime so it can be filtered and grouped by month
data['Order Date'] = pd.to_datetime(data['Order Date'])
# Filter data for region and time frame
validated_data = data[(data['Region'] == 'North America') &
                      (data['Order Date'] >= '2023-01-01')]
print(validated_data.describe())
# Monthly sales trends
sales_by_month = validated_data.groupby(validated_data['Order Date'].dt.month)['Order Value'].sum()
print(sales_by_month)
This filters the data to North American sales from January 1st, 2023 onward, prints summary statistics for it, and totals the sales by month.
You can use Matplotlib or Seaborn to visualize these results.
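For instance, continuing from the snippet above, a quick Matplotlib line chart of sales_by_month might look like this sketch:
import matplotlib.pyplot as plt
# sales_by_month comes from the validation snippet above
sales_by_month.plot(kind="line", marker="o")
plt.title("North America: monthly sales, 2023")
plt.xlabel("Month")
plt.ylabel("Order Value")
plt.tight_layout()
plt.show()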
Data verification, validation, and validity can be the difference between actionable insights and costly missteps.
But as one reader commented: “Number 6 reminded me of the classic Systems Engineering saying with respect to Verification & Validation:
Verification — Are we building the thing right?
Validation — Are we building the right thing?”
Whether you’re using Python, Tableau, or Excel, the principles remain the same — quality data leads to quality decisions, and implementing these steps and leveraging dashboards for communication means you can tackle complex data projects with reliable results.
Be data-informed, data-driven, but not data-obsessed — Amy
Biz and whimsy: https://linktr.ee/ameusifoh
🔗 Connect with me on LinkedIn and GitHub for more data analytics insights.
#Dataanalysis #DataScience #Python #Tableau #PowerBI #Looker #Excel