When data lies: Addressing and handling bias in data analysis
The moment we assume our data is neutral… is the moment we stop being data scientists.
One of the humbling aspects of data analysis is its ability to prove you wrong. Data will always show trends and patterns, but when you’re buried in numbers, it’s easy to slip into constant “analyzing” while unintentionally overlooking the data’s deeper implications.
I often refer to my article: “The human side to data”, because it always reminds me that data represents a real person, animal, or thing.
In January, I met with another data professional to discuss approaches to survey data analysis, a field I’ve been exploring recently. Somewhere along the way, the conversation turned into a debate about bias in data science.
“Bias isn’t a big deal if the data is large enough,” I said.
“But even big data can be biased,” he countered. “If your sample isn’t representative, size won’t fix it.”
In short, I had claimed that bias becomes insignificant once a dataset is large enough, while he maintained that size cannot fix a sample that was biased at collection. I was about to argue further when we pulled up a dataset on loan approvals. At first glance, the numbers seemed fine, until we stratified by gender.
We saw a trend: women were being denied loans at a higher rate despite similar credit scores.
A study from the University of Washington’s Foster School of Business supports what we saw: women-owned businesses face higher interest rates and lower approval rates than men-owned businesses with similar risk profiles.
We were quiet for a moment as this revelation settled our argument. The dataset wasn’t biased on the surface, but the decision-making process that generated it was. We had both been wrong.
How bias gets into data
Bias is not just a statistical inconvenience; it is a flaw that produces misleading insights and wrong decisions. In data analysis, bias occurs at different stages: data collection, processing, and interpretation.
1. Data collection bias: occurs when data is gathered in a way that skews results.
Selection (or sampling) bias: This is a scenario where the sample doesn’t represent the real or target population.
For example, a hiring algorithm trained only on past successful (male) candidates may overlook qualified women; or a product-pricing survey that only high-income earners answer will not reflect the broader customer base.
Non-response bias: When a specific group does not respond, this can lead to an unbalanced dataset and skewed results.
E.g., Younger people may ignore email surveys, while older people respond, leading to age-related skew.
2. Measurement and response bias: even when the sample is right, how you collect and structure responses can distort the data.
Measurement bias: This is introduced by errors in how data is collected or measured.
E.g., you measure job satisfaction on a rating scale whose labels skew positive (“good,” “great,” “excellent”), so scores come out inflated no matter who responds.
Response bias: occurs when participants answer inaccurately due to question-wording, social desirability, or survey fatigue.
E.g., employees may underreport dissatisfaction in a company survey for fear of repercussions.
Question order and wording bias: when the questions are phrased or ordered in a way that influences responses.
For example, a leading question like “How much do you like our services?” presumes the respondent already likes them and nudges answers toward the positive.
3. Analytical bias (data processing and interpretation): even with clean data, bias can creep in during analysis.
Confirmation bias: When analysts unconsciously seek patterns that confirm existing beliefs.
E.g., if you expect younger employees to have higher engagement, you may subconsciously favor data that supports that.
Omitted variable bias: When an important factor is left out of the analysis, leading to misleading conclusions.
E.g., a study on employee performance that excludes work-life balance might misinterpret stress-related productivity declines.
Survivorship bias: When conclusions are drawn only from successful or visible cases, ignoring failures.
For example, analyzing only profitable startups may lead to misleading conclusions about success rates in entrepreneurship.
Handling bias: A human approach
1. Identify the type of bias: Understand where the bias originated.
2. Minimize bias at the survey design stage
Use random sampling: One of the best ways to reduce bias is to ensure diverse representation by randomizing respondent selection (see the sketch after this list).
Balance demographics: Set quotas (age, gender, location) to match the target population, so the sample mirrors the people you are actually studying.
Neutral wording/order: Avoid leading questions and rotate question order to reduce priming effects. Be as neutral as possible; the closer the data is to bias-free, the more actionable the insights.
Offer anonymous responses: Confidentiality can encourage honest feedback and improve data integrity.
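The first two points are easy to sketch in code. Below is a minimal pandas example of simple random sampling and proportional quota sampling; the file name (customer_frame.csv) and the age_group column are hypothetical placeholders:
import pandas as pd
frame = pd.read_csv("customer_frame.csv")  # hypothetical sampling frame
# Simple random sample: every person in the frame has an equal chance
simple = frame.sample(n=500, random_state=42)
# Quota (stratified) sample: draw the same fraction within each demographic stratum
quota = frame.groupby("age_group", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=42)
)
Fixing random_state makes the draw reproducible for auditing; in production you would re-randomize each survey wave.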
3. Detecting & correcting bias in data preprocessing
Once responses are collected, you can statistically adjust for bias (a sketch follows the list below):
Weighting responses: Assign higher weights to underrepresented groups to balance the dataset.
Imputation for missing data: Use mean/mode imputation or predictive modeling to estimate missing data.
Outlier detection: Identify and remove extreme responses that skew results (e.g., someone rating everything as 10/10).
Compare with other known data: Cross-check against census or industry benchmarks.
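Here is a minimal pandas sketch of the first three adjustments; the file name, the column names, and the census-style age-group shares (which double as the external benchmark from the last point) are all hypothetical assumptions:
import pandas as pd
df = pd.read_csv("survey_data.csv")  # hypothetical survey export
# Weighting: upweight groups that are underrepresented relative to the population
population_share = {"18-29": 0.25, "30-49": 0.40, "50+": 0.35}  # assumed benchmark shares
sample_share = df["age_group"].value_counts(normalize=True)
df["weight"] = df["age_group"].map(population_share) / df["age_group"].map(sample_share)
# Imputation: fill missing numeric answers with the column mean
df["satisfaction"] = df["satisfaction"].fillna(df["satisfaction"].mean())
# Outlier detection: drop straight-liners who rated every question 10/10
rating_cols = [c for c in df.columns if c.startswith("q_")]
df = df[~(df[rating_cols] == 10).all(axis=1)]
Downstream statistics should then use the weight column, e.g., numpy.average(df["satisfaction"], weights=df["weight"]) instead of a plain mean.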
Let’s take the loan approval dataset and walk through an approach to detecting and mitigating bias.
1. EDA for bias detection
Before running models, analyze subgroup distributions. In Python:
import pandas as pd
df = pd.read_csv("loan_data.csv")
# Check approval rates by gender
df.groupby("Gender")["Approved"].mean()
If there’s a significant gap, bias may be present.
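To go beyond eyeballing the gap, a chi-squared test of independence can indicate whether the difference in approval rates is larger than chance alone would explain. A short sketch, reusing the same hypothetical columns:
from scipy.stats import chi2_contingency
# Contingency table: gender vs. approval outcome
table = pd.crosstab(df["Gender"], df["Approved"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
# A small p-value (e.g., < 0.05) suggests approval rates genuinely differ by gender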
2. Rebalancing the data using SMOTE
If one class is underrepresented, apply resampling or reweighting. SMOTE, for example, synthesizes new minority-class examples; note that it balances the outcome classes (approved vs. denied), so to balance demographic groups you would resample or reweight by group instead.
from imblearn.over_sampling import SMOTE
# SMOTE needs numeric features, so one-hot encode categorical columns first
X = pd.get_dummies(df.drop(columns=["Approved"]))
y = df["Approved"]
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
3. Using fairness-aware models: techniques such as adversarial debiasing or fairness constraints, implemented in toolkits like IBM’s AI Fairness 360, can help mitigate bias.
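As one concrete example, AI Fairness 360 includes a Reweighing preprocessor that assigns instance weights to offset group/label imbalance. A minimal sketch, assuming the loan dataframe has been fully numerically encoded and that Gender is coded 0/1 (both assumptions, not part of the original dataset):
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import Reweighing
dataset = BinaryLabelDataset(
    df=encoded_df,  # hypothetical fully numeric version of the loan dataframe
    label_names=["Approved"],
    protected_attribute_names=["Gender"],
)
rw = Reweighing(
    unprivileged_groups=[{"Gender": 0}],  # assumed encoding
    privileged_groups=[{"Gender": 1}],
)
reweighted = rw.fit_transform(dataset)
# reweighted.instance_weights can be passed to any model that accepts sample weights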
Bias awareness. Bias is always there.
The lesson I learned that day? Bias is not just an occasional data flaw. Even after adjustments, bias can still exist. So, what do we do?
1. Acknowledge it: Mention potential limitations in your findings. You have probably seen analysts or statisticians spell out the limitations of their data, tools, or methods; that is bias awareness in practice.
2. Measure it: Use fairness metrics to quantify bias in models (a minimal example follows this list).
3. Compare with external data: Validate your insights against historical trends or industry benchmarks.
4. Actively mitigate it: Run sensitivity analyses to test if different methodologies significantly change results.
5. Address it in your reporting: Be transparent and communicate bias risks in your conclusions.
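For point 2, one widely used metric is the disparate impact ratio (the “four-fifths rule” from US hiring guidelines), sketched here on the loan dataframe from earlier:
# Ratio of the lowest group approval rate to the highest
rates = df.groupby("Gender")["Approved"].mean()
di_ratio = rates.min() / rates.max()
print(f"Disparate impact ratio: {di_ratio:.2f}")  # below ~0.8 is a common red flag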
The moment we assume our data is neutral… is the moment we stop being data scientists.
Be data-informed, data-driven, but not data-obsessed — Amy
Biz and whimsy: https://linktr.ee/amyusifoh
Consider supporting DATM: https://buymeacoffee.com/omomou
🔗 Connect with me on LinkedIn and GitHub for more data analytics insights.