Handling Missing Data in Machine Learning

9/19/2025


Handling Missing Data in Machine Learning: Techniques, Examples, and Best Practices

Introduction

Missing data is one of the most common challenges in machine learning projects. Incomplete data can reduce model accuracy, introduce bias, and make predictions unreliable. Handling missing values well is a crucial part of data preprocessing and directly affects the quality of your machine learning model.

In this article, we will explore:

  • Types of missing data

  • Techniques to handle missing data

  • Practical Python examples

  • Best practices for dealing with missing values


Types of Missing Data

Before applying any technique, it is important to understand the type of missing data:

| Type | Description | Example |
|------|-------------|---------|
| MCAR (Missing Completely at Random) | Missingness is unrelated to any variable, observed or unobserved. | A sensor fails occasionally |
| MAR (Missing at Random) | Missingness depends only on other observed variables. | People with higher salaries skip the age field (salary is recorded) |
| MNAR (Missing Not at Random) | Missingness depends on the missing value itself or on unobserved variables. | People with low income choose not to report income |

Identifying the type of missingness helps in selecting the most appropriate handling method.
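As a rough first check, you can measure how much is missing per column and compare an observed variable between rows with and without a missing value. The snippet below is a minimal sketch on a hypothetical toy DataFrame (the column names and values are made up for illustration). A large gap is consistent with MAR, but MNAR cannot be confirmed or ruled out from the observed data alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Age tends to be missing for higher salaries
df = pd.DataFrame({
    "Age": [25, np.nan, 30, np.nan, 41, 35],
    "Salary": [50000, 90000, 52000, 95000, 60000, 58000],
})

# Fraction of missing values in each column
missing_share = df.isna().mean()

# Mean Salary for rows where Age is present (False) vs. missing (True).
# A large difference suggests missingness in Age is related to Salary.
salary_by_age_missing = df.groupby(df["Age"].isna())["Salary"].mean()

print(missing_share)
print(salary_by_age_missing)
```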


Common Techniques to Handle Missing Data

There are several ways to handle missing values in machine learning datasets.


1. Deletion Methods

  • Listwise Deletion: Remove rows that contain any missing value.

  • Column Deletion: Remove entire columns with too many missing values.

Pros: Easy and fast.
Cons: Discards potentially valuable data and can bias results unless the data are MCAR. Best suited to cases where very few values are missing.
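Both deletion methods can be sketched with pandas `dropna`. The DataFrame and the 50% threshold below are illustrative assumptions, not a recommended cutoff.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "Notes" is mostly empty
df = pd.DataFrame({
    "Age": [25, np.nan, 30, 28],
    "Salary": [50000, 60000, np.nan, 55000],
    "Notes": [np.nan, np.nan, np.nan, "ok"],
})

# Column deletion: keep columns with at least 50% non-missing values
min_non_missing = int(np.ceil(0.5 * len(df)))
df_cols = df.dropna(axis=1, thresh=min_non_missing)

# Listwise deletion: drop every remaining row that has any missing value
df_rows = df_cols.dropna(axis=0, how="any")

print(df_cols.columns.tolist())
print(df_rows)
```

Dropping the sparse column first keeps more rows than applying listwise deletion to the full table.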


2. Simple Imputation Methods

  • Mean/Median/Mode Imputation: Replace missing numeric values with mean or median, and categorical values with mode.

  • Constant Imputation: Replace missing entries with a fixed value like 0 or "Unknown".

  • Forward/Backward Fill (for time series): Fill missing values with previous or next observations.

Pros: Quick and simple.
Cons: Can distort the data distribution when many values are missing (for example, mean imputation shrinks the variance).
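The simple strategies above map onto a few pandas calls. This is a minimal sketch; the columns, the dates, and the "Unknown" label are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 28],
    "City": ["Paris", np.nan, "Paris", "Rome"],
})

# Numeric column: fill with the median of the observed values
age_median = df["Age"].fillna(df["Age"].median())

# Categorical column: fill with the most frequent value (mode)
city_mode = df["City"].fillna(df["City"].mode()[0])

# Constant imputation: fill with a fixed label
city_const = df["City"].fillna("Unknown")

# Time series: forward fill, then backward fill the leading gap
readings = pd.Series(
    [np.nan, 20.5, np.nan, np.nan, 21.0],
    index=pd.date_range("2025-01-01", periods=5, freq="D"),
)
filled = readings.ffill().bfill()
```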


3. Advanced (Model-Based) Imputation

  • K-Nearest Neighbors (KNN) Imputation: Uses similar records to estimate missing values.

  • Regression Imputation: Predicts missing values using other features.

  • Multiple Imputation: Creates several datasets with different imputed values and combines the results, so that estimates reflect the uncertainty of the imputation.

Pros: Often more accurate than simple imputation because they use relationships between features.
Cons: Computationally expensive and complex.
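scikit-learn offers `KNNImputer` and the still-experimental `IterativeImputer`, which fits a regression model per feature. The sketch below uses a small made-up array. Drawing several imputations with `sample_posterior=True` and different seeds is one way to approximate multiple imputation, not a full statistical procedure. In practice, features are usually scaled before KNN imputation so that no single column dominates the distance.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical data: columns are Age and Salary
X = np.array([
    [25.0, 50000.0],
    [np.nan, 60000.0],
    [30.0, np.nan],
    [28.0, 55000.0],
    [35.0, 70000.0],
])

# KNN imputation: average the feature over the 2 most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Regression-style imputation: each feature is predicted from the others
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

# Several imputed datasets, each drawn from the predictive distribution
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(3)
]
```

Each array in `imputations` would be analysed separately, with the results pooled afterwards.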


4. Use Algorithms That Handle Missing Data

Some machine learning algorithms (like XGBoost, LightGBM, and CatBoost) can handle missing values internally without the need for preprocessing.


Example: Handling Missing Data with Python

import pandas as pd
from sklearn.impute import SimpleImputer

# Sample dataset
df = pd.DataFrame({
    'Age': [25, None, 30, 28],
    'Salary': [50000, 60000, None, 55000]
})

# Mean Imputation
imputer = SimpleImputer(strategy='mean')
df[['Age','Salary']] = imputer.fit_transform(df[['Age','Salary']])

print(df)

Output:

         Age   Salary
0  25.000000  50000.0
1  27.666667  60000.0
2  30.000000  55000.0
3  28.000000  55000.0

Best Practices

  • Analyze the pattern of missingness before deciding on a method.

  • Fit the imputer on the training set only, then apply that same fitted transformation to the test set, so no test-set statistics leak into training.

  • Add a “missing indicator” feature to capture missingness information.

  • Always evaluate model performance before and after handling missing data.
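Several of these practices can be combined in a scikit-learn pipeline. The following is a minimal sketch on synthetic data; the model choice, split ratio, and missing rate are assumptions. The imputer is fitted on the training split only, and `add_indicator=True` appends missing-indicator columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data with about 15% of entries removed at random
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.15] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Medians are learned from X_train only and reused on X_test
pipe = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LogisticRegression(),
)
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(f"Test accuracy: {score:.2f}")
```

Swapping the imputer for another strategy and comparing `score` is a simple way to evaluate performance before and after a change.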


Summary Table

| Method | When to Use | Notes |
|--------|-------------|-------|
| Deletion | Few missing values | Risk of data loss |
| Mean/Median/Mode | Small gaps | Quick but can reduce variance |
| KNN/Regression | Larger datasets | More accurate, slower |
| Multiple Imputation | MAR missingness | Statistically robust |
| Algorithm Support | Built-in handling | Supported by few models |

Conclusion

Handling missing data is a critical step in machine learning preprocessing. Whether you choose deletion, imputation, or advanced model-based techniques depends on your dataset size, type of missingness, and project requirements. By cleaning and imputing your data properly, you can significantly improve model accuracy and reliability.
