9/27/2025

Dimensionality Reduction Techniques in Machine Learning: PCA & LDA Explained

In machine learning, handling datasets with a large number of features can lead to high computational cost, overfitting, and difficulty in visualization. Dimensionality reduction helps simplify these datasets by reducing the number of features while retaining important information.

Two of the most popular techniques are Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). In this article, we’ll explore these methods, their differences, applications, and examples in Python.


🔹 What is Dimensionality Reduction?

Dimensionality reduction is the process of reducing the number of input variables in a dataset while preserving as much important information as possible. Benefits include:

  • Lower computational cost

  • Reduced risk of overfitting

  • Easier visualization of high-dimensional data

  • Improved model performance

Dimensionality reduction can be unsupervised (e.g., PCA) or supervised (e.g., LDA).
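
As a quick illustration of the visualization benefit, a dataset with four features, such as scikit-learn's built-in Iris data, can be projected down to two dimensions for plotting. Here is a minimal sketch (assuming matplotlib is installed):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)            # 4 features per sample
X_2d = PCA(n_components=2).fit_transform(X)  # reduce to 2 for plotting

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()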


🔹 Principal Component Analysis (PCA)

Overview

  • PCA is an unsupervised technique.

  • It transforms features into a new set of orthogonal components called principal components.

  • The first principal component captures the maximum variance; each subsequent component captures the most remaining variance while staying orthogonal to the previous ones.

Key Steps

  1. Standardize the dataset.

  2. Compute the covariance matrix.

  3. Calculate eigenvalues and eigenvectors.

  4. Select the top principal components based on explained variance.

  5. Transform the dataset.
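
To make these steps concrete, here is a minimal NumPy sketch that performs them by hand (the function and variable names are illustrative, not from a library):

import numpy as np

def pca_manual(X, n_components):
    # 1. Standardize: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigenvalues and eigenvectors (eigh suits the symmetric covariance)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort components by explained variance, keep the top n_components
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 5. Project the data onto the selected components
    return X_std @ components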

Python Example

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.random.rand(100, 10)  # example dataset: 100 samples, 10 features

# Standardize features to zero mean and unit variance before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Project onto the two directions with the highest explained variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)

Applications

  • Image compression

  • Noise reduction

  • Feature extraction

  • Data visualization


🔹 Linear Discriminant Analysis (LDA)

Overview

  • LDA is a supervised dimensionality reduction technique.

  • It maximizes class separability while reducing dimensionality.

  • Works well when you have labeled data.

Key Steps

  1. Compute the mean vectors for each class.

  2. Compute the within-class and between-class scatter matrices.

  3. Compute the eigenvalues and eigenvectors of Sw⁻¹Sb, the inverse of the within-class scatter matrix multiplied by the between-class scatter matrix.

  4. Select the linear discriminants with the highest eigenvalues.

  5. Project the dataset onto the new feature subspace.
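
As with PCA, these steps can be sketched by hand in NumPy (names are illustrative; pinv stands in for a plain inverse in case Sw is singular):

import numpy as np

def lda_manual(X, y, n_components):
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    S_w = np.zeros((n_features, n_features))  # within-class scatter
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        # 1-2. Class means feed the scatter matrices
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_w += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += X_c.shape[0] * (diff @ diff.T)
    # 3. Eigen-decompose Sw^-1 Sb
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    # 4. Keep the discriminants with the largest eigenvalues
    order = np.argsort(eigvals.real)[::-1][:n_components]
    W = eigvecs[:, order].real
    # 5. Project the data onto the new subspace
    return X @ W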

Python Example

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import numpy as np

X = np.random.rand(100, 5)        # example dataset: 100 samples, 5 features
y = np.random.randint(0, 2, 100)  # binary class labels

# With two classes, at most n_classes - 1 = 1 discriminant is available
lda = LDA(n_components=1)
X_lda = lda.fit_transform(X, y)
print("Transformed dataset shape:", X_lda.shape)

Applications

  • Face recognition

  • Pattern classification

  • Text categorization

  • Medical diagnostics


🔹 PCA vs LDA

| Feature  | PCA                                | LDA                         |
|----------|------------------------------------|-----------------------------|
| Type     | Unsupervised                       | Supervised                  |
| Goal     | Maximize variance                  | Maximize class separability |
| Output   | Principal components               | Linear discriminants        |
| Use case | Feature extraction, visualization  | Classification problems     |

🔹 Advantages of Dimensionality Reduction

  • Reduces overfitting

  • Speeds up training

  • Improves visualization of complex datasets

  • Helps with noise reduction


🔹 Limitations

  • PCA may reduce interpretability, since each component is a mix of the original features

  • LDA requires labeled data and works best with normally distributed classes

  • Choosing the optimal number of components can be tricky
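
For the last point, one common heuristic (beyond the n_components=0.95 shortcut shown earlier) is to inspect the cumulative explained-variance curve from a full PCA fit. A small sketch on synthetic data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)
full = PCA().fit(X)  # keep all components to inspect the variance curve
cum_var = np.cumsum(full.explained_variance_ratio_)
n_keep = int(np.argmax(cum_var >= 0.95)) + 1  # smallest count reaching 95%
print("Components for 95% variance:", n_keep)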


🔹 Real-World Applications

  • Finance: Risk factor analysis using PCA

  • Healthcare: Disease classification using LDA

  • Computer Vision: Image recognition and compression

  • NLP: Topic modeling and text classification


Conclusion

Dimensionality reduction is a critical step in modern machine learning pipelines. PCA and LDA help simplify high-dimensional datasets, reduce computational cost, and improve model performance.

Choosing the right technique depends on your data type and goal—use PCA for unsupervised analysis and LDA for supervised classification tasks.