Dimensionality Reduction in Machine Learning | PCA & LDA Explained
In machine learning, handling datasets with a large number of features can lead to high computational cost, overfitting, and difficulty in visualization. Dimensionality reduction helps simplify these datasets by reducing the number of features while retaining important information.
Two of the most popular techniques are Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). In this article, we’ll explore these methods, their differences, applications, and examples in Python.
Dimensionality reduction is the process of reducing the number of input variables in a dataset while preserving as much important information as possible. Benefits include:
Lower computational cost
Reduced risk of overfitting
Easier visualization of high-dimensional data
Improved model performance
Dimensionality reduction can be unsupervised (e.g., PCA) or supervised (e.g., LDA).
PCA is an unsupervised technique.
It transforms features into a new set of orthogonal components called principal components.
The first principal component captures the maximum variance in the data; each subsequent component captures the maximum remaining variance while staying orthogonal to the previous ones.
Standardize the dataset.
Compute the covariance matrix.
Calculate eigenvalues and eigenvectors.
Select top principal components based on explained variance.
Transform the dataset.
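To make these steps concrete, here is a minimal NumPy sketch that follows them one by one on made-up data (the array X and the choice of two components are illustrative assumptions; the scikit-learn version below does the same work internally):
import numpy as np
X = np.random.rand(100, 10)  # Example dataset: 100 samples, 10 features
n_components = 2
# Step 1: standardize the dataset
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# Step 2: compute the covariance matrix (features as variables)
cov_matrix = np.cov(X_std, rowvar=False)
# Step 3: calculate eigenvalues and eigenvectors (eigh handles symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Step 4: select the top components by explained variance (largest eigenvalues first)
order = np.argsort(eigenvalues)[::-1]
top_vectors = eigenvectors[:, order[:n_components]]
explained_ratio = eigenvalues[order[:n_components]] / eigenvalues.sum()
# Step 5: transform the dataset onto the new subspace
X_reduced = X_std @ top_vectors
print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", explained_ratio)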
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
X = np.random.rand(100, 10) # Example dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
Image compression
Noise reduction
Feature extraction
Data visualization
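As a quick illustration of the visualization use case, a common pattern is to project a labeled dataset down to two components and plot them. The sketch below uses scikit-learn's bundled Iris dataset and matplotlib, both of which are assumptions on top of the article's examples:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Load a small labeled dataset and standardize it
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
# Project onto the first two principal components for plotting
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("Iris data projected onto two principal components")
plt.show()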
LDA is a supervised dimensionality reduction technique.
It maximizes class separability while reducing dimensionality.
Works well when you have labeled data.
Compute the mean vectors for each class.
Compute the within-class and between-class scatter matrices.
Compute eigenvalues and eigenvectors of the inverse of the within-class scatter matrix multiplied by the between-class scatter matrix.
Select linear discriminants with highest eigenvalues.
Transform dataset onto new feature subspace.
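Here is a rough NumPy sketch of those scatter-matrix steps for a two-class toy dataset; the random data is only for illustration, and the scikit-learn version below is what you would normally use:
import numpy as np
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)  # Two class labels
n_features = X.shape[1]
overall_mean = X.mean(axis=0)
S_W = np.zeros((n_features, n_features))  # Within-class scatter
S_B = np.zeros((n_features, n_features))  # Between-class scatter
for c in np.unique(y):
    X_c = X[y == c]
    mean_c = X_c.mean(axis=0)
    # Within-class scatter: spread of each class around its own mean
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    # Between-class scatter: spread of the class means around the overall mean
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += X_c.shape[0] * (diff @ diff.T)
# Linear discriminants: eigenvectors of inv(S_W) @ S_B with the largest eigenvalues
eigenvalues, eigenvectors = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigenvalues.real)[::-1]
w = eigenvectors[:, order[:1]].real  # Keep the top discriminant (at most n_classes - 1)
X_projected = X @ w
print("Projected shape:", X_projected.shape)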
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import numpy as np
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100) # Class labels
lda = LDA(n_components=1)
X_lda = lda.fit_transform(X, y)
print("Transformed dataset shape:", X_lda.shape)
Face recognition
Pattern classification
Text categorization
Medical diagnostics
| Feature | PCA | LDA |
| --- | --- | --- |
| Type | Unsupervised | Supervised |
| Goal | Maximize variance | Maximize class separability |
| Output | Principal components | Linear discriminants |
| Use case | Feature extraction, visualization | Classification problems |
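To see the table's distinction in code, the sketch below applies both methods to the same labeled dataset (again using Iris as an assumed example): PCA fits on the features alone, while LDA also takes the labels.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
# PCA is unsupervised: it only sees X and maximizes variance
X_pca = PCA(n_components=2).fit_transform(X_scaled)
# LDA is supervised: it needs y and maximizes class separability
X_lda = LDA(n_components=2).fit_transform(X_scaled, y)
print("PCA output shape:", X_pca.shape)
print("LDA output shape:", X_lda.shape)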
Reduces overfitting
Speeds up training
Improves visualization of complex datasets
Helps with noise reduction
PCA may lose interpretability of original features
LDA requires labeled data and works best when classes are roughly normally distributed with similar covariance structure
Choosing the optimal number of components can be tricky
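For PCA in particular, a common heuristic is to keep enough components to reach a target cumulative explained variance (for example 95%); a minimal sketch on random data:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = np.random.rand(100, 10)
X_scaled = StandardScaler().fit_transform(X)
# Fit PCA with all components and inspect the cumulative explained variance
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative explained variance reaches 95%
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% variance:", n_components)
scikit-learn can also do this directly by passing a float, e.g. PCA(n_components=0.95).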
Finance: Risk factor analysis using PCA
Healthcare: Disease classification using LDA
Computer Vision: Image recognition and compression
NLP: Topic modeling and text classification
Dimensionality reduction is a critical step in modern machine learning pipelines. PCA and LDA help simplify high-dimensional datasets, reduce computational cost, and improve model performance.
Choosing the right technique depends on your data type and goal—use PCA for unsupervised analysis and LDA for supervised classification tasks.