HandsonML 8. Dimensionality Reduction

1. Problems with millions of features

  1. slow training

  2. higher risk of overfitting

1.1. Solution = dimension reduction

  1. bad: information loss

  2. bad: complex pipeline

  3. good: speed up training

  4. good: data visualization (plot in 2-D and spot patterns such as clusters)

There is no guarantee that dropping dimensions only filters out noise and unnecessary details, or that it leads to a better result.

1.2. DR techniques

  1. Projection

  2. Manifold learning

  3. PCA

  4. Preserving distances / local relationships (LLE, t-SNE)

1.3. High dimension vs. low dimension

For random points in a unit hypercube:

| | Low dimension | High dimension |
| --- | --- | --- |
| random point near a border | 0.4% (2-D) | > 99.9% |
| average distance between 2 random points | 0.52 (2-D) | 408 (1,000,000-D) |

High-dimensional datasets are therefore:

(1) sparse: instances are far away from each other (see the distance sketch below)

(2) at higher risk of overfitting.
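A quick way to check the distance numbers above: a minimal NumPy sketch that samples random point pairs in a unit hypercube for a few dimensionalities (the pair counts are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 1_000, 1_000_000):
    n_pairs = 1_000 if d <= 1_000 else 10  # fewer pairs in 1M-D to keep memory low
    dists = [np.linalg.norm(rng.random(d) - rng.random(d)) for _ in range(n_pairs)]
    print(f"d = {d:>9,}: mean distance ~ {np.mean(dists):.2f}")

# Expect roughly 0.52 for 2-D and about 408 (~ sqrt(d / 6)) for 1,000,000-D.
```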

2. Projection

(1) Training instances often lie on (or close to) a much lower-dimensional subspace of the high-dimensional space; project them onto it.

(2) Problem: projection can squash the data together when the distribution is twisted (rolled up).

3. Manifold Learning

Model the manifold the data lies on and unroll it instead of projecting; you need to understand the data distribution first (see the Swiss-roll sketch below).
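For example, the Swiss roll (a standard scikit-learn toy dataset, used here only as an illustration) is a 2-D sheet rolled up in 3-D: naively projecting it onto a plane squashes its layers together, while manifold learning tries to unroll it. A minimal sketch:

```python
from sklearn.datasets import make_swiss_roll

# X: (1000, 3) points lying on a rolled-up 2-D sheet; t: position along the roll
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# Naive projection: dropping the 3rd axis squashes different layers onto each other
X_squashed = X[:, :2]
```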

4. PCA (principal component analysis)

(1) Identify the hyperplane that lies closest to the data.

(2) Project the data onto it, preserving as much variance as possible.

4.1. Preserving maximum variance -> losing less information

Find the axis that preserves the maximum variance, then the orthogonal axis with the largest remaining variance (C1 & C2 in the book's figure).

PCs = C1 (maximum variance) & C2 (largest remaining variance)

4.2. SVD, singular value decomposition

How to find the PCs? => singular value decomposition (SVD)

X is the training set.

$X = U \cdot \Sigma \cdot V^{T}$
$V = \left[ PC_{1}, PC_{2}, PC_{3}, \cdots \right]$
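A minimal NumPy sketch of obtaining the PCs via SVD; X here is just a small random stand-in for the training set, and PCA assumes the data is centered:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))          # stand-in training set (m=100, n=3)

X_centered = X - X.mean(axis=0)        # PCA assumes zero-centered data
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

c1, c2 = Vt[0], Vt[1]                  # rows of Vt (= columns of V) are the PCs
```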

Projecting the training set down to d dimensions.

Wd contains the first d columns of V.

$X_{d\_proj} = X \cdot W_{d}$
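Continuing the sketch above, projecting down to d = 2 dimensions; scikit-learn's PCA class does the centering, SVD, and projection in one step (the values are illustrative):

```python
W2 = Vt[:2].T                          # W_d: first d = 2 columns of V
X2D = X_centered @ W2                  # X_{d_proj} = X · W_d

# Equivalent with scikit-learn (it centers the data automatically):
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D_sklearn = pca.fit_transform(X)
```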

4.3. Choosing the right number of dimensions

  1. for data visualization: choose 2 or 3

  2. otherwise: choose d so that the cumulative explained variance ratio >= 95% (see the sketch below)
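A sketch of both options with scikit-learn, assuming X is the stand-in training set from the earlier sketches; passing a float between 0 and 1 as n_components asks PCA to keep enough components for that fraction of variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Option A: inspect the cumulative explained variance ratio
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1      # smallest d reaching 95% variance

# Option B: let PCA pick d directly
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```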

4.4. Reconstruction error

mean squared distance between the original data & the reconstructed data

$X_{reconstructed} = X_{d\_proj} \cdot W_{d}^{T}$
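A sketch of reconstruction and its error, assuming X and pca from the sketches above; scikit-learn's inverse_transform applies the equation above and adds the mean back:

```python
import numpy as np

X_reduced = pca.transform(X)
X_reconstructed = pca.inverse_transform(X_reduced)

# Mean squared distance between original and reconstructed instances
reconstruction_error = np.mean(np.sum((X - X_reconstructed) ** 2, axis=1))
```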

4.5. Other PCA variants & related techniques

randomized PCA: reduces computational complexity by quickly finding an approximation of the first d PCs
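Minimal sketch (svd_solver is the real scikit-learn option; the values are illustrative):

```python
from sklearn.decomposition import PCA

rnd_pca = PCA(n_components=2, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)   # fast approximation of the first d PCs
```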

incremental PCA: avoids feeding the whole training set at once by feeding mini-batches (useful when the data doesn't fit in memory)
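Minimal sketch (the batch count is arbitrary):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

inc_pca = IncrementalPCA(n_components=2)
for X_batch in np.array_split(X, 10):  # feed 10 mini-batches instead of all of X
    inc_pca.partial_fit(X_batch)
X_reduced = inc_pca.transform(X)
```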

Kernel PCA: implicitly maps instances into a very high-dimensional feature space, so linear PCA there becomes a nonlinear projection in the original space
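Minimal sketch (kernel and gamma are illustrative hyperparameters, not recommended values):

```python
from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
```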

LLE (locally linear embedding): (1) measures how each training instance linearly relates to its closest neighbors (2) looks for a low-dimensional representation where these local relationships are best preserved (3) good at unrolling twisted manifolds (4) scales poorly to large datasets
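Minimal sketch (n_neighbors controls how "local" the preserved relationships are; the value is illustrative):

```python
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)
```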

t-SNE (t-distributed stochastic neighbor embedding): keeps similar instances close and dissimilar instances apart; mostly used to visualize high-dimensional data in 2-D or 3-D.
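Minimal sketch (perplexity is an illustrative value):

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)
```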

4.6. How to choose hyperparameters for unsupervised learning (e.g. kPCA)

  1. Unsupervised learning = no obvious performance measure

  2. But dimensionality reduction is usually the input to another supervised task, so either:

    1. Grid search for the params that lead to the best performance on that supervised task (see the sketch below).

    2. Pick the params with the lowest reconstruction error.
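A sketch of the grid-search option: wrap kPCA and a simple classifier in a pipeline and let GridSearchCV pick the kernel hyperparameters that give the best supervised score. The dataset, classifier, and parameter grid here are illustrative placeholders, not values from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, random_state=42)  # placeholder data

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "kpca__gamma": [0.01, 0.03, 0.1, 0.3],
    "kpca__kernel": ["rbf", "sigmoid"],
}
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)   # kPCA params that maximize the supervised score
```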
