HandsonML 8. Dimension Reduction

1. Problems with millions of features

  1. slow training

  2. easy overfitting

1.1. Solution = dimension reduction

  1. bad: information loss

  2. bad: complex pipeline

  3. good: speed up training

  4. good: data visualization (plot in 2-D to spot patterns such as clusters)


There is no guarantee that dimensionality reduction filters out noise or unnecessary details, or that it leads to a better result.

1.2. DR techniques

  1. Projection

  2. Manifold learning

  3. PCA

  4. Distance preservation (LLE, t-SNE)

1.3. High dimension vs. low dimension

|                           | Low dimension  | High dimension |
| ------------------------- | -------------- | -------------- |
| on border                 | 0.4%           | 99.9%          |
| distance between 2 points | 0.52 (for 2-D) | 408 (for 1M-D) |


High-dimensional datasets:

(1) are sparse: instances are far away from each other;

(2) carry a higher risk of overfitting.
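
A quick NumPy sketch (not from the book) to reproduce the distance figures in the table above; the helper name and sample counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)

def avg_pairwise_distance(dim, n_pairs=1_000):
    """Average Euclidean distance between random point pairs in a unit hypercube."""
    a = rng.random((n_pairs, dim))
    b = rng.random((n_pairs, dim))
    return np.linalg.norm(a - b, axis=1).mean()

print(avg_pairwise_distance(2))        # ~0.52
print(avg_pairwise_distance(10_000))   # ~40.8 -- grows roughly like sqrt(dim / 6)
# Extrapolating, a 1,000,000-D unit hypercube gives an average distance of ~408.
```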

2. Projection


(1) Works when training instances lie on (or close to) a lower-dimensional subspace of the high-dimensional space.

(2) Problem: projection can squash data together when the distribution is twisted.

3. Manifold Learning

Manifold learning models the lower-dimensional manifold that the training instances lie on; you need to understand the data distribution first.

4. PCA (principal component analysis)


(1) Identify the hyperplane that lies closest to the data, then project the data onto it.

(2) Preserve as much variance as possible.

4.1. Preserving maximum variance -> less information loss

Find the axis that preserves the maximum variance, then the next orthogonal axis with the largest remaining variance, and so on (C1 & C2 in the book's 2-D example).

PCs = C1 (axis of maximum variance) & C2 (orthogonal axis with the largest remaining variance)

4.2. SVD, singular value decomposition

How to find the PCs? => SVD (singular value decomposition)

X is the training set.

X = U \cdot \Sigma \cdot V^{T}
V = \left[ PC_{1}, PC_{2}, PC_{3}, \cdots \right]

Projecting the training set down to d dimensions:

W_d contains the first d columns of V.

X_{d\_proj} = X \cdot W_{d}
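
A minimal NumPy sketch of the two equations above, using random placeholder data; note that PCA assumes the data is centered around the origin:

```python
import numpy as np

X = np.random.rand(100, 5)           # placeholder training set, shape (m, n)
X_centered = X - X.mean(axis=0)      # PCA assumes the data is centered

# SVD: X = U . Sigma . V^T; the columns of V are the principal components
U, s, Vt = np.linalg.svd(X_centered)

d = 2
W_d = Vt.T[:, :d]                    # W_d = first d columns of V
X_d_proj = X_centered @ W_d          # project the training set down to d dimensions
```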

4.3. Choosing the right number of dimensions

  1. for data visualization: select 2 or 3 dimensions

  2. otherwise: choose d so that the cumulative explained variance >= 95% (see the sketch below)
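
A scikit-learn sketch of both approaches; the digits dataset is just a stand-in training set:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data               # stand-in training set

# Option 1: inspect the cumulative explained variance ratio
pca = PCA().fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1    # smallest d reaching 95% of the variance

# Option 2: let scikit-learn pick d directly from the target ratio
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print(d, pca_95.n_components_)       # both report the same number of dimensions
```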

4.4. Reconstruction error

mean squared distance between the original data & the reconstructed data

X_{reconstructed} = X_{d\_proj} \cdot W_{d}^{T}
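
A sketch of computing the reconstruction error with scikit-learn's PCA.inverse_transform; again the digits dataset is only a stand-in:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                            # stand-in training set

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)                  # compress
X_recovered = pca.inverse_transform(X_reduced)    # decompress: X_d_proj . W_d^T (+ mean)

# reconstruction error = mean squared distance between original and recovered data
reconstruction_error = np.mean(np.sum((X - X_recovered) ** 2, axis=1))
print(reconstruction_error)
```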

4.5. Other PCA variants & related techniques

Randomized PCA: reduces computational complexity by approximating the first d principal components.

Incremental PCA: avoids feeding the whole training set at once; feeds it in mini-batches instead.
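
A sketch of both variants with scikit-learn; the dataset, number of components, and batch count are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, IncrementalPCA

X = load_digits().data                        # stand-in training set

# Randomized PCA: quickly approximates the first d principal components
rnd_pca = PCA(n_components=32, svd_solver="randomized", random_state=42)
X_rnd = rnd_pca.fit_transform(X)

# Incremental PCA: feed the training set in mini-batches instead of all at once
inc_pca = IncrementalPCA(n_components=32)
for X_batch in np.array_split(X, 10):
    inc_pca.partial_fit(X_batch)
X_inc = inc_pca.transform(X)
```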

Kernel PCA: implicitly maps instances into a very high-dimensional feature space, so linear PCA in that space corresponds to nonlinear dimensionality reduction in the original space.
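
A minimal kernel PCA sketch with an RBF kernel; the Swiss-roll data and the gamma value are illustrative assumptions:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, _ = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

# RBF kernel PCA: implicit nonlinear mapping to a high-dimensional feature space,
# then linear PCA in that feature space
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
```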

LLE (Locally Linear Embedding): (1) measures how each training instance linearly relates to its closest neighbors, (2) looks for a low-dimensional representation where these local relationships are best preserved, (3) good at unrolling twisted manifolds, (4) scales poorly to large datasets.
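
A minimal LLE sketch on the same kind of Swiss-roll data; the neighbor count is an illustrative choice:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

# LLE: model how each instance linearly relates to its nearest neighbors,
# then find a low-dimensional embedding that preserves those local relationships
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)
```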

t-SNE (t-Distributed Stochastic Neighbor Embedding): keeps similar instances close and dissimilar instances apart; mostly used for visualization.
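
A minimal t-SNE sketch for 2-D visualization; the dataset is a stand-in:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data                        # stand-in dataset

# t-SNE: reduce to 2-D for visualization, keeping similar instances close together
# and dissimilar instances apart
tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)
```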

4.6. How to choose hyper-params for Unsupervised learning (e.g. kPCA)

  1. Unsupervised learning = no obvious performance measure

  2. Unsupervised learning = often a preprocessing step whose output feeds another supervised learning task

  1. Grid search to select the params that lead to the best performance on the downstream supervised task (see the sketch after this list).

  2. Or pick the params with the lowest reconstruction error.
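
A sketch of approach 1: treat kPCA as a preprocessing step in a pipeline and grid-search its hyper-params against the downstream classifier's score. The classifier, dataset, and grid values are illustrative assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)           # stand-in supervised task

# kPCA as a preprocessing step; its hyper-params are tuned for the downstream classifier
clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "kpca__gamma": [0.01, 0.03, 0.05],
    "kpca__kernel": ["rbf", "sigmoid"],
}
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```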
