HandsonML 8. Dimensionality Reduction

1. Problems with millions of features

  1. slow training

  2. higher risk of overfitting

1.1. Solution = dimension reduction

  1. bad: information loss

  2. bad: complex pipeline

  3. good: speed up training

  4. good: data visualization (plot in 2-D and spot patterns such as clusters)

There is no guarantee that dropping dimensions only filters out noise and unnecessary details, or that it leads to a better result.

1.2. DR techniques

  1. Projection

  2. Manifold learning

  3. PCA

  4. Preserving distances / local relationships (LLE, t-SNE)

1.3. High dimension vs. low dimension

For random points in a unit hypercube:

| | Low dimension | High dimension |
| --- | --- | --- |
| random point near a border | 0.4% (2-D) | > 99.9% |
| average distance between 2 random points | 0.52 (2-D) | 408 (1,000,000-D) |

High-dimensional datasets are therefore:

(1) sparse: instances are far away from each other (see the distance sketch below)

(2) at higher risk of overfitting.
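A quick way to check the distance numbers above: a minimal NumPy sketch that samples random point pairs in a unit hypercube for a few dimensionalities (the pair counts are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 1_000, 1_000_000):
    n_pairs = 1_000 if d <= 1_000 else 10  # fewer pairs in 1M-D to keep memory low
    dists = [np.linalg.norm(rng.random(d) - rng.random(d)) for _ in range(n_pairs)]
    print(f"d = {d:>9,}: mean distance ~ {np.mean(dists):.2f}")

# Expect roughly 0.52 for 2-D and about 408 (~ sqrt(d / 6)) for 1,000,000-D.
```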

2. Projection

(1) Training instances often lie on (or close to) a much lower-dimensional subspace of the high-dimensional space; project them onto it.

(2) Problem: projection can squash the data together when the distribution is twisted (rolled up).

3. Manifold Learning

Model the manifold the data lies on and unroll it instead of projecting; you need to understand the data distribution first (see the Swiss-roll sketch below).
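For example, the Swiss roll (a standard scikit-learn toy dataset, used here only as an illustration) is a 2-D sheet rolled up in 3-D: naively projecting it onto a plane squashes its layers together, while manifold learning tries to unroll it. A minimal sketch:

```python
from sklearn.datasets import make_swiss_roll

# X: (1000, 3) points lying on a rolled-up 2-D sheet; t: position along the roll
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# Naive projection: dropping the 3rd axis squashes different layers onto each other
X_squashed = X[:, :2]
```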

4. PCA (principal component analysis)

(1) Identify the hyperplane that lies closest to the data.

(2) Project the data onto it, preserving as much variance as possible.

4.1. Preserving maximum variance -> losing less information

Find the axis that preserves the maximum variance, then the orthogonal axis with the largest remaining variance (C1 & C2 in the book's figure).

PCs = C1 (maximum variance) & C2 (largest remaining variance)

4.2. SVD, singular value decomposition

How to find the PCs? => singular value decomposition (SVD)

X is the training set.

$X = U \cdot \Sigma \cdot V^{T}$
$V = \left[ PC_{1}, PC_{2}, PC_{3}, \cdots \right]$
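A minimal NumPy sketch of obtaining the PCs via SVD; X here is just a small random stand-in for the training set, and PCA assumes the data is centered:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))          # stand-in training set (m=100, n=3)

X_centered = X - X.mean(axis=0)        # PCA assumes zero-centered data
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

c1, c2 = Vt[0], Vt[1]                  # rows of Vt (= columns of V) are the PCs
```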

Projecting the training set down to d dimensions.

Wd contains the first d columns of V.

$X_{d\_proj} = X \cdot W_{d}$
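Continuing the sketch above, projecting down to d = 2 dimensions; scikit-learn's PCA class does the centering, SVD, and projection in one step (the values are illustrative):

```python
W2 = Vt[:2].T                          # W_d: first d = 2 columns of V
X2D = X_centered @ W2                  # X_{d_proj} = X · W_d

# Equivalent with scikit-learn (it centers the data automatically):
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D_sklearn = pca.fit_transform(X)
```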

4.3. Choosing the right number of dimensions

  1. for data visualization: choose 2 or 3

  2. otherwise: choose d so that the cumulative explained variance ratio >= 95% (see the sketch below)
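A sketch of both options with scikit-learn, assuming X is the stand-in training set from the earlier sketches; passing a float between 0 and 1 as n_components asks PCA to keep enough components for that fraction of variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Option A: inspect the cumulative explained variance ratio
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1      # smallest d reaching 95% variance

# Option B: let PCA pick d directly
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```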

4.4. Reconstruction error

mean squared distance between the original data & the reconstructed data

$X_{reconstructed} = X_{d\_proj} \cdot W_{d}^{T}$
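A sketch of reconstruction and its error, assuming X and pca from the sketches above; scikit-learn's inverse_transform applies the equation above and adds the mean back:

```python
import numpy as np

X_reduced = pca.transform(X)
X_reconstructed = pca.inverse_transform(X_reduced)

# Mean squared distance between original and reconstructed instances
reconstruction_error = np.mean(np.sum((X - X_reconstructed) ** 2, axis=1))
```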

4.5. Other PCA variants & related techniques

randomized PCA: reduces computational complexity by quickly finding an approximation of the first d PCs
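Minimal sketch (svd_solver is the real scikit-learn option; the values are illustrative):

```python
from sklearn.decomposition import PCA

rnd_pca = PCA(n_components=2, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)   # fast approximation of the first d PCs
```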

incremental PCA: avoids feeding the whole training set at once by feeding mini-batches (useful when the data doesn't fit in memory)
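Minimal sketch (the batch count is arbitrary):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

inc_pca = IncrementalPCA(n_components=2)
for X_batch in np.array_split(X, 10):  # feed 10 mini-batches instead of all of X
    inc_pca.partial_fit(X_batch)
X_reduced = inc_pca.transform(X)
```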

Kernel PCA: implicitly maps instances into a very high-dimensional feature space, so linear PCA there becomes a nonlinear projection in the original space
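Minimal sketch (kernel and gamma are illustrative hyperparameters, not recommended values):

```python
from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
```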

LLE (locally linear embedding): (1) measures how each training instance linearly relates to its closest neighbors (2) looks for a low-dimensional representation where these local relationships are best preserved (3) good at unrolling twisted manifolds (4) scales poorly to large datasets
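Minimal sketch (n_neighbors controls how "local" the preserved relationships are; the value is illustrative):

```python
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)
```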

t-SNE (t-distributed stochastic neighbor embedding): keeps similar instances close and dissimilar instances apart; mostly used to visualize high-dimensional data in 2-D or 3-D.
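Minimal sketch (perplexity is an illustrative value):

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)
```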

4.6. How to choose hyperparameters for unsupervised learning (e.g. kPCA)

  1. Unsupervised learning = no obvious performance measure

  2. But dimensionality reduction is usually the input to another supervised task, so either:

    1. Grid search for the params that lead to the best performance on that supervised task (see the sketch below).

    2. Pick the params with the lowest reconstruction error.
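A sketch of the grid-search option: wrap kPCA and a simple classifier in a pipeline and let GridSearchCV pick the kernel hyperparameters that give the best supervised score. The dataset, classifier, and parameter grid here are illustrative placeholders, not values from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, random_state=42)  # placeholder data

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "kpca__gamma": [0.01, 0.03, 0.1, 0.3],
    "kpca__kernel": ["rbf", "sigmoid"],
}
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)   # kPCA params that maximize the supervised score
```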
