Hands-On ML 8. Dimensionality Reduction
1. Problems with millions of features
slow training
easy overfitting
1.1. Solution = dimension reduction
bad: information loss
bad: complex pipeline
good: speed up training
good: data visualization (plot in 2D to spot patterns such as clusters)
1.2. DR techniques
1.3. High dimension vs. low dimension
probability that a random point lies very close to a border: ~0.4% in a 2-D unit square vs. ~99.9% in a high-dimensional unit hypercube
average distance between 2 random points: ~0.52 in 2-D vs. ~408 in 1,000,000-D
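A quick Monte Carlo sketch (plain NumPy; the pair counts are arbitrary assumptions) to sanity-check the average-distance numbers:

```python
import numpy as np

rng = np.random.default_rng(42)

def avg_distance(dim, n_pairs):
    # average Euclidean distance between random point pairs in the unit hypercube
    a = rng.random((n_pairs, dim))
    b = rng.random((n_pairs, dim))
    return np.linalg.norm(a - b, axis=1).mean()

print(avg_distance(2, 100_000))     # ~0.52
print(avg_distance(1_000_000, 10))  # ~408 (few pairs suffice: distances concentrate in high dims)
```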
2. Projection
3. Manifold Learning
you need to understand the data distribution first
4. PCA (principal component analysis)
4.1. Preserving maximum variance -> losing less information
find the axes that preserve the maximum variance (C1 & C2 in this case)
PCs = C1 (axis with maximum variance) & C2 (axis with the largest remaining variance)
4.2. SVD, singular value decomposition
how to find the PCs? => singular value decomposition (SVD): X = U Σ Vᵀ, where X is the training set and the columns of V are the unit vectors of the principal components
projecting the training set down to d dimensions: X_d-proj = X W_d, where W_d contains the first d columns of V
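A minimal sketch of PCA via NumPy's SVD, mirroring the equations above (X is assumed to be the training set as a 2-D NumPy array):

```python
import numpy as np

# X: training set of shape (m, n); PCA assumes the data is centered
X_centered = X - X.mean(axis=0)

# SVD: X_centered = U @ np.diag(s) @ Vt; the rows of Vt are the principal components
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# W_d = first d columns of V (here d = 2)
W2 = Vt.T[:, :2]

# project the training set down to 2 dimensions
X2D = X_centered @ W2
```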
4.3. Choosing the right number of dimensions
for data visualization = select 2 or 3
otherwise = pick the smallest d whose cumulative explained variance >= 95%
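A sketch with scikit-learn, assuming a training set X_train; explained_variance_ratio_ holds each PC's share of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1   # smallest d preserving >= 95% of the variance

# equivalent shortcut: pass the target variance ratio directly
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
```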
4.4. Reconstruction error
mean squared distance between the original data & the reconstructed data
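A sketch of measuring it with scikit-learn (X_train and the variance target are assumptions); inverse_transform maps the reduced data back to the original space:

```python
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)            # keep ~95% of the variance (assumed target)
X_reduced = pca.fit_transform(X_train)  # compress
X_recovered = pca.inverse_transform(X_reduced)  # decompress back to original space

# mean squared distance between original and reconstructed instances
reconstruction_error = np.mean(np.sum((X_train - X_recovered) ** 2, axis=1))
```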
4.5. Other PCA variants & related DR techniques
randomized PCA: stochastic approximation of the first d PCs; reduces computational complexity
incremental PCA: avoids feeding the whole training set at once; feeds mini-batches instead (see the sketches after this list)
Kernel PCA: implicitly maps instances into a very high-dimensional feature space (kernel trick), turning linear PCA into a nonlinear projection
LLE (Locally Linear Embedding): (1) measures how each training instance linearly relates to its closest neighbors, (2) looks for a low-dimensional representation where these local relationships are best preserved, (3) good at unrolling twisted manifolds, (4) scales poorly to large datasets
t-SNE (t-Distributed Stochastic Neighbor Embedding): keeps similar instances close and dissimilar instances apart; mostly used for visualization
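Minimal scikit-learn sketches of the techniques above (X_train and the hyper-parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA, KernelPCA
from sklearn.manifold import LocallyLinearEmbedding, TSNE

# incremental PCA: feed the training set in mini-batches
inc_pca = IncrementalPCA(n_components=2)
for X_batch in np.array_split(X_train, 10):
    inc_pca.partial_fit(X_batch)
X_inc = inc_pca.transform(X_train)

# kernel PCA: nonlinear projection via the kernel trick
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_kpca = rbf_pca.fit_transform(X_train)

# LLE: preserve local linear relationships
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_lle = lle.fit_transform(X_train)

# t-SNE: keep similar instances close, dissimilar apart (mainly for visualization)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_train)
```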
4.6. How to choose hyper-params for unsupervised learning (e.g. kPCA)
unsupervised learning = no obvious performance measure
but DR is usually the input to another supervised learning task, so tune the hyper-params by that downstream task's performance (sketch below)
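A sketch of that idea (X_train/y_train and the parameter ranges are assumptions): grid-search the kPCA hyper-params by the cross-validated score of a downstream classifier.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# pipeline: kPCA dimensionality reduction, then a classifier on the reduced data
clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])

param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"],
}]

# the "performance measure" is the downstream classifier's cross-validated accuracy
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```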