HandsonML 9. Unsupervised Learning

1. Basic

Why unsupervised:

  1. Lots of datasets don't have labels.

  2. Exploit unlabeled data without human labeling

Types:

  1. dimensionality reduction

  2. clustering = identify similar instances & group them together

  3. anomaly detection

  4. density estimation

Examples of clustering applications: semi-supervised learning, customer segmentation, data analysis, anomaly detection, search engines.

2. K-means

Find K centroids and group the instances around them:

  1. place k centroids randomly

  2. repeatedly label the instances (assign each to its nearest centroid)

  3. update each centroid to the mean of its group

  4. repeat until the centroids stop moving (a NumPy sketch follows below)

  1. Good: guaranteed to converge

  2. Good: fast

  3. Bad: can converge to a sub-optimal solution

  4. Bad: K must be specified in advance
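
A minimal NumPy sketch of the loop above (Lloyd's algorithm); the function and variable names are mine, not the book's, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # step 1: place k centroids randomly (here: k distinct training instances)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # step 2: label each instance with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of its group
        # (no empty-cluster handling in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```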

2.1. Improve sub-optimal solutions

Run the algorithm multiple times with different random initializations and keep the best solution (the one with the lowest inertia).
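
In scikit-learn this is just the n_init parameter; the run with the lowest inertia (sum of squared distances from each instance to its closest centroid) wins. A small sketch on toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

# n_init=10: run 10 different random initializations, keep the best solution
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans.fit(X)
print(kmeans.inertia_)  # sum of squared distances to the closest centroids
```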

2.2. K-means as data preprocessing

(TODO: copy the figure from page 251.)
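
The book's preprocessing example around that page pipelines K-means with a classifier: KMeans.transform() replaces each instance by its distances to the k centroids, and the classifier trains on those features. A sketch on the digits dataset, assuming the book's choice of 50 clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# KMeans acts as a transformer here: transform() maps each image to its
# distances to the 50 centroids, and the classifier trains on those features
pipeline = make_pipeline(
    KMeans(n_clusters=50, n_init=10, random_state=42),
    LogisticRegression(max_iter=5000),
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
```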

2.3. K-means in semi-supervised learning

  1. (random pick) supervised learning (train on the first 50 labeled samples) = 83.3% accuracy

  2. k-means to identify 50 clusters (train on the 50 representative images closest to the centroids; see the sketch after this list) = 92.2% accuracy

  3. label propagation = 94.0% accuracy

  4. full labeled training set = 96.9% accuracy
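
A sketch of step 2, following the book's digits example; y_representative stands in for the 50 labels a human would actually provide:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X_train, y_train = load_digits(return_X_y=True)

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_dist = kmeans.fit_transform(X_train)  # (n_samples, k) distances to centroids

# each cluster's representative = the instance closest to its centroid
representative_idx = X_dist.argmin(axis=0)
X_representative = X_train[representative_idx]
y_representative = y_train[representative_idx]  # stand-in for 50 manual labels

# train on just these 50 well-chosen labeled instances
log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_representative, y_representative)
```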

Label Propagation

  1. Label the unlabeled data by giving each instance the label of its cluster's representative (clusters found via k-means).

  2. Good performance because the propagated labels are quite accurate (~99%).
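
Continuing the sketch above (reusing kmeans, y_representative, and X_train), propagation is a single indexing step:

```python
# each instance inherits the label of its cluster's representative
y_propagated = y_representative[kmeans.labels_]

log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_propagated)
```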

Active learning (collaboration with a human expert)

  1. Train a model on the labeled instances gathered so far.

  2. This model makes predictions on all unlabeled instances.

  3. The most uncertain instances (i.e., those whose highest estimated class probability is lowest) are given to the expert to be labeled.

  4. Iterate until the performance improvement stops being worth the labeling effort (see the sketch below).
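
A minimal uncertainty-sampling sketch of one round of this loop; the function name and the choice of classifier are mine, not the book's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(X_labeled, y_labeled, X_unlabeled, n_queries=10):
    # 1. train on the labeled instances gathered so far
    model = LogisticRegression(max_iter=5000).fit(X_labeled, y_labeled)
    # 2. predict on every unlabeled instance
    proba = model.predict_proba(X_unlabeled)
    # 3. most uncertain = lowest top-class probability
    top_class_proba = proba.max(axis=1)
    query_idx = np.argsort(top_class_proba)[:n_queries]
    # hand X_unlabeled[query_idx] to the human expert, then repeat (step 4)
    return model, query_idx
```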

3. DBSCAN

Defines clusters as continuous regions of high density

  1. For each instance, count the number of instances within a small distance ε of it (its ε-neighborhood).

  2. An instance is a core instance when its ε-neighborhood contains at least min_samples instances.

  3. All instances in a core instance's neighborhood belong to the same cluster; neighboring core instances chain into a single cluster.

  1. Two hyperparameters: ε (eps) and min_samples

  2. Robust to outliers (instances that are neither core instances nor in any core instance's neighborhood are flagged as anomalies)

  3. Roughly linear complexity in the number of instances (see the scikit-learn sketch below)

(TODO: add Figure 9-14 from page 257.)
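
A scikit-learn sketch on toy data; instances labeled -1 are the anomalies:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan.fit(X)

print(dbscan.labels_[:10])               # cluster index per instance; -1 = anomaly
print(len(dbscan.core_sample_indices_))  # how many core instances were found
```

Note that DBSCAN has no predict() method; the book's approach is to train a separate classifier (e.g., KNeighborsClassifier) on the core instances to assign new points to clusters.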

4. GMM (Gaussian mixture model)

  1. A probabilistic model

  2. GMM assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.
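
A scikit-learn sketch; GaussianMixture estimates the weights, means, and covariances of the Gaussians via the EM algorithm (the blob data here is my own toy example):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(X)

print(gm.weights_)              # mixture weights of the 3 Gaussians
print(gm.means_)                # estimated means (cluster centers)
print(gm.predict_proba(X[:3]))  # soft clustering: probability per cluster
print(gm.score_samples(X[:3]))  # log density, usable for anomaly detection
```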
